Azure AKS - Failed to provision volume with StorageClass "default"

I wanted to deploy Concourse CI into my Azure AKS cluster, which I hadn’t touched for a while, and encountered the error Failed to provision volume with StorageClass "default".

This post covers my troubleshooting steps and, ultimately, the resolution: updating/rotating the AKS cluster's service principal credentials.

I followed the Helm install steps from the Concourse CI chart repository: https://github.com/concourse/concourse-chart.

Troubleshooting Steps

To start, I checked the status of the deployed pods: kubectl get pods. Most of them were stuck in Pending.

NAMESPACE     NAME                                         READY   STATUS    RESTARTS   AGE
default       concourse-ci-postgresql-0                    0/1     Pending   0          5m11s
default       concourse-ci-web-d6bc9f97d-pr5zd             0/1     Running   1          5m11s
default       concourse-ci-worker-0                        0/1     Pending   0          5m11s
default       concourse-ci-worker-1                        0/1     Pending   0          5m11s

I investigated the logs from pod concourse-ci-web-d6bc9f97d-pr5zd: kubectl logs concourse-ci-web-d6bc9f97d-pr5zd.

These showed an error trying to reach the database: "error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host"

{"timestamp":"2020-12-07T10:57:08.526432188Z","level":"info","source":"atc","message":"atc.cmd.start","data":{"session":"1"}}
{"timestamp":"2020-12-07T10:57:08.561956604Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
{"timestamp":"2020-12-07T10:57:13.671634392Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
{"timestamp":"2020-12-07T10:57:18.707090349Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
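This DNS error is usually a symptom rather than the root cause: the postgresql pod is Pending, so its service has no ready endpoints to resolve to. A quick way to confirm that, sketched here using the service name from the error above:

```shell
# Check whether the postgresql service exists and has any ready endpoints.
# A Pending backing pod means the endpoints list will be empty, which is
# why the web pod's lookup of concourse-ci-postgresql fails.
kubectl get svc concourse-ci-postgresql
kubectl get endpoints concourse-ci-postgresql
```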

Due to the database connection error above, I looked into the concourse-ci-postgresql-0 pod. This time I used describe to see what events were being produced: kubectl describe pod concourse-ci-postgresql-0.

The top event shows that the pod is failing to schedule because of unbound PersistentVolumeClaims: Warning FailedScheduling 11m default-scheduler running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims.

Name:           concourse-ci-postgresql-0
Namespace:      default
Priority:       0
Node:           <none>
Labels:         app.kubernetes.io/instance=concourse-ci
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=postgresql
                controller-revision-hash=concourse-ci-postgresql-5c7bfbd7bb
                helm.sh/chart=postgresql-9.2.0
                role=master
                statefulset.kubernetes.io/pod-name=concourse-ci-postgresql-0
Annotations:    <none>
Status:         Pending
IP:
IPs:            <none>
Controlled By:  StatefulSet/concourse-ci-postgresql
Containers:
  concourse-ci-postgresql:
    Image:      docker.io/bitnami/postgresql:11.8.0-debian-10-r76
    Port:       5432/TCP
    Host Port:  0/TCP
    Requests:
      cpu:      250m
      memory:   256Mi
    Liveness:   exec [/bin/sh -c exec pg_isready -U "concourse" -d "dbname=concourse" -h 127.0.0.1 -p 5432] delay=30s timeout=5s period=10s #success=1 #failure=6
    Readiness:  exec [/bin/sh -c -e exec pg_isready -U "concourse" -d "dbname=concourse" -h 127.0.0.1 -p 5432
[ -f /opt/bitnami/postgresql/tmp/.initialized ] || [ -f /bitnami/postgresql/.initialized ]
] delay=5s timeout=5s period=10s #success=1 #failure=6
    Environment:
      BITNAMI_DEBUG:           false
      POSTGRESQL_PORT_NUMBER:  5432
      POSTGRESQL_VOLUME_DIR:   /bitnami/postgresql
      PGDATA:                  /bitnami/postgresql/data
      POSTGRES_USER:           concourse
      POSTGRES_PASSWORD:       <set to the key 'postgresql-password' in secret 'concourse-ci-postgresql'>  Optional: false
      POSTGRESQL_ENABLE_LDAP:  no
      POSTGRESQL_ENABLE_TLS:   no
    Mounts:
      /bitnami/postgresql from data (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-zpr9z (ro)
  Type           Status
  PodScheduled   False
Volumes:
  data:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  data-concourse-ci-postgresql-0
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  default-token-zpr9z:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-zpr9z
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  11m   default-scheduler  running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims
  Warning  FailedScheduling  11m   default-scheduler  running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims

The next step was to have a look at the PersistentVolumeClaims: kubectl get pvc.

Again, like the pods, these were all Pending.

NAME                                       STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
concourse-work-dir-concourse-ci-worker-0   Pending                                      default        14m
concourse-work-dir-concourse-ci-worker-1   Pending                                      default        14m
data-concourse-ci-postgresql-0             Pending                                      default        14m

I investigated one of the PVCs: kubectl describe pvc concourse-work-dir-concourse-ci-worker-0.

The events show the problem quite clearly: provisioning fails with an HTTP 401 because the cloud provider cannot refresh its token: Failed to provision volume with StorageClass "default": Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://localhost:7788/subscriptions/8af15f99-1234-43e2-8876-55bd7e6a727b/resourceGroups/mc_development_uksouth/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d?api-version=2019-07-01

Name:          concourse-work-dir-concourse-ci-worker-0
Namespace:     default
StorageClass:  default
Status:        Pending
Volume:
Labels:        app=concourse-ci-worker
               release=concourse-ci
Annotations:   volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode:    Filesystem
Mounted By:    concourse-ci-worker-0
Events:
  Type     Reason              Age                  From                         Message
  ----     ------              ----                 ----                         -------
  Warning  ProvisioningFailed  15m                  persistentvolume-controller  Failed to provision volume with StorageClass "default": Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://localhost:7788/subscriptions/8af15f99-1234-43e2-8876-55bd7e6a727b/resourceGroups/mc_development_uksouth/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d?api-version=2019-07-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS7000222: The provided client secret keys are expired. Visit the Azure Portal to create new keys for your app, or consider using certificate credentials for added security: https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials\r\nTrace ID: 8f13a52c-9f01-4012-9da2-4127e054cf00\r\nCorrelation ID: 054bec33-1b44-456f-b91b-4cb8bc4967e1\r\nTimestamp: 2020-12-07 10:55:24Z","error_codes":[7000222],"timestamp":"2020-12-07 10:55:24Z","trace_id":"8f13a52c-9f01-4012-9da2-4127e054cf00","correlation_id":"054bec33-1b44-456f-b91b-4cb8bc4967e1","error_uri":"https://login.microsoftonline.com/error?code=7000222"}

The next step was to resolve the storage provisioning issue. The key part of the full error is AADSTS7000222: The provided client secret keys are expired, which points at expired service principal credentials. Searching on this led me to the following GitHub issue, which showed the same error we were receiving: https://github.com/Azure/AKS/issues/222.
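Before making changes, you can confirm the diagnosis by checking when the cluster's service principal secret expires. A sketch of that check; the resource group and cluster names here are placeholders for your own:

```shell
# Fetch the client ID of the service principal the cluster runs as,
# then list the expiry dates of its credentials. An endDate in the past
# matches the AADSTS7000222 error seen above.
SP_ID=$(az aks show --resource-group development --name my-aks-cluster \
  --query servicePrincipalProfile.clientId --output tsv)
az ad sp credential list --id "$SP_ID" --query "[].endDate" --output tsv
```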

This GitHub issue included a comment linking to the Microsoft documentation page for updating an AKS cluster's credentials: https://docs.microsoft.com/en-us/azure/aks/update-credentials.
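For reference, the core of the documented fix is to reset the service principal's secret and then hand the new secret to the cluster. A sketch of those steps; the resource group and cluster names are placeholders, and the linked documentation has the current syntax:

```shell
# Get the client ID of the cluster's service principal.
SP_ID=$(az aks show --resource-group development --name my-aks-cluster \
  --query servicePrincipalProfile.clientId --output tsv)

# Reset the service principal's secret and capture the new value.
SP_SECRET=$(az ad sp credential reset --name "$SP_ID" \
  --query password --output tsv)

# Update the cluster with the new credentials. This rolls the new secret
# out to the nodes and can take several minutes to complete.
az aks update-credentials --resource-group development --name my-aks-cluster \
  --reset-service-principal \
  --service-principal "$SP_ID" \
  --client-secret "$SP_SECRET"
```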

After completing the documented steps, I described the same PVC as earlier to check on its status: kubectl describe pvc concourse-work-dir-concourse-ci-worker-0. We can now see that the volume has been provisioned: Successfully provisioned volume pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d using kubernetes.io/azure-disk

Name:          concourse-work-dir-concourse-ci-worker-0
Namespace:     default
StorageClass:  default
Status:        Bound
Volume:        pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d
Labels:        app=concourse-ci-worker
               release=concourse-ci
Annotations:   pv.kubernetes.io/bind-completed: yes
               pv.kubernetes.io/bound-by-controller: yes
               volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      20Gi
Access Modes:  RWO
VolumeMode:    Filesystem
Mounted By:    concourse-ci-worker-0
Events:
  Type     Reason                 Age                   From                         Message
  ----     ------                 ----                  ----                         -------
  Normal   ProvisioningSucceeded  66s                   persistentvolume-controller  Successfully provisioned volume pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d using kubernetes.io/azure-disk

Checking the status of the pods again, we can see they are now Running/Creating: kubectl get pod.

NAME                               READY   STATUS              RESTARTS   AGE
concourse-ci-postgresql-0          0/1     ContainerCreating   0          46m
concourse-ci-web-d6bc9f97d-pr5zd   0/1     Running             14         46m
concourse-ci-worker-0              1/1     Running             0          46m
concourse-ci-worker-1              0/1     Pending             0          46m

References

https://github.com/Azure/AKS/issues/222
https://docs.microsoft.com/en-us/azure/aks/update-credentials

Published 7 Dec 2020

Automation Consultant, currently working at Xtravirt. Interested in all things automation/devops related.
Sam Perrin on Twitter