I wanted to deploy Concourse CI into my Azure AKS cluster, which I hadn’t touched for a while, and encountered the error Failed to provision volume with StorageClass "default".
This post will cover my troubleshooting steps and, ultimately, the resolution, which was to update/rotate my Azure AKS service principal credentials.
I followed the Helm install steps from the Concourse CI chart repository: https://github.com/concourse/concourse-chart.
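For completeness, the install itself was just the standard commands from that README. The repo URL and release name below (concourse-ci, which matches the pod names later in this post) are an illustrative sketch rather than an exact transcript of what I ran:

```bash
# Add the Concourse chart repository and install the chart
# (release name "concourse-ci" matches the pod names seen below)
helm repo add concourse https://concourse-charts.storage.googleapis.com/
helm repo update
helm install concourse-ci concourse/concourse
```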
# Troubleshooting Steps
To start, I checked the status of the deployed pods; the PostgreSQL and worker pods were all stuck in Pending: kubectl get pods.
NAMESPACE NAME READY STATUS RESTARTS AGE
default concourse-ci-postgresql-0 0/1 Pending 0 5m11s
default concourse-ci-web-d6bc9f97d-pr5zd 0/1 Running 1 5m11s
default concourse-ci-worker-0 0/1 Pending 0 5m11s
default concourse-ci-worker-1 0/1 Pending 0 5m11s
I investigated the logs from the concourse-ci-web-d6bc9f97d-pr5zd pod: kubectl logs concourse-ci-web-d6bc9f97d-pr5zd.
These showed an error trying to reach the database: "error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host".
{"timestamp":"2020-12-07T10:57:08.526432188Z","level":"info","source":"atc","message":"atc.cmd.start","data":{"session":"1"}}
{"timestamp":"2020-12-07T10:57:08.561956604Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
{"timestamp":"2020-12-07T10:57:13.671634392Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
{"timestamp":"2020-12-07T10:57:18.707090349Z","level":"error","source":"atc","message":"atc.db.failed-to-open-db-retrying","data":{"error":"dial tcp: lookup concourse-ci-postgresql on 10.0.0.10:53: no such host","session":"3"}}
Due to the database connection error above, I looked into the concourse-ci-postgresql-0 pod. This time I did a describe to see what Events were being produced: kubectl describe pod concourse-ci-postgresql-0.
The top event shows us that it is having issues with scheduling because it cannot bind its PersistentVolumeClaims: Warning FailedScheduling 11m default-scheduler running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims.
Name: concourse-ci-postgresql-0
Namespace: default
Priority: 0
Node: <none>
Labels: app.kubernetes.io/instance=concourse-ci
app.kubernetes.io/managed-by=Helm
app.kubernetes.io/name=postgresql
controller-revision-hash=concourse-ci-postgresql-5c7bfbd7bb
helm.sh/chart=postgresql-9.2.0
role=master
statefulset.kubernetes.io/pod-name=concourse-ci-postgresql-0
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: StatefulSet/concourse-ci-postgresql
Containers:
concourse-ci-postgresql:
Image: docker.io/bitnami/postgresql:11.8.0-debian-10-r76
Port: 5432/TCP
Host Port: 0/TCP
Requests:
cpu: 250m
memory: 256Mi
Liveness: exec [/bin/sh -c exec pg_isready -U "concourse" -d "dbname=concourse" -h 127.0.0.1 -p 5432] delay=30s timeout=5s period=10s #success=1 #failure=6
Readiness: exec [/bin/sh -c -e exec pg_isready -U "concourse" -d "dbname=concourse" -h 127.0.0.1 -p 5432
[ -f /opt/bitnami/postgresql/tmp/.initialized ] || [ -f /bitnami/postgresql/.initialized ]
] delay=5s timeout=5s period=10s #success=1 #failure=6
Environment:
BITNAMI_DEBUG: false
POSTGRESQL_PORT_NUMBER: 5432
POSTGRESQL_VOLUME_DIR: /bitnami/postgresql
PGDATA: /bitnami/postgresql/data
POSTGRES_USER: concourse
POSTGRES_PASSWORD: <set to the key 'postgresql-password' in secret 'concourse-ci-postgresql'> Optional: false
POSTGRESQL_ENABLE_LDAP: no
POSTGRESQL_ENABLE_TLS: no
Mounts:
/bitnami/postgresql from data (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-zpr9z (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
data:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: data-concourse-ci-postgresql-0
ReadOnly: false
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: 1Gi
default-token-zpr9z:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-zpr9z
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 11m default-scheduler running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims
Warning FailedScheduling 11m default-scheduler running "VolumeBinding" filter plugin for pod "concourse-ci-postgresql-0": pod has unbound immediate PersistentVolumeClaims
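When several pods are stuck like this, it can be quicker to pull the warning events for the whole namespace in one go rather than describing each pod individually, for example:

```bash
# List recent warning events, oldest first, to spot FailedScheduling across all pods
kubectl get events --field-selector type=Warning --sort-by=.lastTimestamp
```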
The next step was to have a look at the PersistentVolumeClaims: kubectl get pvc.
Again, like the pods, these were all Pending.
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
concourse-work-dir-concourse-ci-worker-0 Pending default 14m
concourse-work-dir-concourse-ci-worker-1 Pending default 14m
data-concourse-ci-postgresql-0 Pending default 14m
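Before digging into an individual claim, it is also worth confirming that the default StorageClass exists and which provisioner it points at (on AKS this is the Azure disk provisioner):

```bash
# Check which StorageClasses exist and what provisions the "default" one
kubectl get storageclass
kubectl describe storageclass default
```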
I investigated one of the PVCs: kubectl describe pvc concourse-work-dir-concourse-ci-worker-0.
Within the events we can see quite clearly what the problem is: Failed to provision volume with StorageClass "default": Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://localhost:7788/subscriptions/8af15f99-1234-43e2-8876-55bd7e6a727b/resourceGroups/mc_development_uksouth/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d?api-version=2019-07-01
Name: concourse-work-dir-concourse-ci-worker-0
Namespace: default
StorageClass: default
Status: Pending
Volume:
Labels: app=concourse-ci-worker
release=concourse-ci
Annotations: volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers: [kubernetes.io/pvc-protection]
Capacity:
Access Modes:
VolumeMode: Filesystem
Mounted By: concourse-ci-worker-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ProvisioningFailed 15m persistentvolume-controller Failed to provision volume with StorageClass "default": Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: Retriable: false, RetryAfter: 0s, HTTPStatusCode: 401, RawError: azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to http://localhost:7788/subscriptions/8af15f99-1234-43e2-8876-55bd7e6a727b/resourceGroups/mc_development_uksouth/providers/Microsoft.Compute/disks/kubernetes-dynamic-pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d?api-version=2019-07-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {"error":"invalid_client","error_description":"AADSTS7000222: The provided client secret keys are expired. Visit the Azure Portal to create new keys for your app, or consider using certificate credentials for added security: https://docs.microsoft.com/azure/active-directory/develop/active-directory-certificate-credentials\r\nTrace ID: 8f13a52c-9f01-4012-9da2-4127e054cf00\r\nCorrelation ID: 054bec33-1b44-456f-b91b-4cb8bc4967e1\r\nTimestamp: 2020-12-07 10:55:24Z","error_codes":[7000222],"timestamp":"2020-12-07 10:55:24Z","trace_id":"8f13a52c-9f01-4012-9da2-4127e054cf00","correlation_id":"054bec33-1b44-456f-b91b-4cb8bc4967e1","error_uri":"https://login.microsoftonline.com/error?code=7000222"}
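The AADSTS7000222 error in the event above points at an expired client secret on the service principal the cluster uses to talk to Azure. If you want to confirm that from the Azure side before rotating anything, something along these lines should list the credentials and their expiry dates (the resource group and cluster names here are placeholders):

```bash
# Find the client ID of the service principal the AKS cluster uses
SP_ID=$(az aks show --resource-group myResourceGroup --name myAKSCluster \
  --query servicePrincipalProfile.clientId -o tsv)

# List that service principal's credentials and their end dates
az ad sp credential list --id "$SP_ID" -o table
```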
The next step was to resolve the storage provisioning issue. Using information from the error above, and some Googling, I came across the following GitHub issue, which showed the same error we were receiving: https://github.com/Azure/AKS/issues/222.
That GitHub issue included a comment linking to the following Microsoft documentation page for updating AKS credentials: https://docs.microsoft.com/en-us/azure/aks/update-credentials.
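In short, the documented procedure resets the service principal's client secret and then hands the new secret to the cluster. A rough sketch, reusing the SP_ID captured earlier (note that newer versions of the az CLI take --id rather than --name on the reset command):

```bash
# Generate a new client secret for the cluster's service principal
SP_SECRET=$(az ad sp credential reset --name "$SP_ID" --query password -o tsv)

# Update the AKS cluster to use the new secret
az aks update-credentials \
  --resource-group myResourceGroup \
  --name myAKSCluster \
  --reset-service-principal \
  --service-principal "$SP_ID" \
  --client-secret "$SP_SECRET"
```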
After completing the steps provided, I checked the same PVC as earlier to see its status: kubectl describe pvc concourse-work-dir-concourse-ci-worker-0. We can now see that the volume has been provisioned: Successfully provisioned volume pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d using kubernetes.io/azure-disk.
Name: concourse-work-dir-concourse-ci-worker-0
Namespace: default
StorageClass: default
Status: Bound
Volume: pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d
Labels: app=concourse-ci-worker
release=concourse-ci
Annotations: pv.kubernetes.io/bind-completed: yes
pv.kubernetes.io/bound-by-controller: yes
volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/azure-disk
Finalizers: [kubernetes.io/pvc-protection]
Capacity: 20Gi
Access Modes: RWO
VolumeMode: Filesystem
Mounted By: concourse-ci-worker-0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ProvisioningSucceeded 66s persistentvolume-controller Successfully provisioned volume pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d using kubernetes.io/azure-disk
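For extra confirmation, the dynamically provisioned PersistentVolume itself should now exist and be Bound to the claim:

```bash
# Check the dynamically provisioned volume backing the claim
kubectl get pv pvc-f52878cf-5fb4-4de5-82f0-e9b284b29a9d
```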
Checking the status of the pods again, we can see they are now Running or creating their containers: kubectl get pod.
NAME READY STATUS RESTARTS AGE
concourse-ci-postgresql-0 0/1 ContainerCreating 0 46m
concourse-ci-web-d6bc9f97d-pr5zd 0/1 Running 14 46m
concourse-ci-worker-0 1/1 Running 0 46m
concourse-ci-worker-1 0/1 Pending 0 46m
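From here it was just a case of waiting for the remaining pods to catch up, which can be watched until everything reports Running:

```bash
# Watch the pods until they all come up
kubectl get pods --watch
```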
# References
https://github.com/Azure/AKS/issues/222
https://docs.microsoft.com/en-us/azure/aks/update-credentials