I recently had to bring up an old lab environment that was using an older version of vRA (8.8), but at appliance boot the Kubernetes API-Server containers continuously rebooted and I got an error with kubectl. This post covers some of the troubleshooting steps taken and how it was resolved, ultimately it was caused by an expired certificate.

kube-apiserver container existing and disappearing due to startup issues
kube-apiserver container existing and disappearing due to startup issues

I checked the service status for both kubelet and docker, both were active and running. Next I checked kubelet with journalctl - journalctl -u kubelet -r, the -r flag shows us the newest messages first.

There was a lot of repetitive logs stating the DNS name of our node was not found, but in amongst the noise there was also TLS and certificate related messages, include the path to the pem file that it was trying to use.

Kubelet service status with node not found logs
Kubelet service status with node not found logs

Kubelet service status with TLS/certificate errors in the logs
Kubelet service status with TLS/certificate errors in the logs

I then used an OpenSSL command to check the validity of the certificate: sudo openssl x509 -in /var/lib/kubelet/pki/kubelet-server-current.pem -text -noout | grep -A 2 Validity - this showed us that the certificate had expired.

openssl command to show the certificate validity
openssl command to show the certificate validity

After some digging I came across the following VMware KB article, although the symptoms were not the same it did mention that the certificate had expired after a year, which did match what I was seeing - https://kb.vmware.com/s/article/82378

I am using a single vRA Appliance, so I followed the relevant steps within the KB, however I had some issues, which is the purpose of this post, the adjustments I made are below, along with the exact commands I ran…

  • Take a snapshot of the vRA VM.
  • Locate an etcd backup at /data/etcd-backup/ and copy the selected backup to /root.
    • I ran ls -lh /data/etcd-backup/ - this showed me the available backups in date order (oldest to newest). I picked the most recent backup (not the .part file).
    • I copied this to root with cp /data/etcd-backup/backup-1679068501.db /root.
    • I checked the backup was in root with ls /root.
  • Reset Kubernetes by running vracli cluster leave.
    • This takes a little while to run.
  • Restore the etcd backup in /root by using the /opt/scripts/recover_etcd.sh command. Example: /opt/scripts/recover_etcd.sh --confirm /root/backup-123456789.db.
    • I ran /opt/scripts/recover_etcd.sh --confirm /root/backup-1679068501.db - the same as the example from VMware, updated with my backup file name.
  • Extract VA config from etcd with: kubectl get vaconfig -o yaml --export > /root/vaconfig.yaml.
    • This is the first step where I had issues, initially I had the error Error: unknown flag: - -export, so I removed the flag from the command.
    • I then got the same error I started with The connection to the server vra-k8s.local:6443 was refused - did you specify the right host or port?.
    • I ran through the same initial troubleshooting steps, I checked docker (all good) and kubelet (not started).
    • I started the kubelet service with systemctl start kubelet and kept an eye on the status to ensure it stayed running.
    • I check with docker ps and could see the kube-apiserver container was running so then ran the kubectl get vaconfig -o yaml > /root/vaconfig.yaml command - this time it was successful.
    • I validated that the /root/vaconfig.yaml file had contents with cat /root/vaconfig.yaml.
  • Reset Kubernetes once again using: vracli cluster leave.
    • Again this takes a little while to run.
  • Run to Install the VA config: kubectl apply -f /root/vaconfig.yaml --force.
    • The server seemed to be running after the previous leave step, but check its status at this point and wait for the kube-apiserver container to exist and be running.
  • Run vracli license to confirm that VA config is installed properly.
  • Run: /opt/scripts/deploy.sh.
    • This takes a while time to run, but you should see that prelude has been successfully deployed.
    • I had issues accessing the vRA UI, so I rebooted and it resolved itself.

#Resources

Share to TwitterShare to FacebookShare to LinkedinShare to PocketCopy URL address
Written by

Sam Perrin@samperrin

Automation Consultant, currently working at Xtravirt. Interested in all things automation/devops related.

Related Posts