Troubleshooting Tools¶
Help! Something is wrong with my Compliant Kubernetes cluster. Fear not, this guide will help you make sense of what is going on.
This guide assumes that:
- You have the prerequisites installed.
- Your environment variables, in particular `CK8S_CONFIG_PATH`, are set.
- Your config folder (e.g., for OpenStack) is available.
- `compliantkubernetes-apps` and `compliantkubernetes-kubespray` are available.
I have no clue where to start¶
If you get lost, start checking from the "physical layer" and up.
Are the Nodes still accessible via SSH?¶
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m ping
done
Are the Nodes "doing fine"?¶
`dmesg` should not display unexpected messages. OOM kills will show up here.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; dmesg | tail -n 10'
done
Uptime should show high uptime (e.g., days) and low load (e.g., less than 3):
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; uptime'
done
Is any process using too much CPU?
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; ps -Ao user,uid,comm,pid,pcpu,tty --sort=-pcpu | head -n 6'
done
Is there enough disk space? All writable file systems should have at least 30% free.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; df -h'
done
Is there enough available memory? There should be at least a few GB of available memory.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; cat /proc/meminfo | grep Available'
done
Can Nodes access the Internet?
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; curl --silent https://checkip.amazonaws.com'
done
Do the Nodes have the correct time? You should see `System clock synchronized: yes` and `NTP service: active`.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; timedatectl status'
done
Is the base OS doing fine?¶
We generally run the latest Ubuntu LTS, at the time of this writing Ubuntu 20.04 LTS.
You can confirm this by doing:
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'cat /etc/lsb-release'
done
Are systemd units running fine? You should see `running` and not `degraded`.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'systemctl is-system-running'
done
Are the Kubernetes clusters doing fine?¶
Are the Nodes reporting in on Kubernetes? All Kubernetes Nodes, both control-plane and workers, should be `Ready`:
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} get nodes'
done
Is Rook doing fine?¶
If Rook is installed, is Rook doing fine? You should see `HEALTH_OK`.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} -n rook-ceph apply -f ./compliantkubernetes-kubespray/rook/toolbox-deploy.yaml'
done
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} -n rook-ceph exec deploy/rook-ceph-tools -- ceph status'
done
Are Kubernetes Pods doing fine?¶
Pods should be `Running` or `Completed`, and fully Ready (e.g., `1/1` or `6/6`).
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} get --all-namespaces pods'
done
Are all Deployments fine? Deployments should show all Pods Ready, Up-to-date and Available (e.g., `2/2 2 2`).
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} get --all-namespaces deployments'
done
Are all DaemonSets fine? DaemonSets should show as many Pods Current, Ready and Up-to-date as Desired.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} get --all-namespaces ds'
done
Are Helm Releases fine?¶
All Releases should be `deployed`.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
export KUBECONFIG=kube_config_$CLUSTER.yaml
sops -d ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml > $KUBECONFIG
helm list --all --all-namespaces
shred $KUBECONFIG
done
Is cert-manager doing fine?¶
Are (Cluster)Issuers fine? All Resources should be `READY=True` or `valid`.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
export KUBECONFIG=kube_config_$CLUSTER.yaml
sops -d ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml > $KUBECONFIG
kubectl get clusterissuers,issuers,certificates,orders,challenges --all-namespaces
shred $KUBECONFIG
done
Where do I find the Nodes' public and private IPs?¶
find . -name inventory.ini
or
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible-inventory -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini --list all
done
`ansible_host` is usually the public IP, while `ip` is usually the private IP.
Node cannot be accessed via SSH¶
Important
Make sure it is "not you". Are you well connected to the VPN? Is this the only Node which lost SSH access?
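For example, you could re-run the SSH reachability check from the top of this guide to confirm whether other Nodes are still reachable:
# Re-check SSH reachability across all clusters; if everything fails, suspect your VPN or network first.
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m ping
done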
Important
If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or that you are really sure you are restarting the right Node.
Try connecting to the unhealthy Node via a different Node and internal IP:
UNHEALTHY_NODE=172.0.10.205 # You lost access to this one
JUMP_NODE=89.145.xxx.yyy # You have access to this one
ssh -J ubuntu@$JUMP_NODE ubuntu@$UNHEALTHY_NODE
Try rebooting the Node via cloud provider specific CLI:
UNHEALTHY_NODE=cksc-worker-2
# Example for ExoScale
exo vm reboot --force $UNHEALTHY_NODE
If using Rook, make sure its health goes back to `HEALTH_OK`.
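A minimal sketch to re-check Ceph health on just the affected cluster, assuming the Rook toolbox Deployment from above is installed (`cksc` is an example cluster name):
CLUSTER=cksc  # example: the cluster whose Node you rebooted
sops exec-file ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml \
'kubectl --kubeconfig {} -n rook-ceph exec deploy/rook-ceph-tools -- ceph health'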
A Node has incorrect time¶
Incorrect time on a Node can have severe consequences for replication and monitoring. In fact, if you follow ISO 27001, A.12.4.4 Clock Synchronisation requires you to ensure clocks are synchronized.
These days, Linux distributions should come out-of-the-box with timesyncd for time synchronization via NTP.
To figure out what is wrong, SSH into the target Node and try the following:
sudo systemctl status systemd-timesyncd
sudo journalctl --unit systemd-timesyncd
sudo timedatectl status
sudo timedatectl timesync-status
sudo timedatectl show-timesync
Possible causes include incorrect NTP server settings, or NTP being blocked by a firewall. As a reminder, NTP works over UDP port 123.
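If the configured server is the problem, a minimal sketch of a fix, assuming systemd-timesyncd with its default configuration file, is to point it at an NTP server your firewall allows (ntp.ubuntu.com is only an example) and restart it:
# Set the NTP server in /etc/systemd/timesyncd.conf (assumed default path) and restart the service
sudo sed -i 's/^#\?NTP=.*/NTP=ntp.ubuntu.com/' /etc/systemd/timesyncd.conf
sudo systemctl restart systemd-timesyncd
sudo timedatectl timesync-status  # should now report the server and successful synchronization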
Node seems not fine¶
Important
If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or that you are really sure you are restarting the right Node.
Try rebooting the Node:
UNHEALTHY_NODE=89.145.xxx.yyy
ssh ubuntu@$UNHEALTHY_NODE sudo reboot
If using Rook, make sure its health goes back to `HEALTH_OK`.
Node seems really not fine. I want a new one.¶
Is it 2AM? Do not replace Nodes; instead, simply add a new one. Replacing a Node risks running out of capacity, losing redundancy, or removing the wrong Node. Prefer to add a Node and see if that solves the problem.
Okay, I want to add a new Node.¶
Prefer this option if you "quickly" need to add CPU, memory or storage (i.e., Rook) capacity.
First, check for infrastructure drift, as shown here.
Depending on your provider:
- Add a new Node by editing the `*.tfvars`.
- Re-apply Terraform (see the sketch after this list).
- Re-create the `inventory.ini` (skip this step if the cluster is using a dynamic inventory).
- Re-apply Kubespray.
- Re-fix the Kubernetes API URL.
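For the Terraform step, a sketch for Exoscale could look as follows; it mirrors the drift-check block further down, but runs apply instead of plan (`cksc` is an example cluster name):
TF_SCRIPTS_DIR=$(readlink -f compliantkubernetes-kubespray/kubespray/contrib/terraform/exoscale)
CLUSTER=cksc  # example: the cluster you are growing
pushd ${TF_SCRIPTS_DIR}
export TF_VAR_inventory_file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
terraform init
terraform apply \
-var-file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/cluster.tfvars \
-state=${CK8S_CONFIG_PATH}/${CLUSTER}-config/terraform.tfstate
popd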
Check that the new Node joined the cluster, as shown here.
A systemd unit failed¶
SSH into the Node. Check which systemd unit is failing:
systemctl --failed
Gather more information:
FAILED_UNIT=fwupd-refresh.service
systemctl status $FAILED_UNIT
journalctl --unit $FAILED_UNIT
Rook seems not fine¶
Please check the following upstream documents:
Pod seems not fine¶
Before starting, set up a handy environment:
CLUSTER=cksc # Cluster containing the unhealthy Pod
export KUBECONFIG=kube_config_$CLUSTER.yaml
sops -d ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml > $KUBECONFIG
Check that you are on the right cluster:
kubectl get nodes
Find the name of the Pod which is not fine:
kubectl get pod -A
# Copy-paste the Pod and Pod namespace below
UNHEALTHY_POD=prometheus-kube-prometheus-stack-prometheus-0
UNHEALTHY_POD_NAMESPACE=monitoring
Gather some "evidence" for later diagnostics, when the heat is over:
kubectl describe pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
kubectl logs -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
Try to kill the Pod and check whether the underlying Deployment, StatefulSet or DaemonSet restarts it:
kubectl delete pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
kubectl get pod -A --watch
Helm Release is `failed`¶
Before starting, set up a handy environment:
CLUSTER=cksc # Cluster containing the failed Release
export KUBECONFIG=kube_config_$CLUSTER.yaml
sops -d ${CK8S_CONFIG_PATH}/.state/kube_config_$CLUSTER.yaml > $KUBECONFIG
Check that you are on the right cluster:
kubectl get nodes
Find the failed Release:
helm ls --all-namespaces --all
FAILED_RELEASE=user-rbac
FAILED_RELEASE_NAMESPACE=kube-system
Just to make sure, do a drift check, as shown here.
Remove the failed Release:
helm uninstall -n $FAILED_RELEASE_NAMESPACE $FAILED_RELEASE
Re-apply `apps` according to the documentation.
cert-manager is not fine¶
Follow cert-manager's troubleshooting, specifically:
Failed to perform self check: no such host¶
If `kubectl describe challenges -A` gives you an error similar to the one below:
Waiting for HTTP-01 challenge propagation: failed to perform self check
GET request ''http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls'':
Get "http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls":
dial tcp: lookup opensearch.domain on 10.177.0.3:53: no such host'
Then you might have a DNS issue inside your cluster. Make sure that `global.clusterDns` in `common-config.yaml` is set to the CoreDNS Service IP returned by `kubectl get svc -n kube-system coredns`.
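A minimal sketch for comparing the two values, assuming your apps configuration lives directly under $CK8S_CONFIG_PATH:
# CoreDNS Service IP as seen by the cluster
kubectl get svc -n kube-system coredns -o jsonpath='{.spec.clusterIP}{"\n"}'
# What is currently configured (file location is an assumption; adjust to your setup)
grep 'clusterDns' ${CK8S_CONFIG_PATH}/common-config.yaml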
Failed to perform self check: connection timed out¶
If `kubectl describe challenges -A` gives you an error similar to the one below:
Reason: Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA': Get "http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA": dial tcp 18.192.17.98:80: connect: connection timed out
Then your Kubernetes data plane Nodes cannot connect to themselves with the IP address of the load-balancer that fronts them. The easiest fix is to configure the load-balancer's IP address on the loopback interface of each Node. (See example here.)
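A minimal sketch of that workaround on a single Node; 203.0.113.10 is a placeholder for your load-balancer's IP, and making the change persistent (e.g., via netplan or cloud-init) is left out:
# Add the load-balancer IP to the loopback interface so the Node can reach "itself"
sudo ip address add 203.0.113.10/32 dev lo
ip address show dev lo  # verify the address was added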
How do I check if infrastructure drifted due to manual intervention?¶
Go to the docs of the cloud provider and run `terraform plan` instead of `apply`. For Exoscale, it looks as follows:
TF_SCRIPTS_DIR=$(readlink -f compliantkubernetes-kubespray/kubespray/contrib/terraform/exoscale)
for CLUSTER in ${SERVICE_CLUSTER} "${WORKLOAD_CLUSTERS[@]}"; do
pushd ${TF_SCRIPTS_DIR}
export TF_VAR_inventory_file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
terraform init
terraform plan \
-var-file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/cluster.tfvars \
-state=${CK8S_CONFIG_PATH}/${CLUSTER}-config/terraform.tfstate
popd
done
How do I check if the Kubespray setup drifted due to manual intervention?¶
At the time of this writing, this cannot be done, but efforts are underway.
How do I check if `apps` drifted due to manual intervention?¶
# For service cluster
ln -sf $CK8S_CONFIG_PATH/.state/kube_config_${SERVICE_CLUSTER}.yaml $CK8S_CONFIG_PATH/.state/kube_config_sc.yaml
./compliantkubernetes-apps/bin/ck8s ops helmfile sc diff # Respond "n" if you get WARN
# For the workload clusters
for CLUSTER in "${WORKLOAD_CLUSTERS[@]}"; do
ln -sf $CK8S_CONFIG_PATH/.state/kube_config_${CLUSTER}.yaml $CK8S_CONFIG_PATH/.state/kube_config_wc.yaml
./compliantkubernetes-apps/bin/ck8s ops helmfile wc diff # Respond "n" if you get WARN
done
Velero backup stuck in progress¶
Velero is known to get stuck `InProgress` when doing backups:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-daily-backup-20211005143248 InProgress 0 0 2021-10-05 14:32:48 +0200 CEST 29d default !nobackup
First, try to delete the backup:
./velero backup delete velero-daily-backup-20211005143248
Then, kill all the Pods under the velero namespace:
./compliantkubernetes-apps/bin/ck8s ops kubectl wc delete pods -n velero --all
Check that the backup is gone:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
Recreate the backup from a schedule:
velero backup create --from-schedule velero-daily-backup
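Finally, verify that the new backup eventually reaches Completed instead of getting stuck InProgress again:
velero backup get  # STATUS should eventually show Completed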