Troubleshooting for Platform Administrators¶

For Elastisys Self-Managed Customers

Please start by running these commands.

If you are struggling, don't hesitate to file a ticket.

You can run the following command from the compliantkubernetes-apps repository to collect diagnostic information that will help us support you.

CK8S_PGP_FP=<fingerprint provided during onboarding> ./bin/ck8s diagnostics [sc|wc]

Please also send us your terminal session as text. We need to see both the commands you typed and their output.

Help! Something is wrong with my Compliant Kubernetes cluster. Fear no more: this guide will help you make sense of it.

This guide assumes that:

You have the prerequisites installed.
Your environment variables are set: in particular, CK8S_CONFIG_PATH points to your config folder and CLUSTER is set to either sc or wc.
Your config folder is available.
compliantkubernetes-apps and compliantkubernetes-kubespray are available.
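
The environment assumptions above can be checked before running anything else. The following is a minimal sketch: check_env is a hypothetical helper name, not part of the ck8s CLI, and it only covers CK8S_CONFIG_PATH and CLUSTER.

```shell
# check_env is a hypothetical helper, not part of the ck8s CLI.
# It returns non-zero and prints a reason if an assumption does not hold.
check_env() {
  if [ -z "${CK8S_CONFIG_PATH:-}" ]; then
    echo "CK8S_CONFIG_PATH is not set" >&2
    return 1
  fi
  if [ ! -d "$CK8S_CONFIG_PATH" ]; then
    echo "config folder $CK8S_CONFIG_PATH does not exist" >&2
    return 1
  fi
  case "${CLUSTER:-}" in
    sc | wc) ;;
    *)
      echo "CLUSTER must be sc or wc" >&2
      return 1
      ;;
  esac
}

# Usage:
#   check_env && echo "environment looks sane"
```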

Important

./bin/ck8s refers to the compliantkubernetes-apps CLI.
./bin/ck8s-kubespray refers to the compliantkubernetes-kubespray CLI.

Important

Some of the ansible commands below might require root privileges. To run commands as a privileged user with ansible, use the --become (-b) flag.

Example: ansible -i inventory.ini -b all -m ping

I have no clue where to start¶

If you get lost, start checking from the "physical layer" and up.

Are the Nodes still accessible via SSH?¶

ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m ping

Are the Nodes "doing fine"?¶

dmesg should not display unexpected messages. Out-of-memory (OOM) kills show up here.

ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; dmesg | tail -n 10'

Uptime should show high uptime (e.g., days) and low load (e.g., less than 3):

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; uptime'
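
When there are many Nodes, reading every uptime line by eye gets tedious. The sketch below filters the output instead. load_below is a hypothetical helper, and it assumes the usual English "load average: a, b, c" format with a dot as decimal separator.

```shell
# load_below is a hypothetical helper. It reads `uptime` output on stdin
# (one or many hosts) and succeeds only if every 1-minute load average
# is below the threshold given as first argument (default: 3).
load_below() {
  awk -v max="${1:-3}" -F'load average: ' '
    NF > 1 {
      split($2, load, ", ")          # load[1] is the 1-minute average
      found = 1
      if (load[1] + 0 >= max) bad = 1
    }
    END { exit ((found && !bad) ? 0 : 1) }
  '
}

# Usage: pipe the ansible command above into the helper, e.g.
#   ansible ... -a 'echo; hostname; uptime' | load_below 3 || echo "high load somewhere"
```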

Is any process using too much CPU?

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; ps -Ao user,uid,comm,pid,pcpu,tty --sort=-pcpu | head -n 6'

Is there enough disk space? All writeable file-systems should have at least 30% free.

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; df -h'
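
The 30% rule can be applied mechanically to the df output. This is a sketch only: df_over is a hypothetical helper, and skipping /dev/loop devices (typically read-only snap mounts that always show 100%) is an assumption about your Nodes.

```shell
# df_over is a hypothetical helper. It reads `df -h` output on stdin and
# prints the mount point and usage of every file-system that is more than
# LIMIT percent full (default 70, i.e. less than 30% free).
df_over() {
  awk -v limit="${1:-70}" '
    # Only consider data rows: column 5 looks like "80%".
    $5 ~ /^[0-9]+%$/ && $1 !~ /^\/dev\/loop/ {
      used = $5
      sub(/%/, "", used)
      if (used + 0 > limit) print $6, $5
    }
  '
}

# Usage: pipe the ansible command above into the helper, e.g.
#   ansible ... -a 'echo; hostname; df -h' | df_over 70
```

Note that the helper prints mount points without host names, so use it to spot a problem and then look at the full output for the affected Node.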

Is there enough available memory? There should be at least a few GB of available memory.

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; cat /proc/meminfo | grep Available'
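
/proc/meminfo reports values in kB, which is awkward to compare against "a few GB". The sketch below converts and checks the value. mem_at_least is a hypothetical helper, and the default threshold of 2 GiB is an assumption, not a recommendation from this guide.

```shell
# mem_at_least is a hypothetical helper. It reads /proc/meminfo lines on
# stdin, prints MemAvailable in GiB for each occurrence, and succeeds only
# if every value is at least MIN GiB (default: 2, an assumed threshold).
mem_at_least() {
  awk -v min_gib="${1:-2}" '
    /^MemAvailable:/ {
      found = 1
      gib = $2 / 1048576             # kB -> GiB
      printf "%.1f GiB available\n", gib
      if (gib < min_gib) low = 1
    }
    END { exit ((found && !low) ? 0 : 1) }
  '
}

# Usage:
#   ansible ... -a 'echo; hostname; cat /proc/meminfo | grep Available' | mem_at_least 2
```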

Can Nodes access the Internet?

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; curl --silent  https://checkip.amazonaws.com'

Do the Nodes have the correct time? You should see System clock synchronized: yes and NTP service: active.

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; timedatectl status'
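
The two expected lines can be checked automatically across all Nodes. A minimal sketch, assuming the timedatectl output format quoted above; time_ok is a hypothetical helper name.

```shell
# time_ok is a hypothetical helper. It reads `timedatectl status` output on
# stdin (one or many hosts) and succeeds only if every
# "System clock synchronized" line says "yes" and every "NTP service" line
# says "active".
time_ok() {
  awk '
    /System clock synchronized:/ { seen = 1; if ($NF != "yes") bad = 1 }
    /NTP service:/               { if ($NF != "active") bad = 1 }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Usage:
#   ansible ... -a 'echo; timedatectl status' | time_ok || echo "check time sync"
```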

Is the base OS doing fine?¶

We generally run the latest Ubuntu LTS, at the time of this writing Ubuntu 20.04 LTS.

You can confirm this by doing:

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'cat /etc/lsb-release'

Are systemd units running fine? You should see running and not degraded.

ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'systemctl is-system-running'

Are the Kubernetes clusters doing fine?¶

Are the Nodes reporting in on Kubernetes? All Kubernetes Nodes, both control-plane and workers, should be Ready:

./bin/ck8s ops kubectl $CLUSTER get nodes
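
On larger clusters it helps to print only the Nodes that are not Ready. The sketch below assumes the default kubectl column layout (NAME, STATUS, ...); not_ready_nodes is a hypothetical helper. A cordoned Node (Ready,SchedulingDisabled) is also listed, which is usually what you want to know.

```shell
# not_ready_nodes is a hypothetical helper. It reads `kubectl get nodes`
# output on stdin and prints name and status of every Node whose STATUS
# column is not exactly "Ready".
not_ready_nodes() {
  awk 'NR > 1 && $2 != "Ready" { print $1, $2 }'
}

# Usage:
#   ./bin/ck8s ops kubectl $CLUSTER get nodes | not_ready_nodes
```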

Is Rook doing fine?¶

If Rook is installed, is Rook doing fine? You should see HEALTH_OK.

export CK8S_KUBESPRAY_PATH=/path/to/compliantkubernetes-kubespray

./bin/ck8s ops kubectl $CLUSTER -n rook-ceph apply -f $CK8S_KUBESPRAY_PATH/rook/toolbox-deploy.yaml

Once the Pod is Ready run:

./bin/ck8s ops kubectl $CLUSTER -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

Are Kubernetes Pods doing fine?¶

Pods should be Running or Completed, and fully Ready (e.g., 1/1 or 6/6).

./bin/ck8s ops kubectl $CLUSTER get --all-namespaces pods
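
The rule above (Running and fully Ready, or Completed) can be expressed as a filter over the Pod listing. This is a sketch under the assumption of the default --all-namespaces column layout (NAMESPACE, NAME, READY, STATUS, ...); unhealthy_pods is a hypothetical helper.

```shell
# unhealthy_pods is a hypothetical helper. It reads
# `kubectl get --all-namespaces pods` output on stdin and prints every Pod
# that is neither Completed nor Running with all containers Ready.
unhealthy_pods() {
  awk '
    NR > 1 {
      if ($4 == "Completed") next
      split($3, ready, "/")          # e.g. "1/2" -> ready[1]=1, ready[2]=2
      if ($4 != "Running" || ready[1] != ready[2])
        print $1 "/" $2, $3, $4
    }
  '
}

# Usage:
#   ./bin/ck8s ops kubectl $CLUSTER get --all-namespaces pods | unhealthy_pods
```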

Are all Deployments fine? Deployments should show all Pods Ready, Up-to-date and Available (e.g., 2/2 2 2).

./bin/ck8s ops kubectl $CLUSTER get --all-namespaces deployments

Are all DaemonSets fine? For each DaemonSet, the Current, Ready and Up-to-date counts should all equal the Desired count.

./bin/ck8s ops kubectl $CLUSTER get --all-namespaces ds
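
The DaemonSet check boils down to comparing three columns against Desired. A minimal sketch, assuming the default --all-namespaces column layout (NAMESPACE, NAME, DESIRED, CURRENT, READY, UP-TO-DATE, ...); ds_mismatch is a hypothetical helper.

```shell
# ds_mismatch is a hypothetical helper. It reads
# `kubectl get --all-namespaces ds` output on stdin and prints every
# DaemonSet whose Current, Ready or Up-to-date count differs from Desired.
ds_mismatch() {
  awk '
    NR > 1 && ($4 != $3 || $5 != $3 || $6 != $3) {
      print $1 "/" $2, "desired=" $3, "current=" $4, "ready=" $5, "up-to-date=" $6
    }
  '
}

# Usage:
#   ./bin/ck8s ops kubectl $CLUSTER get --all-namespaces ds | ds_mismatch
```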

Are Helm Releases fine?¶

All Releases should be deployed.

./bin/ck8s ops helm $CLUSTER list --all --all-namespaces

Is cert-manager doing fine?¶

Are (Cluster)Issuers fine? All Resources should be READY=True or valid.

./bin/ck8s ops kubectl $CLUSTER get clusterissuers,issuers,certificates,orders,challenges --all-namespaces

Where do I find the Nodes' public and private IPs?¶

find . -name inventory.ini

or

ansible-inventory -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini --list all

ansible_host is usually the public IP, while ip is usually the private IP.
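
For a static inventory.ini, both addresses can be pulled out per Node. The sketch below assumes Kubespray-style host lines of the form "name ansible_host=... ip=..."; inventory_ips is a hypothetical helper and does not handle dynamic inventories.

```shell
# inventory_ips is a hypothetical helper. It reads an inventory.ini on stdin
# and prints "name ansible_host ip" for every host line, skipping comments.
inventory_ips() {
  awk '
    !/^[[:space:]]*[#;]/ && /ansible_host=/ {
      pub = ""; priv = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^ansible_host=/) pub = substr($i, 14)   # after "ansible_host="
        if ($i ~ /^ip=/)           priv = substr($i, 4)   # after "ip="
      }
      print $1, pub, priv
    }
  '
}

# Usage:
#   inventory_ips < ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
```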

Node cannot be accessed via SSH¶

Important

Make sure it is "not you". Are you well connected to the VPN? Is this the only Node which lost SSH access?

Important

If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or you are really sure you are restarting the right Node.

Try connecting to the unhealthy Node via a different Node and internal IP:

UNHEALTHY_NODE=172.0.10.205  # You lost access to this one
JUMP_NODE=89.145.xxx.yyy  # You have access to this one

ssh -J ubuntu@$JUMP_NODE ubuntu@$UNHEALTHY_NODE

Try rebooting the Node via Infrastructure Provider specific CLI:

UNHEALTHY_NODE=cksc-worker-2

# Example for Exoscale
exo vm reboot --force $UNHEALTHY_NODE

If using Rook, make sure its health goes back to HEALTH_OK.

A Node has incorrect time¶

Incorrect time on a Node can have severe consequences for replication and monitoring. In fact, if you follow ISO 27001, A.12.4.4 Clock Synchronisation requires you to ensure clocks are synchronized.

These days, Linux distributions should come out-of-the-box with timesyncd for time synchronization via NTP.

To figure out what is wrong, SSH into the target Node and try the following:

sudo systemctl status systemd-timesyncd
sudo journalctl --unit systemd-timesyncd
sudo timedatectl status
sudo timedatectl timesync-status
sudo timedatectl show-timesync

Possible causes include incorrect NTP server settings or NTP being blocked by a firewall. As a reminder, NTP uses UDP port 123.

Node seems not fine¶

Important

If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or you are really sure you are restarting the right Node.

Try rebooting the Node:

UNHEALTHY_NODE=89.145.xxx.yyy

ssh ubuntu@$UNHEALTHY_NODE sudo reboot

If using Rook, make sure its health goes back to HEALTH_OK.

Node seems really not fine. I want a new one¶

Is it 2 AM? Do not replace Nodes; instead, simply add a new one. When replacing, you might run out of capacity, you might lose redundancy, or you might replace the wrong Node. Prefer to add a Node and see if that solves the problem.

Okay, I want to add a new Node¶

Prefer this option if you "quickly" need to add CPU, memory or storage (i.e., Rook) capacity.

First, check for infrastructure drift, as shown here.

Then follow the steps below. Depending on your provider, if the infrastructure is not managed by Terraform, you can skip to step 3.

1. Add a new Node by editing the *.tfvars.
2. Re-apply Terraform.
3. Add the new Node to the inventory.ini (skip this step if the cluster is using a dynamic inventory).
4. Re-apply Kubespray only for the new Node.

cd [compliantkubernetes-kubespray-root-dir]

CLUSTER=[sc | wc]

./bin/ck8s-kubespray run-playbook $CLUSTER facts.yml
./bin/ck8s-kubespray run-playbook $CLUSTER scale.yml -b --limit=[new_node_name]

5. Add SSH keys to the new Node if necessary.

./bin/ck8s-kubespray apply-ssh $CLUSTER --limit=[new_node_name]

6. Update the Network Policies.

cd [compliantkubernetes-apps-root-dir]

./bin/ck8s update-ips sc update
./bin/ck8s update-ips wc update

./bin/ck8s ops helmfile sc -l app=common-np -i apply
./bin/ck8s ops helmfile wc -l app=common-np -i apply

./bin/ck8s ops helmfile sc -l app=service-cluster-np -i apply
# or
./bin/ck8s ops helmfile wc -l app=workload-cluster-np -i apply

7. Check that the new Node joined the cluster, as shown here.

A systemd unit failed¶

SSH into the Node. Check which systemd unit is failing:

systemctl --failed

Gather more information:

FAILED_UNIT=fwupd-refresh.service

systemctl status $FAILED_UNIT
journalctl --unit $FAILED_UNIT

Rook seems not fine¶

Please check the following upstream documents:

Pod seems not fine¶

Make sure you are on the right cluster:

echo $CK8S_CONFIG_PATH
echo $CLUSTER

Find the name of the Pod which is not fine:

./bin/ck8s ops kubectl $CLUSTER get pod -A

# Copy-paste the Pod and Pod namespace below
UNHEALTHY_POD=prometheus-kube-prometheus-stack-prometheus-0
UNHEALTHY_POD_NAMESPACE=monitoring

Gather some "evidence" for later diagnostics, when the heat is over:

./bin/ck8s ops kubectl $CLUSTER describe pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
./bin/ck8s ops kubectl $CLUSTER logs -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD

Try to kill the Pod and check whether the underlying Deployment, StatefulSet or DaemonSet restarts it:

./bin/ck8s ops kubectl $CLUSTER delete pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
./bin/ck8s ops kubectl $CLUSTER get pod -A --watch

Helm Release is `failed`¶

Make sure you are on the right cluster:

echo $CK8S_CONFIG_PATH
echo $CLUSTER

Find the failed Release:

./bin/ck8s ops helm $CLUSTER ls --all-namespaces --all

FAILED_RELEASE=user-rbac
FAILED_RELEASE_NAMESPACE=kube-system

Just to make sure, do a drift check, as shown here.

Remove the failed Release:

./bin/ck8s ops helm $CLUSTER uninstall -n $FAILED_RELEASE_NAMESPACE $FAILED_RELEASE

Re-apply apps according to documentation.

cert-manager is not fine¶

Follow cert-manager's troubleshooting, specifically:

Failed to perform self check: no such host¶

If kubectl describe challenges -A shows an error similar to the one below:

Waiting for HTTP-01 challenge propagation: failed to perform self check
    GET request ''http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls'':
    Get "http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls":
    dial tcp: lookup opensearch.domain on 10.177.0.3:53: no such host'

Then you might have a DNS issue inside your cluster. Make sure that global.clusterDns in common-config.yaml is set to the CoreDNS Service IP returned by kubectl get svc -n kube-system coredns.

Failed to perform self check: connection timed out¶

If kubectl describe challenges -A shows an error similar to the one below:

Reason: Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA': Get "http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA": dial tcp 18.192.17.98:80: connect: connection timed out

Then your Kubernetes data plane Nodes cannot connect to themselves via the IP address of the load-balancer that fronts them. The easiest fix is to configure the load-balancer's IP address on the loopback interface of each Node. (See example here.)

How do I check if infrastructure drifted due to manual intervention?¶

Go to the docs of the Infrastructure Provider and run terraform plan instead of terraform apply. For Exoscale, it looks as follows:

TF_SCRIPTS_DIR=$(readlink -f compliantkubernetes-kubespray/kubespray/contrib/terraform/exoscale)
pushd ${TF_SCRIPTS_DIR}
export TF_VAR_inventory_file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
terraform init
terraform plan \
    -var-file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/cluster.tfvars \
    -state=${CK8S_CONFIG_PATH}/${CLUSTER}-config/terraform.tfstate
popd
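
If you want to script the drift check rather than read the plan by eye, terraform plan accepts -detailed-exitcode, which encodes the result in the exit code. The sketch below only interprets that code; drift_status is a hypothetical helper and the messages are illustrative.

```shell
# drift_status is a hypothetical helper that interprets the exit code of
# `terraform plan -detailed-exitcode`:
#   0 = plan succeeded, no changes
#   1 = error
#   2 = plan succeeded, changes are present (i.e., drift)
drift_status() {
  case "$1" in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;
    *) echo "plan failed" ;;
  esac
}

# Usage, in place of the plain `terraform plan` above:
#   terraform plan -detailed-exitcode \
#       -var-file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/cluster.tfvars \
#       -state=${CK8S_CONFIG_PATH}/${CLUSTER}-config/terraform.tfstate
#   drift_status $?
```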

How do I check if the Kubespray setup drifted due to manual intervention?¶

At the time of this writing, this cannot be done, but efforts are underway.

How do I check if `apps` drifted due to manual intervention?¶

# For the Management Cluster
./bin/ck8s ops helmfile sc diff  # Respond "n" if you get WARN

# For the Workload Clusters
./bin/ck8s ops helmfile wc diff  # Respond "n" if you get WARN

Velero backup stuck in progress¶

Velero is known to get stuck in the InProgress status when doing backups:

velero backup get

NAME                                 STATUS             ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR
velero-daily-backup-20211005143248   InProgress         0        0          2021-10-05 14:32:48 +0200 CEST   29d       default            !nobackup
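
With many backups in the listing, the stuck ones are easier to find with a filter. A minimal sketch, assuming the column layout shown above (NAME, STATUS, ...); stuck_backups is a hypothetical helper. Whether an InProgress backup is actually stuck or just still running is for you to judge from its CREATED time.

```shell
# stuck_backups is a hypothetical helper. It reads `velero backup get`
# output on stdin and prints the name of every backup whose STATUS is
# InProgress.
stuck_backups() {
  awk 'NR > 1 && $2 == "InProgress" { print $1 }'
}

# Usage:
#   velero backup get | stuck_backups
```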

First, try to delete the backup:

velero backup delete velero-daily-backup-20211005143248

Then delete all the Pods in the velero namespace:

./bin/ck8s ops kubectl wc delete pods -n velero --all

Check that the backup is gone:

velero backup get

NAME                                 STATUS             ERRORS   WARNINGS   CREATED                          EXPIRES   STORAGE LOCATION   SELECTOR

Recreate the backup from a schedule:

velero backup create --from-schedule velero-daily-backup

How do I use `kubectl` and `helm` directly?¶

This guide makes heavy use of the compliantkubernetes-apps CLI to access and control Compliant Kubernetes clusters. However, you can use kubectl and helm directly, by exporting a KUBECONFIG like so:

export KUBECONFIG=${CK8S_CONFIG_PATH}/.state/kube_config_${CLUSTER}.yaml

kubectl get pods -A

helm list -A
