Troubleshooting for Platform Administrators¶
For Elastisys Self-Managed Customers
Please start by running these commands.
If you are struggling, don't hesitate to file a ticket.
You can run the following command from the compliantkubernetes-apps repository to collect diagnostic information that will help us support you.
CK8S_PGP_FP=<fingerprint provided during onboarding> ./bin/ck8s diagnostics <wc|sc>
Show more examples on using the diagnostics command
The ck8s diagnostics command accepts different flags to gather additional information from your environment. To see all available options, run:
./bin/ck8s diagnostics <wc|sc> --help
Some example use cases:
- To include config files found in CK8S_CONFIG_PATH:
  CK8S_PGP_FP=<fingerprints> ./bin/ck8s diagnostics <wc|sc> --include-config
- To retrieve more information such as YAML manifests for resources in a specific namespace, in this example ingress-nginx:
  CK8S_PGP_FP=<fingerprints> ./bin/ck8s diagnostics <wc|sc> --namespace ingress-nginx
Please also provide us with your terminal session in text format. We need to look both at the commands you typed and their output.
Help! Something is wrong with my Welkin cluster. Fear no more, this guide will help you make sense of it.
This guide assumes that:
- You have pre-requisites installed.
- Your environment variables are set: in particular, CK8S_CONFIG_PATH is set and CLUSTER is set to either sc or wc.
- Your config folder is available.
- compliantkubernetes-apps and compliantkubernetes-kubespray are available.
Important
./bin/ck8s references the compliantkubernetes-apps CLI.
./bin/ck8s-kubespray references the compliantkubernetes-kubespray CLI.
Important
For some of the Ansible commands below, you might require root privileges. To run commands as a privileged user with Ansible, use the --become (-b) flag.
Example:
ansible -i inventory.ini -b all -m ping
I have no clue where to start¶
If you get lost, start checking from the "physical layer" and up.
Are the Nodes still accessible via SSH?¶
ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m ping
Are the Nodes "doing fine"?¶
dmesg should not display unexpected messages. Out-of-memory (OOM) kills will show up here.
ansible -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; dmesg | tail -n 10'
Uptime should show high uptime (e.g., days) and low load (e.g., less than 3):
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; uptime'
Any process that uses too much CPU?
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; ps -Ao user,uid,comm,pid,pcpu,tty --sort=-pcpu | head -n 6'
Is there enough disk space? All writeable file-systems should have at least 30% free.
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; df -h'
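When many file-systems are listed, it can help to filter the output down to the ones that break the 30% rule. The following is a minimal sketch, not part of the Welkin tooling: low_disk_space is a hypothetical helper name, and it assumes POSIX-style df -P output (one line per file-system, mount points without spaces).

```shell
# Print mount points whose usage exceeds a threshold.
# Default threshold is 70% used, i.e. less than 30% free.
# Reads `df -P` output on stdin. Hypothetical helper, not part of ck8s.
low_disk_space() {
  threshold="${1:-70}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign from the Capacity column
    if (use + 0 > t) print $6, $5
  }'
}

# Example, run on a Node:
#   df -P | low_disk_space
```

An empty result means every listed file-system has at least 30% free.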
Is there enough available memory? There should be at least a few GB of available memory.
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; cat /proc/meminfo | grep Available'
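/proc/meminfo reports values in kB, which is awkward to compare against "a few GB". As a small sketch, the hypothetical helper below (mem_available_gib is not part of the Welkin tooling) converts the MemAvailable line to GiB.

```shell
# Convert the MemAvailable line of /proc/meminfo (in kB) to GiB.
# Reads /proc/meminfo content on stdin. Hypothetical helper.
mem_available_gib() {
  awk '$1 == "MemAvailable:" { printf "%.1f\n", $2 / 1048576 }'
}

# Example, run on a Node:
#   mem_available_gib < /proc/meminfo
```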
Can Nodes access the Internet?
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; hostname; curl --silent https://checkip.amazonaws.com'
Do the Nodes have the correct time? You should see System clock synchronized: yes and NTP service: active.
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'echo; timedatectl status'
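If you prefer a yes/no answer over reading the full timedatectl output, the check can be reduced to an exit code. This is a sketch under the assumption that the output contains the line System clock synchronized: yes as described above; clock_synced is a hypothetical helper name.

```shell
# Succeed (exit 0) only if the timedatectl output on stdin reports
# a synchronized system clock. Hypothetical helper.
clock_synced() {
  grep -Eq '^[[:space:]]*System clock synchronized:[[:space:]]*yes[[:space:]]*$'
}

# Example, run on a Node:
#   timedatectl status | clock_synced && echo "clock OK" || echo "clock NOT synchronized"
```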
Is the base OS doing fine?¶
We generally run the latest Ubuntu LTS, at the time of this writing Ubuntu 20.04 LTS.
You can confirm this by doing:
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'cat /etc/lsb-release'
Are systemd units running fine? You should see running and not degraded.
ansible -i $CK8S_CONFIG_PATH/${CLUSTER}-config/inventory.ini all -m shell -a 'systemctl is-system-running'
Are the Kubernetes clusters doing fine?¶
Are the Nodes reporting in on Kubernetes? All Kubernetes Nodes, both control-plane and workers, should be Ready:
./bin/ck8s ops kubectl $CLUSTER get nodes
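On larger clusters it may be easier to list only the Nodes that are not Ready. The sketch below filters the node listing; not_ready_nodes is a hypothetical helper, and it assumes the ck8s wrapper forwards the standard kubectl --no-headers flag unchanged.

```shell
# Print the name and status of every Node whose STATUS column is not exactly "Ready".
# Cordoned Nodes (Ready,SchedulingDisabled) are listed as well.
# Reads `kubectl get nodes --no-headers` output on stdin. Hypothetical helper.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Example:
#   ./bin/ck8s ops kubectl $CLUSTER get nodes --no-headers | not_ready_nodes
```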
Is Rook doing fine?¶
If Rook is installed, is Rook doing fine? You should see HEALTH_OK.
export CK8S_KUBESPRAY_PATH=/path/to/compliantkubernetes-kubespray
./bin/ck8s ops kubectl $CLUSTER -n rook-ceph apply -f $CK8S_KUBESPRAY_PATH/rook/toolbox-deploy.yaml
Once the Pod is Ready run:
./bin/ck8s ops kubectl $CLUSTER -n rook-ceph exec deploy/rook-ceph-tools -- ceph status
Are Kubernetes Pods doing fine?¶
Pods should be Running or Completed, and fully Ready (e.g., 1/1 or 6/6).
./bin/ck8s ops kubectl $CLUSTER get --all-namespaces pods
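The rule above can be applied mechanically to the Pod listing. The following sketch prints only the Pods that violate it; unhealthy_pods is a hypothetical helper, and it assumes the default kubectl column order (NAMESPACE, NAME, READY, STATUS) and that the ck8s wrapper forwards --no-headers.

```shell
# Print Pods that are neither Running nor Completed, or that are Running
# but not fully Ready (e.g., 1/2).
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin.
# Hypothetical helper.
unhealthy_pods() {
  awk '{
    split($3, ready, "/")      # READY column, e.g. "1/2"
    bad_status = ($4 != "Running" && $4 != "Completed")
    not_ready  = ($4 == "Running" && ready[1] != ready[2])
    if (bad_status || not_ready) print $1 "/" $2, $3, $4
  }'
}

# Example:
#   ./bin/ck8s ops kubectl $CLUSTER get --all-namespaces pods --no-headers | unhealthy_pods
```

Completed Pods are skipped on purpose, since they legitimately show 0/1 in the READY column.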
Are all Deployments fine? Deployments should show all Pods Ready, Up-to-date and Available (e.g., 2/2 2 2).
./bin/ck8s ops kubectl $CLUSTER get --all-namespaces deployments
Are all DaemonSets fine? DaemonSets should show as many Pods Current, Ready and Up-to-date as Desired.
./bin/ck8s ops kubectl $CLUSTER get --all-namespaces ds
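The DaemonSet comparison is easy to get wrong by eye, so here is a sketch that does it per row. unhealthy_daemonsets is a hypothetical helper; it assumes the default kubectl column order (NAMESPACE, NAME, DESIRED, CURRENT, READY, UP-TO-DATE) and that the ck8s wrapper forwards --no-headers.

```shell
# Print DaemonSets where CURRENT, READY or UP-TO-DATE differs from DESIRED.
# Reads `kubectl get ds --all-namespaces --no-headers` output on stdin.
# Hypothetical helper.
unhealthy_daemonsets() {
  awk '$3 != $4 || $3 != $5 || $3 != $6 {
    print $1 "/" $2, "desired=" $3, "current=" $4, "ready=" $5, "up-to-date=" $6
  }'
}

# Example:
#   ./bin/ck8s ops kubectl $CLUSTER get --all-namespaces ds --no-headers | unhealthy_daemonsets
```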
Are Helm Releases fine?¶
All Releases should be deployed.
./bin/ck8s ops helm $CLUSTER list --all --all-namespaces
Is cert-manager doing fine?¶
Are (Cluster)Issuers fine? All Resources should be READY=True or valid.
./bin/ck8s ops kubectl $CLUSTER get clusterissuers,issuers,certificates,orders,challenges --all-namespaces
Where do I find the Nodes' public and private IPs?¶
find . -name inventory.ini
or
ansible-inventory -i ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini --list all
ansible_host is usually the public IP, while ip is usually the private IP.
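For a quick overview, the two values can be pulled out of a static inventory.ini directly. This is only a sketch: inventory_ips is a hypothetical helper, and it assumes host lines of the form name ansible_host=... ip=... on a single line, which may not hold for every inventory (dynamic inventories in particular).

```shell
# Print "<node> <ansible_host> <ip>" for each host line of an inventory.ini.
# Reads the inventory on stdin. Hypothetical helper; assumes key=value
# host variables on the same line as the host name.
inventory_ips() {
  awk '/ansible_host=/ {
    pub = ""; priv = ""
    for (i = 2; i <= NF; i++) {
      if ($i ~ /^ansible_host=/) pub  = substr($i, 14)   # skip "ansible_host="
      if ($i ~ /^ip=/)           priv = substr($i, 4)    # skip "ip="
    }
    print $1, pub, priv
  }'
}

# Example:
#   inventory_ips < ${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
```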
Node cannot be accessed via SSH¶
Important
Make sure it is "not you". Are you well connected to the VPN? Is this the only Node which lost SSH access?
Important
If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or you are really sure you are restarting the right Node.
Try connecting to the unhealthy Node via a different Node and internal IP:
UNHEALTHY_NODE=172.0.10.205 # You lost access to this one
JUMP_NODE=89.145.xxx.yyy # You have access to this one
ssh -J ubuntu@$JUMP_NODE ubuntu@$UNHEALTHY_NODE
Try rebooting the Node via Infrastructure Provider specific CLI:
UNHEALTHY_NODE=cksc-worker-2
# Example for Exoscale
exo vm reboot --force $UNHEALTHY_NODE
If using Rook, make sure its health goes back to HEALTH_OK.
A Node has incorrect time¶
Incorrect time on a Node can have severe consequences for replication and monitoring. In fact, if you follow ISO 27001, A.12.4.4 Clock Synchronisation requires you to ensure clocks are synchronized.
These days, Linux distributions should come out-of-the-box with timesyncd for time synchronization via NTP.
To figure out what is wrong, SSH into the target Node and try the following:
sudo systemctl status systemd-timesyncd
sudo journalctl --unit systemd-timesyncd
sudo timedatectl status
sudo timedatectl timesync-status
sudo timedatectl show-timesync
Possible causes include incorrect NTP server settings or NTP being blocked by a firewall. As a reminder, NTP uses UDP port 123.
Node seems not fine¶
Important
If you are using Rook, it is usually set up with replication 2, which means it can tolerate one restarting Node. Make sure that either Rook is healthy or you are really sure you are restarting the right Node.
Try rebooting the Node:
UNHEALTHY_NODE=89.145.xxx.yyy
ssh ubuntu@$UNHEALTHY_NODE sudo reboot
If using Rook, make sure its health goes back to HEALTH_OK.
Node seems really not fine. I want a new one¶
Is it 2 AM? Do not replace Nodes; instead, simply add a new one. If you replace a Node, you might run out of capacity, you might lose redundancy, or you might replace the wrong Node. Prefer to add a Node and see if that solves the problem.
Okay, I want to add a new Node¶
Prefer this option if you "quickly" need to add CPU, memory or storage (i.e., Rook) capacity.
First, check for infrastructure drift, as shown here.
Depending on your provider, follow the steps below. If the infrastructure is not managed by Terraform, you can skip to step 3:
1. Add a new Node by editing the *.tfvars.
2. Re-apply Terraform.
3. Add the new node to the inventory.ini (skip this step if the cluster is using a dynamic inventory).
4. Re-apply Kubespray only for the new node.
   cd [welkin-kubespray-root-dir]
   CLUSTER=[sc | wc]
   ./bin/ck8s-kubespray run-playbook $CLUSTER facts.yml
   ./bin/ck8s-kubespray run-playbook $CLUSTER scale.yml -b --limit=[new_node_name]
5. Add SSH keys to the new node if necessary.
   ./bin/ck8s-kubespray apply-ssh $CLUSTER --limit=[new_node_name]
6. Update Network Policies.
   cd [welkin-apps-root-dir]
   ./bin/ck8s update-ips sc update
   ./bin/ck8s update-ips wc update
   ./bin/ck8s ops helmfile sc -l app=common-np -i apply
   ./bin/ck8s ops helmfile wc -l app=common-np -i apply
   ./bin/ck8s ops helmfile sc -l app=service-cluster-np -i apply
   # or
   ./bin/ck8s ops helmfile wc -l app=workload-cluster-np -i apply
Check that the new Node joined the cluster, as shown here.
A systemd unit failed¶
SSH into the Node. Check which systemd unit is failing:
systemctl --failed
Gather more information:
FAILED_UNIT=fwupd-refresh.service
systemctl status $FAILED_UNIT
journalctl --unit $FAILED_UNIT
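When several units have failed, the two commands above can be looped over all of them and the output saved for later. The sketch below is one way to do that; collect_failed_units is a hypothetical helper name, and the choice of 200 journal lines is arbitrary.

```shell
# For every failed systemd unit, save `systemctl status` and the last
# 200 journal lines into a directory, then print the unit name.
# Hypothetical helper; run on the Node itself.
collect_failed_units() {
  out_dir="${1:-failed-units}"
  mkdir -p "$out_dir"
  systemctl --failed --no-legend --plain | awk '{ print $1 }' | while read -r unit; do
    # `systemctl status` exits non-zero for failed units, which is expected here.
    systemctl status "$unit" --no-pager > "$out_dir/$unit.status.txt" 2>&1 || true
    journalctl --unit "$unit" --no-pager -n 200 > "$out_dir/$unit.journal.txt" 2>&1 || true
    echo "$unit"
  done
}

# Example:
#   collect_failed_units /tmp/failed-units
```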
Rook seems not fine¶
Please check the following upstream documents:
Pod seems not fine¶
Make sure you are on the right cluster:
echo $CK8S_CONFIG_PATH
echo $CLUSTER
Find the name of the Pod which is not fine:
./bin/ck8s ops kubectl $CLUSTER get pod -A
# Copy-paste the Pod and Pod namespace below
UNHEALTHY_POD=prometheus-kube-prometheus-stack-prometheus-0
UNHEALTHY_POD_NAMESPACE=monitoring
Gather some "evidence" for later diagnostics, when the heat is over:
./bin/ck8s ops kubectl $CLUSTER describe pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
./bin/ck8s ops kubectl $CLUSTER logs -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
Try to kill the Pod and check whether the underlying Deployment, StatefulSet or DaemonSet restarts it:
./bin/ck8s ops kubectl $CLUSTER delete pod -n $UNHEALTHY_POD_NAMESPACE $UNHEALTHY_POD
./bin/ck8s ops kubectl $CLUSTER get pod -A --watch
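Since deleting the Pod also deletes its logs, it is worth saving the evidence to files before killing it. The following is a minimal sketch: kc and collect_pod_evidence are hypothetical helpers, and the sketch assumes the ck8s wrapper forwards the standard kubectl flags --all-containers and --previous unchanged.

```shell
# Thin wrapper so the rest of the sketch reads like plain kubectl.
kc() { ./bin/ck8s ops kubectl "$CLUSTER" "$@"; }

# Save describe output and logs for one Pod into a directory and print
# the directory name. Hypothetical helper.
collect_pod_evidence() {
  ns="$1"
  pod="$2"
  out_dir="${3:-evidence-$(date +%Y%m%dT%H%M%S)}"
  mkdir -p "$out_dir"
  kc describe pod -n "$ns" "$pod" > "$out_dir/describe.txt" 2>&1 || true
  kc logs -n "$ns" "$pod" --all-containers > "$out_dir/logs.txt" 2>&1 || true
  # Logs of the previous container instance only exist after a restart,
  # so a failure here is not an error.
  kc logs -n "$ns" "$pod" --all-containers --previous > "$out_dir/logs-previous.txt" 2>&1 || true
  echo "$out_dir"
}

# Example:
#   collect_pod_evidence "$UNHEALTHY_POD_NAMESPACE" "$UNHEALTHY_POD"
```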
Helm Release is failed¶
Make sure you are on the right cluster:
echo $CK8S_CONFIG_PATH
echo $CLUSTER
Find the failed Release:
./bin/ck8s ops helm $CLUSTER ls --all-namespaces --all
FAILED_RELEASE=user-rbac
FAILED_RELEASE_NAMESPACE=kube-system
Just to make sure, do a drift check, as shown here.
Remove the failed Release:
./bin/ck8s ops helm $CLUSTER uninstall -n $FAILED_RELEASE_NAMESPACE $FAILED_RELEASE
Re-apply apps according to documentation.
cert-manager is not fine¶
Follow cert-manager's troubleshooting, specifically:
Failed to perform self check: no such host¶
If kubectl describe challenges -A shows an error similar to the one below:
Waiting for HTTP-01 challenge propagation: failed to perform self check
GET request ''http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls'':
Get "http://url/.well-known/acme-challenge/xVfDZoLlqs4tad2qOiCT4sjChNRausd5iNpbWuGm5ls":
dial tcp: lookup opensearch.domain on 10.177.0.3:53: no such host'
Then you might have a DNS issue inside your cluster. Make sure that global.clusterDns in common-config.yaml is set to the CoreDNS Service IP returned by kubectl get svc -n kube-system coredns.
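The comparison can be scripted so that a mismatch is hard to overlook. In this sketch, kc and check_cluster_dns are hypothetical helpers; the configured value is passed in by hand (copy it from global.clusterDns in common-config.yaml), and the Service IP is read with kubectl's standard jsonpath output.

```shell
# Thin wrapper so the rest of the sketch reads like plain kubectl.
kc() { ./bin/ck8s ops kubectl "$CLUSTER" "$@"; }

# Compare a configured DNS IP against the ClusterIP of the CoreDNS Service.
# Hypothetical helper; $1 is the value of global.clusterDns.
check_cluster_dns() {
  configured="$1"
  actual="$(kc get svc -n kube-system coredns -o jsonpath='{.spec.clusterIP}')"
  if [ "$configured" = "$actual" ]; then
    echo "OK: global.clusterDns matches CoreDNS Service IP $actual"
  else
    echo "MISMATCH: configured $configured, CoreDNS Service IP is $actual"
    return 1
  fi
}

# Example:
#   check_cluster_dns 10.177.0.3
```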
Failed to perform self check: connection timed out¶
If kubectl describe challenges -A shows an error similar to the one below:
Reason: Waiting for http-01 challenge propagation: failed to perform self check GET request 'http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA': Get "http://abc.com/.well-known/acme-challenge/Oej8tloD2wuHNBWS6eVhSKmGkZNfjLRemPmpJoHOPkA": dial tcp 18.192.17.98:80: connect: connection timed out
Then your Kubernetes data plane Nodes cannot connect to themselves with the IP address of the load-balancer that fronts them. The easiest fix is to configure the load-balancer's IP address on the loopback interface of each Node. (See example here.)
How do I check if infrastructure drifted due to manual intervention?¶
Go to the docs of the Infrastructure Provider and run Terraform plan instead of apply. For Exoscale, it looks as follows:
TF_SCRIPTS_DIR=$(readlink -f compliantkubernetes-kubespray/kubespray/contrib/terraform/exoscale)
pushd ${TF_SCRIPTS_DIR}
export TF_VAR_inventory_file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/inventory.ini
terraform init
terraform plan \
-var-file=${CK8S_CONFIG_PATH}/${CLUSTER}-config/cluster.tfvars \
-state=${CK8S_CONFIG_PATH}/${CLUSTER}-config/terraform.tfstate
popd
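To use the drift check in a script, terraform plan can be run with its -detailed-exitcode flag, which exits with 0 when there are no changes, 2 when there are changes, and 1 on error. The sketch below only interprets that exit code; interpret_plan_exit is a hypothetical helper name.

```shell
# Translate the exit code of `terraform plan -detailed-exitcode` into a
# human-readable verdict. Hypothetical helper.
interpret_plan_exit() {
  case "$1" in
    0) echo "no drift" ;;
    2) echo "drift detected" ;;
    *) echo "terraform plan failed"; return 1 ;;
  esac
}

# Example, using the same -var-file and -state arguments as above:
#   terraform plan -detailed-exitcode -var-file=... -state=...
#   interpret_plan_exit $?
```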
How do I check if the Kubespray setup drifted due to manual intervention?¶
At the time of this writing, this cannot be done, but efforts are underway.
How do I check if apps drifted due to manual intervention?¶
# For Management Cluster
./bin/ck8s ops helmfile sc diff # Respond "n" if you get WARN
# For the Workload Clusters
./bin/ck8s ops helmfile wc diff # Respond "n" if you get WARN
Velero backup stuck in progress¶
Velero is known to get stuck InProgress when doing backups:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
velero-daily-backup-20211005143248 InProgress 0 0 2021-10-05 14:32:48 +0200 CEST 29d default !nobackup
First, try to delete the backup:
velero backup delete velero-daily-backup-20211005143248
Then kill all the Pods in the velero namespace:
./bin/ck8s ops kubectl wc delete pods -n velero --all
Check that the backup is gone:
velero backup get
NAME STATUS ERRORS WARNINGS CREATED EXPIRES STORAGE LOCATION SELECTOR
Recreate the backup from a schedule:
velero backup create --from-schedule velero-daily-backup
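To find stuck backups without scanning the table by eye, the listing can be filtered on the STATUS column. This is a small sketch; stuck_backups is a hypothetical helper and it assumes the column layout shown in the velero backup get output above.

```shell
# Print the names of backups whose STATUS column is InProgress.
# Reads `velero backup get` output (with header line) on stdin.
# Hypothetical helper.
stuck_backups() {
  awk 'NR > 1 && $2 == "InProgress" { print $1 }'
}

# Example:
#   velero backup get | stuck_backups
```

A backup that only just started is also InProgress, so check the CREATED column before deleting anything.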
How do I use kubectl and helm directly?¶
This guide makes heavy use of the compliantkubernetes-apps CLI to access and control Welkin clusters. However, you can use kubectl and helm directly, by exporting a KUBECONFIG like so:
export KUBECONFIG=${CK8S_CONFIG_PATH}/.state/kube_config_${CLUSTER}.yaml
kubectl get pods -A
helm list -A
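When switching between sc and wc often, a small guard can prevent pointing kubectl at a kubeconfig that does not exist. The sketch below wraps the export shown above; use_cluster is a hypothetical helper name, and the file path is the one used in this guide.

```shell
# Export KUBECONFIG for the given cluster (sc or wc) after checking that
# the kubeconfig file exists. Hypothetical helper; must be defined in the
# current shell (not run as a script) for the export to take effect.
use_cluster() {
  case "$1" in
    sc|wc) ;;
    *) echo "usage: use_cluster <sc|wc>" >&2; return 1 ;;
  esac
  cfg="${CK8S_CONFIG_PATH}/.state/kube_config_$1.yaml"
  if [ ! -f "$cfg" ]; then
    echo "kubeconfig not found: $cfg" >&2
    return 1
  fi
  export KUBECONFIG="$cfg"
}

# Example:
#   use_cluster wc && kubectl get pods -A
```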