Skip to content

Runbooks

Note

This page describes runbooks for platform administrators. As an application developer, you are most likely looking for Troubleshooting for Application Developers.

Elastisys-self-managed

To get access to the Welkin runbooks, contact Elastisys.

Runbooks document step-by-step processes, ensuring that tasks are performed consistently and correctly, reducing errors and reliance on tribal knowledge. Runbooks are essential for maintaining efficiency, consistency, and reliability in Welkin operations.

Welkin runbooks are searchable and describe what to do:

  • when receiving a change request from an application developer;
  • how to handle an alert.

The remainder of this page illustrates an example of a Welkin runbook for the ThanosCompactHalted alert.


Example: Thanos Compact Halted - PVC

Alert: ThanosCompactHalted

Tags

  • Thanos
  • ThanosCompactHalted
  • PVC

Reason

The Thanos compactor is responsible for downsampling and pushing metrics from its PVC to object storage.

One reason the compactor has halted is that the volume attached to Thanos compactor is full.

Impact

  • Retention rules are not enforced
  • Query performance might degrade
  • Downsampling is not performed

Diagnosis

Start investigating if the PVC for the Thanos compactor is full. This can be easily checked in the Grafana dashboard (Kubernetes/Persistent Volumes).

It is also possible to look at the compactor logs, which should show that there is no space left on the device.

kubectl logs -n thanos deployments/thanos-receiver-compactor | grep halt

ts=2024-07-18T05:53:16.578898999Z caller=compact.go:527 level=error msg="critical error detected; halting" ... 2 errors: preallocate: no space left on device; ..."

Mitigation

To resolve this issue, we need to increase the volume size for the compactor.

Update the sc/mc-config.yaml to increase the PVC for Thanos compactor

     persistence:
       size: 50Gi
   compactor:
      persistence:
-      size: XGi
+      size: YGi
./bin/ck8s ops helmfile sc -l app=thanos diff
./bin/ck8s ops helmfile sc -l app=thanos apply

Wait for a while, then check for halting again.

kubectl logs -n thanos deployments/thanos-receiver-compactor | grep halt