Use Probe to Measure Uptime of Internal Compliant Kubernetes Services

  • Status: accepted
  • Deciders: Cristian, Lucian, Ravi
  • Date: 2021-11-25

Context and Problem Statement

We need to measure uptime for at least two reasons:

  1. To serve as feedback on what needs to be improved next.
  2. To demonstrate compliance with our SLAs.

How exactly should we measure uptime?

Decision Drivers

  • We want to reduce tool sprawl.
  • We want to be mindful about capacity and infrastructure costs.
  • We want to measure uptime as observed by a consumer -- i.e., application or user -- taking into account business continuity measures, such as redundancy, fail-over time, etc.

Considered Options

Decision Outcome

Chosen option: "use Probe for measuring uptime of internal Compliant Kubernetes services", because it measures uptime as observed by a consumer. Although this requires a bit of extra capacity for running Blackbox, the benefits outweigh the costs.

Rather than configuring Blackbox directly, we use Probe, a cleaner abstraction provided by the Prometheus Operator.

The following is an example of a Probe:

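apiVersion: monitoring.coreos.com/v1
kind: Probe
metadata:
  name: google-is-up
  namespace: demo1
  labels:
    probe: google
    release: kube-prometheus-stack
spec:
  interval: 60s
  module: http_2xx
  prober:
    url: blackbox-prometheus-blackbox-exporter.monitoring.svc.cluster.local:9115
  # Add the endpoints to probe under spec.targets.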
This will generate a metric as follows: probe_success{cluster="ckdemo-wc", instance="", job="probe/demo1/google-is-up", namespace="demo1"}.
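As a hedged sketch of how this metric could be turned into an uptime figure: probe_success is 1 when a probe succeeds and 0 when it fails, so averaging it over a time window yields the fraction of successful probes. The snippet below builds such a PromQL expression and sends it to the Prometheus HTTP API. The PROMETHEUS_URL default, the function names and the choice of window are assumptions made for illustration, not part of this decision.

```shell
#!/usr/bin/env bash
# Assumption: Prometheus is reachable at this address (hypothetical default).
PROMETHEUS_URL="${PROMETHEUS_URL:-http://localhost:9090}"

# Build a PromQL expression giving the percentage of successful probes
# for one Probe job over a time window (e.g. 30d).
uptime_query() {
  local job="$1" window="$2"
  printf 'avg_over_time(probe_success{job="%s"}[%s]) * 100' "$job" "$window"
}

# Run the expression as an instant query against the Prometheus HTTP API
# and print the raw JSON response.
query_uptime() {
  curl --silent --fail --get \
    --data-urlencode "query=$(uptime_query "$1" "$2")" \
    "${PROMETHEUS_URL}/api/v1/query"
}
```

For example, query_uptime 'probe/demo1/google-is-up' 30d would request the percentage of successful probes over the last 30 days for the Probe above.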

Positive Consequences

  • We measure uptime as observed by a consumer.
  • Increasing redundancy, reducing fail-over time, etc. will contribute positively to our measured uptime, as desired.

Negative Consequences

  • We don't currently run Blackbox in the workload cluster, so we'll need a bit of extra capacity.

Recommendations to Operators

Blackbox should only be used for measuring uptime of internal services, i.e., those that are only exposed within the Kubernetes cluster. Examples include additional services, such as PostgreSQL, Redis and RabbitMQ.

For external endpoints -- specifically, Dex, Grafana, Kibana, Harbor and Ingress Controllers -- prefer using an external uptime service which integrates with an On-Call Management Tool, e.g., Uptime Cloud Monitor Integration for Opsgenie.

External uptime measurement should achieve a similar effect to the commands below:

curl --head https://dex.$DOMAIN/healthz
curl --include https://harbor.$DOMAIN/api/v2.0/health
curl --head https://grafana.$DOMAIN/healthz
curl --head https://kibana.$DOMAIN/api/status

curl --head some-domain.$DOMAIN/healthz  # Pokes the WC Ingress Controller
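The checks above could be scripted along the following lines. This is a minimal sketch, not a replacement for an external uptime service: the function names and the 10-second timeout are hypothetical, and treating any 2xx response as "up" is an assumption that may need adjusting per endpoint.

```shell
#!/usr/bin/env bash
# Treat any 2xx HTTP status code as "up" (an assumption; adjust as needed).
is_up_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print only the HTTP status code for a URL.
# curl reports 000 when no HTTP response was received.
http_status() {
  curl --silent --output /dev/null --max-time 10 --write-out '%{http_code}' "$1"
}

# Print "up <url>" or "down <url>" for each URL given as an argument.
check_endpoints() {
  local url
  for url in "$@"; do
    if is_up_status "$(http_status "$url")"; then
      echo "up $url"
    else
      echo "down $url"
    fi
  done
}
```

For example: check_endpoints "https://dex.$DOMAIN/healthz" "https://grafana.$DOMAIN/healthz". Note that this must run from outside the cluster to reflect what an external consumer observes.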