Alerts via Alertmanager¶
Welkin includes alerts via Alertmanager.
Important
By default, you will get some platform alerts. This may benefit you, by giving you improved "situational awareness". Please decide if these alerts are of interest to you or not. Feel free to silence them, as the Welkin administrator will take responsibility for them.
Your focus should be on user alerts or application-level alerts, i.e., alerts under the control and responsibility of the Welkin user. We will focus on user alerts in this document.
Compliance needs¶
Many regulations require you to have an incident management process. Alerts help you discover abnormal application behavior that need attention. This maps to ISO 27001 – Annex A.16: Information Security Incident Management.
Configuring user alerts¶
User alerts are configured via the Secret alertmanager-alertmanager
located in the alertmanager
namespace. This configuration file is specified here.
# retrieve the old configuration:
kubectl get -n alertmanager secret alertmanager-alertmanager -o jsonpath='{.data.alertmanager\.yaml}' | base64 -d > alertmanager.yaml
# edit alertmanager.yaml as needed
# patch the new configuration:
kubectl patch -n alertmanager secret alertmanager-alertmanager -p "{\"data\":{\"alertmanager.yaml\":\"$(base64 -w 0 < alertmanager.yaml)\"}}"
# mac users may need to omit -w 0 arguments to base64:
kubectl patch -n alertmanager secret alertmanager-alertmanager -p "{\"data\":{\"alertmanager.yaml\":\"$(base64 < alertmanager.yaml)\"}}"
Make sure to configure and test a receiver for you alerts, e.g., Slack or OpsGenie.
Note
If you get an access denied error, check with your Welkin administrator.
Silencing alerts¶
Welkin comes with a lot of predefined alerts. As a user you might not find all of them relevant and would want to silence/ignore some of them. You can do this by adding new routes in the secret and set receiver: 'null'
. Here is an example that would drop all alerts from the kube-system namespace (alerts with the label namespace=kube-system
):
routes:
- receiver: "null"
matchers:
- namespace = kube-system
You can match any label in the alerts, read more about how the matcher
configuration works in the upstream documentation.
Accessing user Alertmanager¶
If you want to access Alertmanager, for example to confirm that its configuration was picked up correctly, proceed as follows:
- Type:
kubectl proxy
. - Open this link in your browser.
You can configure silences in the UI, but they will not be persisted if Alertmanager is restarted. Use the secret mentioned above instead to create silences that persist.
Configuring alerts¶
Before setting up an alert, you must first collect metrics from your application by setting up either ServiceMonitors or PodMonitors. In general ServiceMonitors are recommended over PodMonitors, and it is the most common way to configure metrics collection.
Then create a PrometheusRule
following the examples below, or the upstream documentation, with an expression that evaluates to the condition to alert on. Prometheus will pick them up, evaluate them, and then send notifications to Alertmanager.
The API reference for the Prometheus Operator describes how the Kubernetes resource is configured, and the configuration reference for Prometheus describes the rules themselves.
In Welkin the Prometheus Operator in the Workload Cluster is configured to pick up all PrometheusRules, regardless in which namespace they are or which labels they have.
Running Example¶
The user demo already includes a PrometheusRule, to configure an alert:
{{- if .Values.prometheusRule.enabled -}}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: {{ include "welkin-user-demo.fullname" . }}
labels:
{{- include "welkin-user-demo.labels" . | nindent 4 }}
spec:
groups:
- name: ./example.rules
rules:
- alert: ApplicationIsActuallyUsed
expr: rate(http_request_duration_seconds_count[1m])>1
{{- end }}
The screenshot below gives an example of the application alert, as seen in Alertmanager.
Detailed example¶
PrometheusRules have two features, either the rules alert based on an expression, or the rules record
based on an expression.
The former is the way to create alerting rules and the latter is a way to precompute complex queries that will be stored as separate metrics:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
labels:
prometheus: example
role: alert-rules
name: prometheus-example-rules
spec:
groups:
- name: ./example.rules
# interval: 30s # optional parameter to configure how often groups of rules are evaluated
rules:
- alert: ExampleAlert
expr: vector(1)
# for: 1m # optional parameter to configure how long an alert must be triggered to be fired
labels:
severity: high
annotations:
summary: "Example Alert has been fired!"
description: "The Example Alert has been fired! It shows the value {{ $value }}."
- record: example_record_metric
expr: vector(1)
labels:
record: example
For alert rules, labels and annotations can be added or overridden, which will then be included in the resulting alert notifications. Furthermore, the annotations support Go Templating, allowing access to the evaluated value via the $value
variable, and all labels from the expression using the $labels
variable.
For recording rules, labels can be added or overridden, which will then be included in the resulting metric.