Configure Prometheus Alert Manager for a Kubernetes Cluster with Custom Alerts
Prerequisites
AWS account with an EC2 instance (t2.medium).
Minikube, kubectl, and Helm installed (the installation steps are covered below).
Basic knowledge of Kubernetes
Update the Package List.
sudo apt update
Install essential tools such as curl, wget, and apt-transport-https.
sudo apt install curl wget apt-transport-https -y
Install Docker, the container runtime that Minikube will use as its driver.
sudo apt install docker.io -y
Add the current user to the Docker group so that Docker commands can be run without sudo.
sudo usermod -aG docker $USER
Adjust permissions for the Docker socket, enabling easier communication with the Docker daemon.
sudo chmod 666 /var/run/docker.sock
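To confirm that Docker commands now work without sudo, a quick sanity check such as the one below can be used (the group change may only take effect after starting a new shell or logging in again).
docker ps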
Check whether the system supports hardware virtualization.
egrep -q 'vmx|svm' /proc/cpuinfo && echo yes || echo no
Install KVM and Related Tools.
sudo apt install qemu-kvm libvirt-clients libvirt-daemon-system bridge-utils virtinst libvirt-daemon
Add User to Virtualization Groups.
sudo adduser $USER libvirt
sudo adduser $USER libvirt-qemu
Reload the group memberships.
newgrp libvirt
newgrp libvirt-qemu
Install Minikube and kubectl
Download the latest Minikube binary.
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
Install it to /usr/local/bin, making it available system-wide.
sudo install minikube-linux-amd64 /usr/local/bin/minikube
Use the minikube version command to confirm the installation.
minikube version
Download the latest version of kubectl (the Kubernetes CLI).
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
Make the kubectl binary executable.
chmod +x ./kubectl
Move it to /usr/local/bin.
sudo mv kubectl /usr/local/bin/
Use the kubectl version command to check the installation.
kubectl version --client --output=yaml
Start Minikube
Start Minikube with Docker as the driver.
minikube start --driver=docker
To check the status of Minikube, run the following command.
minikube status
Install Helm
Download the Helm install script. Helm is a package manager for Kubernetes.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
Make the script executable.
chmod 700 get_helm.sh
Run the script to install Helm.
./get_helm.sh
Check its version to confirm the installation.
helm version
Add Prometheus Helm Chart Repository.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
Update the Helm repositories to fetch the latest charts.
helm repo update
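Optionally, you can confirm that the chart is now available from the repository (the version listed will depend on when you run this).
helm search repo prometheus-community/kube-prometheus-stack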
Configure Prometheus, Alertmanager, and Grafana on the Kubernetes Cluster
Create a custom-values.yaml file to expose the Prometheus, Alertmanager, and Grafana services as NodePort.
nano custom-values.yaml
Add the following configuration.
prometheus:
  service:
    type: NodePort
grafana:
  service:
    type: NodePort
alertmanager:
  service:
    type: NodePort
Deploy the Prometheus, Alertmanager and Grafana stack.
helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f custom-values.yaml
This command will deploy Prometheus, Alertmanager, and Grafana to your cluster with the services exposed as NodePort.
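Before exposing the UIs, it is a good idea to wait until all pods of the stack are in the Running state; the exact pod names depend on the release name (kube-prometheus-stack in this guide).
kubectl get pods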
Access Prometheus, Alertmanager and Grafana
List the services to get NodePort details.
kubectl get services
Forward the Prometheus service to port 9090.
kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-prometheus 9090:9090
Access the Prometheus UI in a web browser at http://<EC2-Public-IP>:9090.
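If the UI does not load, make sure the EC2 security group allows inbound traffic on port 9090. You can also verify from the instance itself that Prometheus is up by querying its standard health endpoint.
curl http://localhost:9090/-/healthy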
Open a new terminal tab and forward the Alertmanager service to port 9093.
kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-alertmanager 9093:9093
Access the Alertmanager UI in a web browser at http://<EC2-Public-IP>:9093.
To view the alerts, click Alerts in the Prometheus UI.
The alerts currently in the firing state are shown here, and the same alerts are also visible in the Alertmanager UI.
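The firing alerts can also be read from the Alertmanager API. For example, while the port-forward is active, the following call (run on the instance) returns the active alerts as JSON.
curl -s http://localhost:9093/api/v2/alerts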
Open another terminal tab and forward the Grafana service to port 3000.
kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-grafana 3000:80
Access the Grafana UI in a web browser at http://<EC2-Public-IP>:3000.
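Grafana will ask for credentials. With the kube-prometheus-stack chart, the user name is admin and the password is stored in a secret named after the release (kube-prometheus-stack-grafana here); adjust the secret name if your release name differs.
kubectl get secret kube-prometheus-stack-grafana -o jsonpath="{.data.admin-password}" | base64 --decode; echo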
Configure Custom Alert Rules
So far we have only seen the default alerts that come preconfigured in Prometheus and Alertmanager.
Next, let's add custom alert rules to monitor our Kubernetes cluster.
Open another terminal tab, then create and define the alert rules in a custom-alert-rules.yaml file.
nano custom-alert-rules.yaml
Add the following content to it.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    release: kube-prometheus-stack
  name: kube-pod-not-ready
spec:
  groups:
    - name: my-pod-demo-rules
      rules:
        - alert: KubernetesPodNotHealthy
          expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: KubernetesDaemonsetRolloutStuck
          expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
            description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: ContainerHighCpuUtilization
          expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Container High CPU utilization (instance {{ $labels.instance }})
            description: "Container CPU utilization is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: ContainerHighMemoryUsage
          expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Container High Memory usage (instance {{ $labels.instance }})
            description: "Container Memory usage is above 80%\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: KubernetesContainerOomKiller
          expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
          for: 0m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
            description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
        - alert: KubernetesPodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
            description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Apply the custom alert rules to the Kubernetes cluster.
kubectl apply -f custom-alert-rules.yaml
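You can confirm that the rule object was created (PrometheusRule is a custom resource installed by the kube-prometheus-stack CRDs).
kubectl get prometheusrules kube-pod-not-ready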
Next, visit the Prometheus UI again (refresh the page) to see the custom alert rules.
As you can see, the new custom alert rules have been added successfully; Prometheus now evaluates them and forwards any firing alerts to Alertmanager.
Deploy a Test Application
Now, to check that our custom alert rules are working, we'll create an application with a wrong image tag.
Deploy an Nginx pod to test monitoring and alerting.
kubectl run nginx-pod --image=nginx:lates3
The correct tag is latest, but we have deliberately written lates3 to trigger the alert.
Check the status of the pod.
kubectl get pods nginx-pod
As you can see, the pod is not ready and shows an ImagePullBackOff error.
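For more detail on why the image pull failed, inspect the pod events.
kubectl describe pod nginx-pod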
This will trigger the KubernetesPodNotHealthy alert, since the pod is stuck in a non-running state that matches our custom rule.
Go to the Prometheus and Alertmanager UIs, refresh the pages, and look for the alert related to the pod failure. It should appear in both interfaces.
Understanding Custom Alert Rules:
Key Components of an Alert Rule:
Name: Unique alert identifier.
Expression (expr): PromQL condition.
For: Time the condition must persist before the alert triggers.
Labels: Metadata to categorize alerts (e.g., severity).
Annotations: Dynamic info to help diagnose the issue.
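To make these components concrete, here is a minimal, hypothetical rule group (not one of the rules deployed above) with each part annotated; the expression up == 0 simply fires when a scrape target is down.
groups:
  - name: example-demo-rules
    rules:
      - alert: TargetDown                 # Name: unique alert identifier
        expr: up == 0                     # Expression (expr): PromQL condition
        for: 5m                           # For: condition must hold this long before firing
        labels:
          severity: warning               # Labels: metadata used to categorize and route the alert
        annotations:
          summary: "Target down ({{ $labels.instance }})"  # Annotations: dynamic diagnostic info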