Configure Prometheus Alert Manager for a Kubernetes Cluster with Custom Alerts

Prerequisites

  • An AWS account with an EC2 instance (t2.medium)

  • Minikube, kubectl, and Helm (installed in the steps below)

  • Basic knowledge of Kubernetes

Update the Package List.

sudo apt update

Install essential tools: curl, wget, and apt-transport-https.

sudo apt install curl wget apt-transport-https -y

Install Docker, the container runtime that Minikube will use as its driver.

sudo apt install docker.io -y

Add the current user to the docker group so Docker commands can be run without sudo. Note that group membership only takes effect after logging out and back in (or running newgrp docker).

sudo usermod -aG docker $USER

Loosen the permissions on the Docker socket so any local user can talk to the Docker daemon. This is convenient for a throwaway lab instance, but too permissive for production systems.

sudo chmod 666 /var/run/docker.sock

Check whether the CPU supports hardware virtualization.

egrep -q 'vmx|svm' /proc/cpuinfo && echo yes || echo no
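The one-liner above can be unpacked into a slightly more verbose sketch (not part of the original steps) that also reports how many CPU entries advertise the vmx (Intel VT-x) or svm (AMD-V) flag:

```shell
# Count /proc/cpuinfo entries carrying a hardware-virtualization flag.
# A count of 0 means KVM will not work on this machine.
vcores=$(grep -Ec 'vmx|svm' /proc/cpuinfo 2>/dev/null || true)
vcores=${vcores:-0}
if [ "$vcores" -gt 0 ]; then
  echo "virtualization supported ($vcores flagged CPU entries)"
else
  echo "no hardware virtualization detected"
fi
```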

Install KVM and Related Tools. (This guide ultimately starts Minikube with the Docker driver, so the KVM packages are optional.)

sudo apt install qemu-kvm libvirt-clients libvirt-daemon-system bridge-utils virtinst libvirt-daemon -y

Add User to Virtualization Groups.

sudo adduser $USER libvirt
sudo adduser $USER libvirt-qemu

Reload the group membership. (Note that each newgrp command starts a new shell.)

newgrp libvirt
newgrp libvirt-qemu

Install Minikube and kubectl

Download the latest Minikube binary.

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64

Install it to /usr/local/bin, making it available system-wide.

sudo install minikube-linux-amd64 /usr/local/bin/minikube

Run the minikube version command to confirm the installation.

minikube version

Download the latest version of kubectl (Kubernetes CLI).

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"

Make the kubectl binary executable.

chmod +x ./kubectl

Move it to /usr/local/bin to make it available system-wide.

sudo mv kubectl /usr/local/bin/

Run the kubectl version command to check the installation.

kubectl version --client --output=yaml

Start Minikube

Start Minikube with Docker as the driver.

minikube start --driver=docker

To check the status of Minikube, run the following command.

minikube status

Install Helm

Download the Helm installation script. Helm is a package manager for Kubernetes.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3

Make the script executable.

chmod 700 get_helm.sh

Run the script to install Helm.

./get_helm.sh

Check its version to confirm the installation.

helm version

Add Prometheus Helm Chart Repository.

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Update the Helm repositories to fetch the latest charts.

helm repo update

Configure Prometheus, Alertmanager and Grafana on Kubernetes Cluster

Create a custom-values.yaml file to expose the Prometheus, Alertmanager, and Grafana services as NodePort.

nano custom-values.yaml

Add the following configuration.

prometheus:
  service:
    type: NodePort
grafana:
  service:
    type: NodePort
alertmanager:
  service:
    type: NodePort

Deploy the Prometheus, Alertmanager and Grafana stack.

helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack -f custom-values.yaml

This command will deploy Prometheus, Alertmanager, and Grafana to your cluster with the services exposed as NodePort.
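As a quick sanity check (not part of the original walkthrough), you can wait for the stack's pods to become Ready before moving on. The label selector below assumes the release name kube-prometheus-stack used above and may not match every pod in the stack; the snippet is guarded so it degrades gracefully where kubectl is unavailable:

```shell
# Wait up to 5 minutes for pods of the release to report Ready, then list them.
if command -v kubectl >/dev/null 2>&1; then
  kubectl wait --for=condition=Ready pods \
    -l "release=kube-prometheus-stack" --timeout=300s
  kubectl get pods
  checked=yes
else
  echo "kubectl not found; run this on the Minikube host"
  checked=no
fi
```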

Access Prometheus, Alertmanager and Grafana

List the services to get NodePort details.

kubectl get services

Forward the Prometheus service to port 9090.

kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-prometheus 9090:9090

Access the UI of Prometheus on web browser using http://<EC2-Public-IP>:9090.

Open a new terminal tab and forward the Alertmanager service to port 9093.

kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-alertmanager 9093:9093

Access the UI of Alertmanager on web browser using http://<EC2-Public-IP>:9093.

To view the alerts, click Alerts in the Prometheus UI.

Alerts that are in the firing state are listed there, and the same alerts also show up in the Alertmanager UI.

Open a new terminal tab and forward the Grafana service to port 3000 (the service itself listens on port 80).

kubectl port-forward --address 0.0.0.0 svc/kube-prometheus-stack-grafana 3000:80

Access the UI of Grafana on web browser using http://<EC2-Public-IP>:3000.
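The chart generates Grafana's admin password and stores it in a Kubernetes Secret. A hedged sketch for retrieving it (the Secret name kube-prometheus-stack-grafana follows from the release name used above; the default username is admin):

```shell
# Decode the auto-generated Grafana admin password from its Secret.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get secret kube-prometheus-stack-grafana \
    -o jsonpath='{.data.admin-password}' | base64 -d
  echo
  fetched=yes
else
  echo "kubectl not found; run this on the Minikube host"
  fetched=no
fi
```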

Configure Custom Alert Rules

So far we have only seen the default alerts that ship with the kube-prometheus-stack chart.

Next, let's add custom alert rules to monitor our Kubernetes cluster.

Open another terminal tab and define the alert rules in a custom-alert-rules.yaml file.

nano custom-alert-rules.yaml

Add the following content to it.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack
    release: kube-prometheus-stack
  name: kube-pod-not-ready
spec:
  groups:
  - name: my-pod-demo-rules
    rules:
    - alert: KubernetesPodNotHealthy
      expr: sum by (namespace, pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"}) > 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-running state for longer than 1 minute.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    - alert: KubernetesDaemonsetRolloutStuck
      expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100 or kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }})
        description: "Some Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled or not ready\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    - alert: ContainerHighCpuUtilization
      expr: (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (pod, container) / sum(container_spec_cpu_quota{container!=""}/container_spec_cpu_period{container!=""}) by (pod, container) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High CPU utilization (instance {{ $labels.instance }})
        description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    - alert: ContainerHighMemoryUsage
      expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Container High Memory usage (instance {{ $labels.instance }})
        description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    - alert: KubernetesContainerOomKiller
      expr: (kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 10m >= 1) and ignoring (reason) min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[10m]) == 1
      for: 0m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes Container oom killer (instance {{ $labels.instance }})
        description: "Container {{ $labels.container }} in pod {{ $labels.namespace }}/{{ $labels.pod }} has been OOMKilled {{ $value }} times in the last 10 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
    - alert: KubernetesPodCrashLooping
      expr: increase(kube_pod_container_status_restarts_total[1m]) > 3
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: Kubernetes pod crash looping (instance {{ $labels.instance }})
        description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Apply the custom alert rules to the Kubernetes cluster.

kubectl apply -f custom-alert-rules.yaml

Next, visit the Prometheus UI again (refresh the page) to see the custom alert rules.

As you can see, the new custom alert rules have been loaded into Prometheus and will be routed through Alertmanager.
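You can also confirm from the command line that the operator picked up the new rule object (the resource name kube-pod-not-ready and group name my-pod-demo-rules come from the manifest above; the snippet is guarded in case kubectl is unavailable):

```shell
# List PrometheusRule objects; the custom rule should appear alongside
# the defaults installed by the chart.
if command -v kubectl >/dev/null 2>&1; then
  kubectl get prometheusrules
  # Print the first rule group's name from the custom object.
  kubectl get prometheusrule kube-pod-not-ready \
    -o jsonpath='{.spec.groups[0].name}'
  echo
  verified=yes
else
  echo "kubectl not found; run this on the Minikube host"
  verified=no
fi
```

If the rule was applied as shown earlier, the jsonpath query should print the group name my-pod-demo-rules.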

Deploy a Test Application

Now, to check that our custom alert rules are working, we'll deploy an application with an incorrect image tag.

Deploy an Nginx pod to test monitoring and alerting.

kubectl run nginx-pod --image=nginx:lates3

The correct tag is latest, but we have deliberately written lates3 to trigger the alert.

Check the status of the pod.

kubectl get pods nginx-pod

As you can see, the pod is not ready and reports an ImagePullBackOff error.
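To see why the pod is stuck, inspect its events; a guarded sketch (the Events section typically shows the failed pull of nginx:lates3):

```shell
# Print only the Events section of the pod description.
if command -v kubectl >/dev/null 2>&1; then
  kubectl describe pod nginx-pod | sed -n '/^Events:/,$p'
  inspected=yes
else
  echo "kubectl not found; run this on the Minikube host"
  inspected=no
fi
```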

This failure will trigger the KubernetesPodNotHealthy alert defined in our custom rules, since the pod stays in the Pending phase.

Go to the Prometheus and Alertmanager UIs, refresh the pages, and look for the alert related to the pod failure; after the 1-minute for duration it should appear in both interfaces.
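Besides the UI, Prometheus exposes the same information over its HTTP API at /api/v1/alerts. With the port-forward from earlier still running, a guarded sketch using curl:

```shell
# Query currently active alerts through the Prometheus HTTP API and
# check for the custom KubernetesPodNotHealthy alert by name.
if curl -sf http://localhost:9090/api/v1/alerts >/tmp/alerts.json 2>/dev/null; then
  grep -o 'KubernetesPodNotHealthy' /tmp/alerts.json | head -n 1
  reachable=yes
else
  echo "Prometheus not reachable on localhost:9090; is the port-forward running?"
  reachable=no
fi
```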

Understanding Custom Alert Rules:

Key Components of an Alert Rule:

  • Name: Unique alert identifier.

  • Expression (expr): PromQL condition.

  • For: How long the expression must stay true before the alert fires.

  • Labels: Metadata to categorize alerts (e.g., severity).

  • Annotations: Dynamic info to help diagnose the issue.
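Putting those components together, here is a minimal annotated skeleton of a rule group (illustrative only; the alert name, expression, and threshold are placeholders):

```yaml
groups:
- name: example-rules            # logical grouping of related rules
  rules:
  - alert: ExampleAlertName      # Name: unique alert identifier
    expr: up == 0                # Expression: PromQL condition to evaluate
    for: 5m                      # For: condition must hold this long before firing
    labels:
      severity: warning          # Labels: metadata used for routing/categorization
    annotations:                 # Annotations: human-readable diagnostic context
      summary: "Target down (instance {{ $labels.instance }})"
```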