Introduction
You have a new or existing Kubernetes cluster and want to set up observability on it. A popular open source solution is Prometheus with Grafana, which the kube-prometheus-stack Helm chart installs together.
Prerequisites
- Kubernetes API access
- Kubeconfig
- Helm
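The prerequisites above can be checked before starting. The following is a minimal sketch, assuming a POSIX shell and a single kubeconfig file at either `$KUBECONFIG` or the default `~/.kube/config`; the variable names `missing` and `kubeconfig` are only illustrative.

```shell
# Collect any required CLI tools that are not on PATH.
missing=""
for tool in kubectl helm; do
  if ! command -v "$tool" >/dev/null 2>&1; then
    missing="$missing $tool"
  fi
done

if [ -n "$missing" ]; then
  echo "Missing tools:$missing"
else
  echo "kubectl and helm are installed"
fi

# Assumption: a single kubeconfig path, not a colon-separated list.
kubeconfig="${KUBECONFIG:-$HOME/.kube/config}"
if [ -f "$kubeconfig" ]; then
  echo "Found kubeconfig at $kubeconfig"
else
  echo "No kubeconfig found at $kubeconfig"
fi
```

Once both tools and the kubeconfig are in place, running `kubectl cluster-info` confirms that the Kubernetes API is reachable.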
Step-by-Step Instructions
- Step 1: Add the Helm repo, or update it if you already have it
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
- Step 2: Install the kube-prometheus-stack Helm chart
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack -n kube-prometheus-stack --create-namespace
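- Step 3: Expose the Grafana UI through a Kubernetes service. Replace <service-type> with the service type that suits your cluster, for example NodePort or LoadBalancer:
kubectl -n kube-prometheus-stack expose deploy/kube-prometheus-stack-grafana --name grafana-np --type <service-type>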
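- Step 4: To make NVIDIA Data Center GPU Manager (DCGM) metrics available in Grafana, label the DCGM exporter ServiceMonitor so that the Prometheus instance installed by this chart scrapes it (with the default chart values, Prometheus only selects ServiceMonitors that carry the release label):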
kubectl -n nvidia-gpu-operator label servicemonitor nvidia-dcgm-exporter release=kube-prometheus-stack
- Step 5: Import the DCGM Grafana dashboard into your Grafana instance:
https://grafana.com/grafana/dashboards/12239-nvidia-dcgm-exporter-dashboard/
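The results of Steps 2 to 4 can be inspected with a few read-only `kubectl` commands. The following is a minimal sketch, assuming the release name, namespaces, and service name used in the steps above; `verify_stack` is a hypothetical helper name, not part of any tool.

```shell
# Hypothetical helper: lists the objects created in Steps 2 to 4.
# Assumes the names and namespaces used in the steps above.
verify_stack() {
  # Step 2: list the pods created by the chart.
  kubectl -n kube-prometheus-stack get pods || return 1

  # Step 3: show the exposed Grafana service and its ports.
  kubectl -n kube-prometheus-stack get svc grafana-np || return 1

  # Step 4: show the ServiceMonitor together with its labels.
  kubectl -n nvidia-gpu-operator get servicemonitor nvidia-dcgm-exporter --show-labels
}
```

Calling `verify_stack` after Step 4 should list the chart's pods, the `grafana-np` service, and a ServiceMonitor whose labels include `release=kube-prometheus-stack`.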
Additional Resources
- Configure the username and password for the Grafana chart by creating a custom values file:
https://github.com/prometheus-community/helm-charts/blob/ae248069a3c9aac30262eb5d6f93a12db52fb065/charts/kube-prometheus-stack/values.yaml#L1246
- Grafana installs the default Kubernetes dashboards, but you can import custom dashboards provided here https://grafana.com/grafana/dashboards/
Example: https://grafana.com/grafana/dashboards/15661-k8s-dashboard-en-20250125/
- To keep the metrics and dashboards persistent, create a persistent volume for both Prometheus and Grafana:
https://github.com/prometheus-community/helm-charts/blob/ae248069a3c9aac30262eb5d6f93a12db52fb065/charts/kube-prometheus-stack/values.yaml#L4251
https://github.com/prometheus-community/helm-charts/blob/ae248069a3c9aac30262eb5d6f93a12db52fb065/charts/kube-prometheus-stack/values.yaml#L1292
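The credential and persistence settings above can be combined in one custom values file. The following is a sketch, not a tested configuration: the file name `custom-values.yaml`, the credentials, and the `10Gi` and `50Gi` sizes are placeholders, and the key names should be checked against the linked values.yaml for your chart version. Because no storage class is named, it assumes the cluster has a default StorageClass.

```shell
# Write a custom values file (hypothetical name: custom-values.yaml).
cat > custom-values.yaml <<'EOF'
grafana:
  # Placeholder credentials; replace before use.
  adminUser: admin
  adminPassword: change-me
  # Keep dashboards and settings across pod restarts.
  persistence:
    enabled: true
    size: 10Gi
prometheus:
  prometheusSpec:
    # Keep metrics across pod restarts.
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
EOF
```

The file could then be applied to the existing release with `helm upgrade kube-prometheus-stack prometheus-community/kube-prometheus-stack -n kube-prometheus-stack -f custom-values.yaml`.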