How To Fix Missing GPU Metrics (DCGM) on CMK Nodes

Introduction

Some GPU nodes in a Crusoe Managed Kubernetes (CMK) cluster may stop reporting DCGM metrics (such as DCGM_FI_DEV_GPU_UTIL) to Grafana. This is caused by a known startup race condition in the crusoe-watch-agent DaemonSet, between the Vector metrics collector and its config-reloader sidecar.

This issue can be triggered by any of the following scenarios:

Deploying the crusoe-watch-agent monitoring solution: initial install on a cluster
Deploying the grafana-cmk Grafana dashboard: this may trigger a watch-agent rollout
Rolling restart of the watch-agent DaemonSet: e.g., kubectl rollout restart daemonset/crusoe-watch-agent -n crusoe-system
New nodes joining a cluster: nodepool scale-up or node replacement
Pod rescheduling: due to node drain, eviction, or OOM kill

In each case, the config-reloader sidecar writes the DCGM scrape configuration to /etc/vector/vector.yaml within ~500ms of Vector starting its file watcher. Vector misses the inotify event, never reloads, and never begins scraping DCGM metrics from that pod.

This guide walks you through identifying affected nodes, applying the fix, and verifying that full GPU metrics coverage is restored.

Prerequisites

kubectl access to the affected CMK cluster
Permissions to exec into pods in the crusoe-system namespace
Crusoe Watch Agent and Grafana CMK solutions deployed on the cluster
Access to the cluster's Grafana (to confirm metrics are flowing)

Instructions

Step 1: Confirm the Problem

Open the cluster's Grafana dashboard and check for gaps in GPU metrics. If only a subset of nodes is reporting DCGM_FI_DEV_GPU_UTIL while the others show no data at all, the race condition is likely the cause.

You can also run the following to see which watch-agent pods have DCGM scrape targets loaded:

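# For each watch-agent pod, count the lines in the Vector config that mention dcgm.
# A pod where the exec itself fails also prints 0, because stderr is discarded.
kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name | \
  while read -r pod; do
    echo "--- $pod ---"
    kubectl exec -n crusoe-system "$pod" -c vector -- \
      cat /etc/vector/vector.yaml 2>/dev/null | grep -c dcgm
  done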

Pods printing 0 have hit the race condition. Vector never picked up the DCGM config.
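On a large cluster it can help to emit one line per pod and filter for the zero counts, rather than scanning the output by eye. The following is a minimal sketch of that idea, assuming the same label selector, container name, and config path as the loop above. The helper names `dcgm_count_per_pod` and `pods_with_zero` and the `KUBECTL` override variable are hypothetical and are introduced here only for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: the function names and the KUBECTL override are
# assumptions made for this sketch, not part of the watch-agent tooling.
KUBECTL="${KUBECTL:-kubectl}"

# Print "<pod> <count>" for each watch-agent pod, where count is the number
# of lines in /etc/vector/vector.yaml that mention dcgm. A failed exec also
# yields a count of 0, since its stderr is discarded.
dcgm_count_per_pod() {
  $KUBECTL get pods -n crusoe-system \
    -l app.kubernetes.io/name=crusoe-watch-agent -o name |
  while read -r pod; do
    count=$($KUBECTL exec -n crusoe-system "$pod" -c vector -- \
      cat /etc/vector/vector.yaml 2>/dev/null </dev/null | grep -c dcgm)
    echo "$pod $count"
  done
}

# Read "<pod> <count>" lines on stdin and print the pods whose count is 0.
pods_with_zero() {
  awk '$2 == 0 { print $1 }'
}
```

A possible invocation is `dcgm_count_per_pod | pods_with_zero > stuck_pods.txt`, where `stuck_pods.txt` is an arbitrary file name chosen for this example.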

Step 2: Identify Stuck Pods

List all watch-agent pods and compare against the Grafana gaps:

kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o wide

Cross-reference the NODE column with the nodes missing metrics in Grafana. Every stuck pod corresponds to a node with no DCGM data.
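The cross-reference can also be done mechanically. The sketch below joins a list of suspect pods against a pod-to-node map. It assumes the map is produced with kubectl's `custom-columns` output (shown in the trailing comment) rather than parsed from `-o wide`, whose column positions can shift. The function name `nodes_for_pods` and both input file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: the name and the two file arguments are assumptions.
# $1: file listing suspect pods, one per line (a leading "pod/" is tolerated)
# $2: file of "<pod> <node>" pairs, whitespace separated
# Prints the distinct nodes that host the listed pods.
nodes_for_pods() {
  awk '
    NR == FNR { sub(/^pod\//, "", $1); stuck[$1] = 1; next }
    ($1 in stuck) { print $2 }
  ' "$1" "$2" | sort -u
}

# Example (not run here): build the pod-to-node map, then join.
#   kubectl get pods -n crusoe-system \
#     -l app.kubernetes.io/name=crusoe-watch-agent --no-headers \
#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' > pod_nodes.txt
#   nodes_for_pods stuck_pods.txt pod_nodes.txt
```

The resulting node list should match the nodes that show no DCGM data in Grafana.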

Step 3: Apply the Fix

Run the following command to touch the config file on all watch-agent pods. This triggers a fresh inotify event that Vector's file watcher will catch, causing it to reload and pick up the DCGM scrape configuration:

kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name | \
  xargs -I{} kubectl exec -n crusoe-system {} -c vector-config-reloader -- \
  touch /etc/vector/vector.yaml

This is safe to run on all pods. Pods that are already scraping DCGM metrics will simply reload with the same config.
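If you would rather touch only the pods identified earlier, or preview the commands before running them, a small wrapper can help. This is a sketch under stated assumptions: the function name `touch_vector_config` and the `KUBECTL` override are hypothetical, while the namespace, container name, and file path are taken from the command above.

```shell
#!/bin/sh
# Hypothetical wrapper: the function name and the KUBECTL override are
# assumptions made for this sketch.
KUBECTL="${KUBECTL:-kubectl}"

# Read pod names on stdin (with or without a "pod/" prefix) and touch the
# Vector config from each pod's config-reloader container.
touch_vector_config() {
  while read -r pod; do
    [ -n "$pod" ] || continue          # skip blank lines
    $KUBECTL exec -n crusoe-system "${pod#pod/}" \
      -c vector-config-reloader -- touch /etc/vector/vector.yaml </dev/null \
      || echo "failed: $pod" >&2       # report the error and keep going
  done
}
```

Setting `KUBECTL="echo kubectl"` before calling `touch_vector_config < stuck_pods.txt` prints the commands instead of executing them, which serves as a dry run. The file name `stuck_pods.txt` is only an example.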

Step 4: Verify the Fix

Wait 2–3 minutes for metrics to begin flowing, then check Grafana again. All GPU nodes should now be reporting DCGM metrics.

You can also re-run the check from Step 1 to confirm all pods now have DCGM config loaded (output should be >0 for every pod).

ℹ️ Note: If the total time series count is less than your total node count, check whether some nodes are CPU-only. Such nodes have no DCGM exporter and won't report GPU metrics. You can verify with:

kubectl get nodes -l 'nvidia.com/gpu.deploy.dcgm-exporter!=true' \
  -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type'
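To see exactly which GPU nodes are still missing, you can compare the nodes expected to run a DCGM exporter with the nodes that appear in Grafana. The sketch below assumes two plain-text files with one node name per line. How the list of reporting nodes is exported from Grafana is left open, and the function name `missing_nodes` and both file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: the name and the input files are assumptions.
# $1: file of nodes expected to report DCGM metrics (one per line)
# $2: file of nodes currently reporting DCGM metrics (one per line)
# Prints the expected nodes that are absent from the reporting list.
missing_nodes() {
  sort -u "$1" | awk -v rep="$2" '
    BEGIN { while ((getline line < rep) > 0) seen[line] = 1 }
    !($0 in seen)
  '
}

# Example (not run here): list the nodes labelled for the DCGM exporter.
#   kubectl get nodes -l 'nvidia.com/gpu.deploy.dcgm-exporter=true' -o name \
#     | sed 's|^node/||' > expected.txt
#   missing_nodes expected.txt reporting.txt
```

An empty result suggests that every node labelled for the DCGM exporter is reporting.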

Example

Symptom: After deploying the grafana-cmk solution on a cluster, only a subset of GPU nodes report DCGM_FI_DEV_GPU_UTIL in Grafana.

Checking the DaemonSet:

$ kubectl get daemonset crusoe-watch-agent -n crusoe-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
crusoe-watch-agent   356       356       356     356          356         <none>          3d5h

All pods are running. The issue isn't pod health; it's that Vector never loaded the DCGM scrape config.

Identifying stuck vs. working pods:

$ kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
  | xargs -P20 -I{} sh -c \
    'if kubectl logs -n crusoe-system {} -c vector --tail=50 2>/dev/null \
       | grep -q "Vector has reloaded"; then echo "OK"; else echo "STUCK"; fi' \
  | sort | uniq -c

 175 OK
 181 STUCK

What the config-reloader logs look like on a STUCK pod:

{"timestamp": "...", "message": "Initial pod set: [('nvidia-dcgm-exporter-xxxxx', 'dcgm_exporter')]"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: {'dcgm_exporter': 1}"}
{"timestamp": "...", "message": "Reconcile: no changes detected (pods=1, cycle_ms=101)."}
{"timestamp": "...", "message": "Reconcile: no changes detected (pods=1, cycle_ms=89)."}

The config-reloader found the DCGM exporter immediately and wrote the config once. Vector missed this single write event, and since the config-reloader sees "no changes" on every subsequent cycle, it never writes again.

What it looks like on a working pod:

{"timestamp": "...", "message": "Initial pod set: []"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: none"}
{"timestamp": "...", "message": "Pod set diff: added=[('nvidia-dcgm-exporter-yyyyy', 'dcgm_exporter')]"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: {'dcgm_exporter': 1}"}

The DCGM exporter wasn't ready on the first cycle, so the config-reloader wrote an empty config initially. On the next cycle, it discovered the exporter and wrote the full config. Vector's file watcher was fully initialised by then and caught the second write.
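The difference between the two log patterns can be turned into a rough heuristic: a pod whose config-reloader wrote a DCGM config exactly once is a candidate for the race, while a pod with a later rewrite is more likely healthy. The sketch below relies only on the sample messages shown above, so the matched strings are assumptions about the log format, and the function name `classify_reloader_log` is hypothetical. An AT_RISK result does not prove the pod is stuck, since Vector may still have caught the single write.

```shell
#!/bin/sh
# Hypothetical heuristic based only on the sample log lines above; the
# function name and the matched message strings are assumptions.
# Reads config-reloader logs on stdin and prints one of:
#   UNKNOWN    no "Vector config written" line was seen
#   NO_DCGM    the most recent write contained no dcgm_exporter target
#   AT_RISK    a DCGM config was written exactly once
#   REWRITTEN  the config was written more than once, ending with DCGM
classify_reloader_log() {
  awk '
    /Vector config written/ { writes++; last = $0 }
    END {
      if (writes == 0)                  print "UNKNOWN"
      else if (last !~ /dcgm_exporter/) print "NO_DCGM"
      else if (writes == 1)             print "AT_RISK"
      else                              print "REWRITTEN"
    }
  '
}
```

One way to use it is to pipe `kubectl logs -n crusoe-system <pod> -c vector-config-reloader` into `classify_reloader_log`, where `<pod>` is a placeholder for a watch-agent pod name.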

Applying the fix:

$ kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name \
  | xargs -I{} kubectl exec -n crusoe-system {} -c vector-config-reloader \
    -- touch /etc/vector/vector.yaml

After running the command, all nodes reported DCGM metrics in Grafana within a few minutes.
