Introduction
Some GPU nodes in a Crusoe Managed Kubernetes (CMK) cluster may stop reporting DCGM metrics (such as DCGM_FI_DEV_GPU_UTIL) to Grafana. This is caused by a known startup race condition in the crusoe-watch-agent DaemonSet, between the Vector metrics collector and its config-reloader sidecar.
This issue can be triggered by any of the following scenarios:
- Deploying the crusoe-watch-agent monitoring solution: initial install on a cluster
- Deploying the grafana-cmk Grafana dashboard: this may trigger a watch-agent rollout
- Rolling restart of the watch-agent DaemonSet: e.g., kubectl rollout restart daemonset/crusoe-watch-agent -n crusoe-system
- New nodes joining a cluster: nodepool scale-up or node replacement
- Pod rescheduling: due to node drain, eviction, or OOM kill
In each case, the config-reloader sidecar writes the DCGM scrape configuration to /etc/vector/vector.yaml within ~500ms of Vector starting its file watcher, before the watcher is fully initialised. Vector misses the inotify event, never reloads, and never begins scraping DCGM metrics from that pod.
This guide walks you through identifying affected nodes, applying the fix, and verifying that full GPU metrics coverage is restored.
Prerequisites
- kubectl Access to the Affected CMK Cluster
- Permissions to Exec Into Pods in the crusoe-system Namespace
- Crusoe Watch Agent and Grafana CMK Solutions Deployed on the Cluster
- Access to the Cluster's Grafana (to Confirm Metrics Are Flowing)
Instructions
Step 1: Confirm the Problem
Open the cluster's Grafana dashboard and check for gaps in GPU metrics. If only some nodes report DCGM_FI_DEV_GPU_UTIL while others show no data at all, the race condition is likely the cause.
You can also run the following to see which watch-agent pods have DCGM scrape targets loaded:
kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name | \
  while read pod; do
    echo "--- $pod ---"
    kubectl exec -n crusoe-system "$pod" -c vector -- \
      cat /etc/vector/vector.yaml 2>/dev/null | grep -c dcgm
  done
Pods printing 0 have hit the race condition. Vector never picked up the DCGM config.
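On a large cluster, the per-pod output of this check is easier to act on when reduced to just the pods that printed 0. The following is a minimal sketch, assuming the output format of the loop above (a "--- pod ---" header line followed by a single count line). The name zero_count_pods is a hypothetical helper, not part of any Crusoe tooling.

```shell
# zero_count_pods: hypothetical helper (not Crusoe tooling).
# Reads the Step 1 loop output on stdin and prints the name of every pod
# whose count line is exactly 0.
#
# Usage: pipe the output of the Step 1 loop into it, e.g.
#   <Step 1 loop> | zero_count_pods
zero_count_pods() {
  awk '
    /^--- .* ---$/         { pod = $2; next }   # header line: remember the pod name
    pod != "" && $0 == "0" { print pod }        # count line of 0: report the pod
                           { pod = "" }         # any other line ends the record
  '
}
```

The printed names keep the pod/ prefix that kubectl emits with -o name, so they can be passed straight back to kubectl exec or kubectl logs.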
Step 2: Identify Stuck Pods
List all watch-agent pods and compare against the Grafana gaps:
kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o wide
Cross-reference the NODE column with the nodes missing metrics in Grafana. Every stuck pod corresponds to a node with no DCGM data.
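The cross-reference can also be done mechanically. The sketch below assumes you have saved the stuck pod names to a file, one per line, and that the pod-to-node mapping is supplied as two whitespace-separated columns. The function name nodes_for_pods and the file name stuck-pods.txt are hypothetical; the custom-columns and --no-headers options shown in the usage comment are standard kubectl output flags.

```shell
# nodes_for_pods: hypothetical helper (not Crusoe tooling).
# $1    : file listing pod names, one per line (a leading "pod/" is ignored)
# stdin : "POD NODE" pairs, one per line
# Prints the node of every pod that appears in the list.
#
# Usage (stuck-pods.txt is an assumed file name):
#   kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent \
#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' --no-headers \
#     | nodes_for_pods stuck-pods.txt
nodes_for_pods() {
  awk '
    # First input (the pod list): strip the optional prefix and remember the name.
    NR == FNR { sub(/^pod\//, "", $1); if ($1 != "") want[$1] = 1; next }
    # Second input (stdin): print the node for each wanted pod.
    ($1 in want) { print $2 }
  ' "$1" -
}
```

The resulting node names are the ones to compare against the gaps in Grafana.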
Step 3: Apply the Fix
Run the following command to touch the config file on all watch-agent pods. This triggers a fresh inotify event that Vector's file watcher will catch, causing it to reload and pick up the DCGM scrape configuration:
kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name | \
  xargs -I{} kubectl exec -n crusoe-system {} -c vector-config-reloader -- \
    touch /etc/vector/vector.yaml
This is safe to run on all pods: pods that are already scraping DCGM metrics will simply reload with the same config.
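If you prefer to touch only specific pods, or to preview the commands first, the same operation can be wrapped in a small function. This is a sketch under stated assumptions: pod names arrive on stdin, one per line, and the KUBECTL override variable and the function name touch_vector_config are illustrative conventions introduced here, not part of the watch-agent or of kubectl.

```shell
# touch_vector_config: hypothetical helper (not Crusoe tooling).
# Reads pod names on stdin and runs the touch from this step in each pod.
# Set KUBECTL="echo kubectl" to print the commands instead of running them.
touch_vector_config() {
  while IFS= read -r pod; do
    [ -n "$pod" ] || continue   # skip blank lines
    # </dev/null keeps the command from consuming the pod list on stdin
    ${KUBECTL:-kubectl} exec -n crusoe-system "$pod" -c vector-config-reloader -- \
      touch /etc/vector/vector.yaml </dev/null
  done
}

# Preview for two example pod names (placeholders, not real pods):
#   printf 'pod/agent-a\npod/agent-b\n' | KUBECTL="echo kubectl" touch_vector_config
```

Running it without the KUBECTL override executes the same touch command as above, one pod at a time.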
Step 4: Verify the Fix
Wait 2–3 minutes for metrics to begin flowing, then check Grafana again. All GPU nodes should now be reporting DCGM metrics.
You can also re-run the check from Step 1 to confirm all pods now have DCGM config loaded (output should be >0 for every pod).
ℹ️ Note: If the total time series count is less than your total node count, check whether some nodes are CPU-only — they won't have a DCGM exporter and won't report GPU metrics. You can verify with:
kubectl get nodes -l 'nvidia.com/gpu.deploy.dcgm-exporter!=true' \
  -o custom-columns='NAME:.metadata.name,TYPE:.metadata.labels.beta\.kubernetes\.io/instance-type'
Example
Symptom: After deploying the grafana-cmk solution on a cluster, only a subset of GPU nodes report DCGM_FI_DEV_GPU_UTIL in Grafana.
Checking the DaemonSet:
$ kubectl get daemonset crusoe-watch-agent -n crusoe-system
NAME                 DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
crusoe-watch-agent   356       356       356     356          356         <none>          3d5h
All pods are running. The issue isn't pod health; it's that Vector never loaded the DCGM scrape config.
Identifying stuck vs. working pods:
$ kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent \
-o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
| xargs -P20 -I{} sh -c \
'if kubectl logs -n crusoe-system {} -c vector --tail=50 2>/dev/null \
| grep -q "Vector has reloaded"; then echo "OK"; else echo "STUCK"; fi' \
| sort | uniq -c
175 OK
181 STUCK
What the config-reloader logs look like on a STUCK pod:
{"timestamp": "...", "message": "Initial pod set: [('nvidia-dcgm-exporter-xxxxx', 'dcgm_exporter')]"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: {'dcgm_exporter': 1}"}
{"timestamp": "...", "message": "Reconcile: no changes detected (pods=1, cycle_ms=101)."}
{"timestamp": "...", "message": "Reconcile: no changes detected (pods=1, cycle_ms=89)."}
The config-reloader found the DCGM exporter immediately and wrote the config once. Vector missed this single write event, and since the config-reloader sees "no changes" on every subsequent cycle, it never writes again.
What it looks like on a working pod:
{"timestamp": "...", "message": "Initial pod set: []"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: none"}
{"timestamp": "...", "message": "Pod set diff: added=[('nvidia-dcgm-exporter-yyyyy', 'dcgm_exporter')]"}
{"timestamp": "...", "message": "Vector config written. Scraped pods by type: {'dcgm_exporter': 1}"}
The DCGM exporter wasn't ready on the first cycle, so the config-reloader wrote an empty config initially. On the next cycle, it discovered the exporter and wrote the full config. Vector's file watcher was fully initialised by then and caught the second write.
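The difference between the two log excerpts comes down to how many times the config was written: once on the STUCK pod, twice on the working one. As a rough heuristic only, that count can be extracted from a pod's config-reloader logs. The sketch below assumes the "Vector config written" message text shown in the excerpts above; count_config_writes is a hypothetical helper name, and <pod> in the usage comment is a placeholder.

```shell
# count_config_writes: hypothetical helper (not Crusoe tooling).
# Reads config-reloader logs on stdin and prints how many times the
# "Vector config written" message appears.
#
# Usage (<pod> is a placeholder):
#   kubectl logs -n crusoe-system <pod> -c vector-config-reloader | count_config_writes
count_config_writes() {
  grep -c 'Vector config written' || true   # grep exits 1 when the count is 0
}
```

A count of 1 matches the STUCK excerpt and a count of 2 or more matches the working one. Treat this as an indicator of exposure to the race rather than proof: a pod that was fixed with touch will still show a single write, and older log lines may no longer be available.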
Applying the fix:
$ kubectl get pods -n crusoe-system -l app.kubernetes.io/name=crusoe-watch-agent -o name \
| xargs -I{} kubectl exec -n crusoe-system {} -c vector-config-reloader \
-- touch /etc/vector/vector.yaml
After running the command, all nodes reported DCGM metrics in Grafana within a few minutes.
Related Articles
- crusoe-watch-agent solution — deployment instructions and configuration
- grafana-cmk solution — self-hosted Grafana on Crusoe Managed Kubernetes
- Crusoe Support Access — how to grant Crusoe support access to your cluster
- Crusoe Solutions Library