How-To Diagnose nvidia.com/hostdev: 0 on CMK Nodes

Introduction

If your Crusoe Managed Kubernetes (CMK) nodes show nvidia.com/gpu: 8 but nvidia.com/hostdev: 0, the GPUs are healthy but the InfiniBand (IB) and RDMA host devices are not being advertised to the Kubernetes scheduler. Any workload that requests IB/RDMA will fail to schedule, even though nvidia-smi and the NVIDIA device plugin look completely clean.

This issue commonly surfaces after a VM reset. If a NoSchedule taint was applied to the node before the reset (for example, by an automated health monitoring system marking the node as degraded), the taint persists across the reset. When the node comes back up, the SR-IOV device plugin pod cannot be scheduled onto it because it does not tolerate the taint, so nvidia.com/hostdev stays at 0 even though the node otherwise appears healthy.

The NVIDIA device plugin (which registers nvidia.com/gpu) and the SR-IOV device plugin (which registers nvidia.com/hostdev) are two independent registration paths. A problem with one does not affect the other.

Prerequisites

- kubectl access to the affected CMK cluster
- Access to the nvidia-gpu-operator and sriov-network-operator namespaces

Instructions

Check Node Capacity for the hostdev Discrepancy

Run the following on the affected node:

kubectl get node <node-name> -o json | jq '.status.capacity, .status.allocatable'

On a healthy node, both nvidia.com/gpu and nvidia.com/hostdev show 8 under capacity and allocatable. The failure pattern to look for is GPUs at 8 with hostdev at 0:
```
{ "nvidia.com/gpu": "8", "nvidia.com/hostdev": "0" }
```

To sweep the entire cluster for affected nodes at once:

kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev" | awk '$3 != 8'
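
The sweep above also prints the header row and any non-GPU nodes, whose columns show `<none>`. As a minimal sketch of a tighter filter, the helper below keeps only nodes that advertise GPUs but whose hostdev count differs from the expected value. The function name `find_hostdev_mismatch` is hypothetical, and the default of 8 is an assumption taken from the healthy-node figure in this article.

```shell
# Hypothetical helper: list GPU nodes whose nvidia.com/hostdev capacity
# differs from the expected count (default 8, an assumption from this article).
find_hostdev_mismatch() {
  want="${1:-8}"
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev' |
    # Keep rows with a positive numeric GPU count and a hostdev value that
    # is not the expected one; a missing resource shows up as "<none>".
    awk -v want="$want" \
      '$2 ~ /^[0-9]+$/ && $2 > 0 && $3 != want { print $1, "gpu=" $2, "hostdev=" $3 }'
}
```

Running `find_hostdev_mismatch` with no arguments checks against 8; pass a different number if your node type exposes a different count of host devices.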

Confirm GPUs and the NVIDIA Device Plugin Are Healthy
- Locate the NVIDIA device plugin pod on the affected node and pull its logs; check nvidia-smi output from the node as well:
```
kubectl get pods -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset -o wide | grep <node-name>
kubectl logs -n nvidia-gpu-operator <pod-name> --tail=200
```
- In this failure mode you will see all 8 GPUs present and healthy in nvidia-smi, and clean registration with no Xid errors in the device plugin logs. If everything at the GPU layer looks fine but hostdev is still 0, this points to the SR-IOV issue rather than a GPU hardware problem.
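
The two commands above can be combined so that the pod name does not have to be copied by hand. This is a sketch only: `device_plugin_xid_lines` is a hypothetical helper name, and it assumes the namespace and label shown in this article. It selects the pod by node with a field selector, then searches the last 200 log lines for Xid mentions.

```shell
# Hypothetical helper: find the NVIDIA device plugin pod on one node and
# report any log lines that mention Xid.
device_plugin_xid_lines() {
  node="$1"
  # --field-selector limits the pod list to pods scheduled on this node.
  pod=$(kubectl get pods -n nvidia-gpu-operator \
    -l app=nvidia-device-plugin-daemonset \
    --field-selector "spec.nodeName=$node" \
    -o jsonpath='{.items[0].metadata.name}')
  [ -n "$pod" ] || { echo "no device plugin pod found on $node" >&2; return 2; }
  kubectl logs -n nvidia-gpu-operator "$pod" --tail=200 |
    grep -i 'xid' || echo "no Xid lines in last 200 log lines of $pod"
}
```

A "no Xid lines" result together with hostdev at 0 is consistent with the SR-IOV failure mode described here rather than a GPU fault.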

Check for a NoSchedule Taint on the Node
- Run:
```
kubectl describe node <node-name> | grep -A 5 Taints
```
- If you see a customer-applied NoSchedule taint (for example, a taint your orchestration system adds to mark bad nodes), that taint is likely blocking the SR-IOV device plugin from scheduling on the node after a reset.
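
When a node carries several taints, it can help to print each one with its key, value, and effect, since the key is needed both for removal and for a support ticket. The following is a rough sketch that reads the taints from the node spec; `list_noschedule_taints` is a hypothetical helper name.

```shell
# Hypothetical helper: print every NoSchedule taint on a node as
# key=value:effect (or key:effect when the taint has no value).
list_noschedule_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}{"\t"}{.value}{"\t"}{.effect}{"\n"}{end}' |
    awk -F '\t' '$3 == "NoSchedule" {
      if ($2 != "") print $1 "=" $2 ":" $3
      else          print $1 ":" $3
    }'
}
```

An empty result means the node has no NoSchedule taints, in which case the missing SR-IOV pod has a different cause than the one covered in this article.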

Confirm the SR-IOV Pod Is Missing on the Node
- Run:
```
kubectl get pods -n sriov-network-operator -o wide | grep <node-name>
```
- If no pod is listed for that node, the SR-IOV device plugin is not running there and nvidia.com/hostdev will not be registered regardless of GPU health.
- ℹ️ Note: The SR-IOV DaemonSet's overall desired/scheduled count may look healthy in kubectl get daemonset because tainted nodes are excluded from the count entirely rather than showing as unscheduled. You must check per-node directly.
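
Because the DaemonSet counts cannot be trusted here, the per-node check can be scripted for a list of suspect nodes. This is a sketch under one stated assumption: any pod in the sriov-network-operator namespace counts as "present", so a node that runs only some other pod from that namespace would be reported as covered. `check_sriov_pods` is a hypothetical helper name.

```shell
# Hypothetical helper: for each node name given as an argument, report
# whether any pod in the sriov-network-operator namespace runs on it.
check_sriov_pods() {
  # One node name per line, one line per pod in the namespace.
  pod_nodes=$(kubectl get pods -n sriov-network-operator \
    -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}')
  for node in "$@"; do
    # -F: fixed string, -x: whole-line match, -q: quiet.
    if printf '%s\n' "$pod_nodes" | grep -Fxq -- "$node"; then
      echo "$node: pod present"
    else
      echo "$node: no pod found"
    fi
  done
}
```

For example, `check_sriov_pods node-a node-b` prints one status line per node, so the output can be pasted directly into a ticket.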

Resolve the Issue
- There are two paths depending on whether the taint is still needed:
  - If the taint is no longer needed, remove it yourself and the SR-IOV pod will reschedule automatically. The hostdev capacity will restore to 8 without any VM reset or deletion:
```
kubectl taint nodes <node-name> <taint-key>-
```
  - If the taint is intentional and you want to keep it, open a Crusoe support ticket and include:
    - Affected node names
    - The taint key and value identified in the "Check for a NoSchedule Taint on the Node" step
    - Confirmation that the node reports nvidia.com/gpu: 8 but nvidia.com/hostdev: 0
    Crusoe Support will add a toleration for the taint to the NicClusterPolicy. Once it is applied, the SR-IOV plugin will reschedule onto the node and hostdev capacity will restore automatically, with no VM reset or deletion needed.
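
Whichever path is taken, recovery can be confirmed by polling the node until hostdev is advertised again. The helper below is a minimal sketch: `wait_for_hostdev` is a hypothetical name, and the defaults (expect 8, check 30 times, 10 seconds apart) are illustrative assumptions rather than documented recovery times.

```shell
# Hypothetical helper: poll a node until allocatable nvidia.com/hostdev
# reaches the expected value, or give up after a fixed number of checks.
# Usage: wait_for_hostdev <node-name> [expected] [checks] [delay-seconds]
wait_for_hostdev() {
  node="$1"; want="${2:-8}"; tries="${3:-30}"; delay="${4:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    # The dot in the resource name must be escaped in JSONPath.
    got=$(kubectl get node "$node" \
      -o jsonpath='{.status.allocatable.nvidia\.com/hostdev}')
    if [ "$got" = "$want" ]; then
      echo "$node: nvidia.com/hostdev=$got"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node: nvidia.com/hostdev is still '$got' after $tries checks" >&2
  return 1
}
```

The function returns a non-zero status on timeout, so it can gate a follow-up step such as resubmitting the IB/RDMA workload.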

Example

A customer's automated health monitoring system detects degraded nodes and applies a NoSchedule taint to mark them out of service. A VM reset is performed on the affected nodes as part of the recovery process, but because the taint persists across the reset, the SR-IOV device plugin is never able to reschedule onto those nodes after they come back up. nvidia-smi shows all 8 GPUs healthy and the NVIDIA device plugin logs show clean registration, but workloads requiring InfiniBand continue to fail to schedule. Running kubectl get node -o json reveals nvidia.com/hostdev: 0 on all affected nodes, and kubectl get pods -n sriov-network-operator shows no SR-IOV pod running on them. The customer opens a support ticket, and the CMK team adds a toleration for the taint to the NicClusterPolicy, restoring hostdev capacity without any VM migration.
