Introduction
If your CMK nodes are showing nvidia.com/gpu: 8 but nvidia.com/hostdev: 0, your GPUs are healthy but InfiniBand and RDMA host devices are not being advertised to the Kubernetes scheduler. This means any workload requiring IB/RDMA will fail to schedule, even though nvidia-smi and the NVIDIA device plugin look completely clean.
This issue commonly surfaces after a VM reset. If a NoSchedule taint was applied to the node before the reset — for example, by an automated health monitoring system marking a node as degraded — the taint persists across the reset. The SR-IOV device plugin cannot reschedule onto the node when it comes back up, so nvidia.com/hostdev stays at 0 even though the node otherwise appears healthy.
The NVIDIA device plugin (which registers nvidia.com/gpu) and the SR-IOV device plugin (which registers nvidia.com/hostdev) are two independent registration paths. A problem with one does not affect the other.
Prerequisites
- kubectl access to the affected CMK cluster
- Access to the nvidia-gpu-operator and sriov-network-operator namespaces
Instructions
- Check Node Capacity for the hostdev Discrepancy
  - Run the following on the affected node:
    kubectl get node <node-name> -o json | jq '.status.capacity, .status.allocatable'
  - On a healthy node, both nvidia.com/gpu and nvidia.com/hostdev should show 8. The pattern to look for is:
    { "nvidia.com/gpu": "8", "nvidia.com/hostdev": "0" }
  - To sweep the entire cluster for affected nodes at once (the dots inside the resource names are escaped so that kubectl treats nvidia.com/gpu and nvidia.com/hostdev as single keys; nodes without GPUs show <none> in both columns and can be ignored):
    kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev' | awk '$3 != 8'
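The cluster-wide sweep in this step also lists nodes that have no GPUs at all. As a minimal sketch, the filter below narrows that table to GPU nodes whose hostdev count is missing or zero. The function name find_hostdev_gaps is hypothetical, and it assumes the three-column NAME/GPU/HOSTDEV layout of the sweep output.

```shell
# find_hostdev_gaps (hypothetical helper): reads a NAME/GPU/HOSTDEV table on
# stdin and prints the names of GPU nodes that advertise no hostdev devices.
find_hostdev_gaps() {
  awk '
    NR == 1 { next }                          # skip the header row
    $2 !~ /^[0-9]+$/ || $2 == 0 { next }      # skip nodes without GPUs
    $3 !~ /^[0-9]+$/ || $3 == 0 { print $1 }  # hostdev missing or zero
  '
}

# Usage (assumes kubectl access to the cluster):
# kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev' | find_hostdev_gaps
```

Nodes that report a partial hostdev count (for example 4) are not flagged by this sketch; it only targets the gpu-present, hostdev-absent pattern described here.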
- Confirm GPUs and the NVIDIA Device Plugin Are Healthy
  - Pull nvidia-smi output and device plugin logs from the affected node:
    kubectl get pods -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset -o wide | grep <node-name>
    kubectl logs -n nvidia-gpu-operator <pod-name> --tail=200
  - In this failure mode you will see all 8 GPUs present and healthy in nvidia-smi, and clean registration with no Xid errors in the device plugin logs. If everything at the GPU layer looks fine but hostdev is still 0, this points to the SR-IOV issue rather than a GPU hardware problem.
- Check for a NoSchedule Taint on the Node
  - Run:
    kubectl describe node <node-name> | grep Taints
  - If you see a customer-applied NoSchedule taint (for example, a taint your orchestration system adds to mark bad nodes), that taint is likely blocking the SR-IOV device plugin from scheduling on the node after a reset.
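When a node carries more than one taint, kubectl describe prints the additional taints on continuation lines, so grep Taints shows only the first one. The sketch below lists every NoSchedule taint on a node using a JSONPath template instead. The function name list_noschedule_taints is hypothetical; the key=value:effect output format is a choice made for this example.

```shell
# list_noschedule_taints (hypothetical helper): prints every NoSchedule taint
# on the given node as key=value:effect, one per line.
list_noschedule_taints() {
  node="$1"
  kubectl get node "$node" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}' |
    grep ':NoSchedule$'   # drops PreferNoSchedule and NoExecute taints
}

# Usage (assumes kubectl access to the cluster):
# list_noschedule_taints <node-name>
```

The function exits non-zero when the node has no NoSchedule taint, which makes it usable as a condition in a larger script.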
- Confirm the SR-IOV Pod Is Missing on the Node
  - Run:
    kubectl get pods -n sriov-network-operator -o wide | grep <node-name>
  - If no pod is listed for that node, the SR-IOV device plugin is not running there and nvidia.com/hostdev will not be registered regardless of GPU health.
  ℹ️ Note: The SR-IOV DaemonSet's overall desired/scheduled count may look healthy in kubectl get daemonset because tainted nodes are excluded from the count entirely rather than showing as unscheduled. You must check per-node directly.
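Because the DaemonSet totals can hide the problem, the per-node check may need to be repeated across several nodes. A rough sketch of that loop follows. The function name report_sriov_pods is hypothetical, and the sketch treats any pod in the sriov-network-operator namespace on the node as "present", matching the check in this step.

```shell
# report_sriov_pods (hypothetical helper): for each node name given, reports
# whether any pod in the sriov-network-operator namespace is scheduled there.
report_sriov_pods() {
  for node in "$@"; do
    pods=$(kubectl get pods -n sriov-network-operator \
      --field-selector "spec.nodeName=$node" -o name)
    if [ -n "$pods" ]; then
      echo "$node: present"
    else
      echo "$node: MISSING"
    fi
  done
}

# Usage (assumes kubectl access to the cluster):
# report_sriov_pods <node-name-1> <node-name-2>
```

The field selector matches the node name exactly, whereas piping through grep also matches any node whose name merely contains the search string.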
- Resolve the Issue
  - There are two paths depending on whether the taint is still needed:
    - If the taint is no longer needed, remove it yourself and the SR-IOV pod will reschedule automatically. The hostdev capacity will restore to 8 without any VM reset or deletion:
      kubectl taint nodes <node-name> <taint-key>-
    - If the taint is intentional and you want to keep it, open a Crusoe support ticket and include:
      - Affected node names
      - The taint key and value identified in Step 3 (Check for a NoSchedule Taint on the Node)
      - Confirmation that the node reports nvidia.com/gpu: 8 but nvidia.com/hostdev: 0
      Crusoe Support will add a toleration for the taint to the NicClusterPolicy. Once applied, the SR-IOV plugin will reschedule onto the node and hostdev capacity will restore automatically, with no VM reset or deletion needed.
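After the taint is removed or the toleration is in place, recovery can be confirmed by polling the node until hostdev is advertised again. The following is a sketch only: the function name wait_for_hostdev and the default of 30 polls at 10-second intervals are assumptions, not documented recovery times.

```shell
# wait_for_hostdev (hypothetical helper): polls a node until its allocatable
# nvidia.com/hostdev count is non-zero, or gives up after a number of attempts.
wait_for_hostdev() {
  node="$1"
  attempts="${2:-30}"   # assumed default: 30 polls
  delay="${3:-10}"      # assumed default: 10 seconds between polls
  i=0
  while [ "$i" -lt "$attempts" ]; do
    count=$(kubectl get node "$node" \
      -o jsonpath='{.status.allocatable.nvidia\.com/hostdev}')
    case "$count" in
      ''|0|*[!0-9]*) ;;   # not registered yet, keep polling
      *) echo "$node hostdev=$count"; return 0 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node hostdev still 0 after $attempts checks" >&2
  return 1
}

# Usage (assumes kubectl access to the cluster):
# wait_for_hostdev <node-name>
```

If the function times out, rechecking the node's taints and the SR-IOV pod placement is the natural next step.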
Example
A customer's automated health monitoring system detects degraded nodes and applies a NoSchedule taint to mark them out of service. A VM reset is performed on the affected nodes as part of the recovery process, but because the taint persists across the reset, the SR-IOV device plugin is never able to reschedule onto those nodes after they come back up. nvidia-smi shows all 8 GPUs healthy and the NVIDIA device plugin logs show clean registration, but workloads requiring InfiniBand continue to fail to schedule. Checking kubectl get node -o json reveals nvidia.com/hostdev: 0 on all affected nodes. Confirming via kubectl get pods -n sriov-network-operator shows no SR-IOV pod running on those nodes. A support ticket is opened and the CMK team adds a toleration for the taint to the NicClusterPolicy, restoring hostdev capacity without any VM migration.