
How-To Diagnose nvidia.com/hostdev: 0 on CMK Nodes

Tanaya Atmaram Kambli

Introduction

If your CMK nodes are showing nvidia.com/gpu: 8 but nvidia.com/hostdev: 0, your GPUs are healthy but InfiniBand and RDMA host devices are not being advertised to the Kubernetes scheduler. This means any workload requiring IB/RDMA will fail to schedule, even though nvidia-smi and the NVIDIA device plugin look completely clean.

This issue commonly surfaces after a VM reset. If a NoSchedule taint was applied to the node before the reset — for example, by an automated health monitoring system marking a node as degraded — the taint persists across the reset. The SR-IOV device plugin cannot reschedule onto the node when it comes back up, so nvidia.com/hostdev stays at 0 even though the node otherwise appears healthy.

The NVIDIA device plugin (which registers nvidia.com/gpu) and the SR-IOV device plugin (which registers nvidia.com/hostdev) are two independent registration paths. A problem with one does not affect the other.

Prerequisites

  • kubectl Access to the Affected CMK Cluster
  • Access to the nvidia-gpu-operator and sriov-network-operator Namespaces

Instructions

  1. Check Node Capacity for the hostdev Discrepancy
    • Run the following on the affected node:

      kubectl get node <node-name> -o json | jq '.status.capacity, .status.allocatable'
    • On a healthy node, both nvidia.com/gpu and nvidia.com/hostdev should show 8. The failure pattern to look for is:

      { "nvidia.com/gpu": "8", "nvidia.com/hostdev": "0" }
    • To sweep the entire cluster for affected nodes at once:

      kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev' | awk '$3 != 8'
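
The sweep above can be narrowed so that CPU-only nodes (which report <none> in both columns) are not flagged. The following is a minimal sketch, not an official tool: find_hostdev_mismatch is a hypothetical helper name, and it assumes the three-column NAME/GPU/HOSTDEV layout produced by the sweep command and an expected hostdev count of 8 unless another value is passed.

```shell
# Hypothetical helper: reads "NAME GPU HOSTDEV" rows on stdin (header first)
# and prints nodes that advertise GPUs but not the expected hostdev count.
# Assumption: the expected count defaults to 8, as on the nodes described here.
find_hostdev_mismatch() {
  want="${1:-8}"
  awk -v want="$want" '
    NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 && $3 != want { print $1 }
  '
}

# Usage (pipe the sweep output into the helper):
#   kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.capacity.nvidia\.com/gpu,HOSTDEV:.status.capacity.nvidia\.com/hostdev' \
#     | find_hostdev_mismatch
```

Rows whose GPU column is not a positive number are skipped, so only GPU nodes with a missing or wrong hostdev value are listed.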
  2. Confirm GPUs and the NVIDIA Device Plugin Are Healthy
    • Check nvidia-smi on the affected node, then locate that node's NVIDIA device plugin pod and pull its logs:

      kubectl get pods -n nvidia-gpu-operator -l app=nvidia-device-plugin-daemonset -o wide | grep <node-name>
      kubectl logs -n nvidia-gpu-operator <pod-name> --tail=200
    • In this failure mode you will see all 8 GPUs present and healthy in nvidia-smi, and clean registration with no Xid errors in the device plugin logs. If everything at the GPU layer looks fine but hostdev is still 0, this points to the SR-IOV issue rather than a GPU hardware problem.
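
A quick way to confirm the "no Xid errors" condition is to count log lines that mention an Xid. This is a rough sketch: count_xid_lines is a hypothetical helper, and matching the string "xid" case-insensitively is an assumption about how such errors appear in the log, so treat a zero as a hint rather than proof.

```shell
# Hypothetical helper: counts stdin lines that mention "xid" (any case).
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# helper's exit status clean while still printing "0".
count_xid_lines() {
  grep -ci 'xid' || true
}

# Usage:
#   kubectl logs -n nvidia-gpu-operator <pod-name> --tail=200 | count_xid_lines
```

A result of 0 together with 8 healthy GPUs in nvidia-smi is consistent with the SR-IOV failure mode described in this article.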
  3. Check for a NoSchedule Taint on the Node
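    • Run the following (-A 5 keeps any additional taints, which kubectl describe prints on the lines after the first):

      kubectl describe node <node-name> | grep -A 5 Taints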
    • If you see a customer-applied NoSchedule taint (for example, a taint your orchestration system adds to mark bad nodes), that taint is likely blocking the SR-IOV device plugin from scheduling on the node after a reset.
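
When a node carries several taints, a one-taint-per-line listing is easier to read and to paste into a ticket. The sketch below uses kubectl's jsonpath output to print each taint as key=value:effect and then keeps only NoSchedule entries. list_taints and noschedule_only are hypothetical helper names.

```shell
# Hypothetical helper: print every taint on a node as key=value:effect.
list_taints() {
  kubectl get node "$1" \
    -o jsonpath='{range .spec.taints[*]}{.key}={.value}:{.effect}{"\n"}{end}'
}

# Hypothetical helper: keep only lines whose effect is exactly NoSchedule.
noschedule_only() {
  awk -F: '$NF == "NoSchedule"'
}

# Usage:
#   list_taints <node-name> | noschedule_only
```

No output means the node has no NoSchedule taint, in which case the missing SR-IOV pod has a different cause.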
  4. Confirm the SR-IOV Pod Is Missing on the Node
    • Run:

      kubectl get pods -n sriov-network-operator -o wide | grep <node-name>
    • If no pod is listed for that node, the SR-IOV device plugin is not running there and nvidia.com/hostdev will not be registered regardless of GPU health.
    • ℹ️ Note: The SR-IOV DaemonSet's overall desired/scheduled count may look healthy in kubectl get daemonset because tainted nodes are excluded from the count entirely rather than showing as unscheduled. You must check per-node directly.
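
Because the DaemonSet counts cannot be trusted here, the per-node check can be scripted for several nodes at once. This is a minimal sketch: nodes_missing_pods is a hypothetical helper that takes a "POD NODE" listing on stdin and the node names to check as arguments. Like the grep above, it treats any pod in the namespace on that node as evidence the plugin is present, which is an assumption.

```shell
# Hypothetical helper: stdin is "POD NODE" rows; arguments are node names.
# Prints each named node that has no pod listed against it.
nodes_missing_pods() {
  pods="$(cat)"
  for node in "$@"; do
    # awk exits 0 only if some row's second column equals the node name.
    if ! printf '%s\n' "$pods" |
        awk -v n="$node" '$2 == n { found = 1 } END { exit !found }'; then
      echo "$node"
    fi
  done
}

# Usage:
#   kubectl get pods -n sriov-network-operator --no-headers \
#     -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName' \
#     | nodes_missing_pods <node-a> <node-b>
```

Using custom columns instead of -o wide keeps the node name in a fixed second column regardless of what the other columns contain.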

  5. Resolve the Issue
    • There are two paths depending on whether the taint is still needed:
      • If the taint is no longer needed, remove it yourself (the trailing hyphen in the command tells kubectl to delete the taint). The SR-IOV pod then reschedules automatically and hostdev capacity returns to 8 without any VM reset or deletion:

        kubectl taint nodes <node-name> <taint-key>-
      • If the taint is intentional and you want to keep it, open a Crusoe support ticket and include:

        • Affected node names
        • The taint key and value identified in Step 3
        • Confirmation that nvidia.com/gpu: 8 is healthy but nvidia.com/hostdev: 0

        Crusoe Support will add a toleration for the taint to the NicClusterPolicy. Once applied, the SR-IOV plugin will reschedule onto the node and hostdev capacity will restore automatically — no VM reset or deletion needed.
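
After either path, it can help to confirm that the capacity actually came back rather than assuming it did. The following is a hedged sketch of a polling loop: wait_for_hostdev is a hypothetical helper, and the defaults (expect 8, 30 checks, 10 seconds apart) are illustrative assumptions rather than documented timings.

```shell
# Hypothetical helper: poll a node until nvidia.com/hostdev reaches the
# expected allocatable count. Args: node [want=8] [tries=30] [delay=10].
wait_for_hostdev() {
  node="$1"; want="${2:-8}"; tries="${3:-30}"; delay="${4:-10}"
  i=0; got=""
  while [ "$i" -lt "$tries" ]; do
    # The dot in the resource name must be escaped in jsonpath.
    got="$(kubectl get node "$node" \
      -o jsonpath='{.status.allocatable.nvidia\.com/hostdev}')"
    if [ "$got" = "$want" ]; then
      echo "$node: nvidia.com/hostdev=$got"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "$node: nvidia.com/hostdev still '$got' after $tries checks" >&2
  return 1
}

# Usage:
#   wait_for_hostdev <node-name>
```

If the helper times out, re-run the Step 3 and Step 4 checks to see whether another taint is still keeping the SR-IOV pod off the node.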

Example

A customer's automated health monitoring system detects degraded nodes and applies a NoSchedule taint to mark them out of service. A VM reset is performed on the affected nodes as part of the recovery process, but because the taint persists across the reset, the SR-IOV device plugin is never able to reschedule onto those nodes after they come back up. nvidia-smi shows all 8 GPUs healthy and the NVIDIA device plugin logs show clean registration, but workloads requiring InfiniBand continue to fail to schedule. Checking kubectl get node -o json reveals nvidia.com/hostdev: 0 on all affected nodes. Confirming via kubectl get pods -n sriov-network-operator shows no SR-IOV pod running on those nodes. A support ticket is opened and the CMK team adds a toleration for the taint to the NicClusterPolicy, restoring hostdev capacity without any VM migration.
