Last Updated: Mar 31, 2026
Introduction
On Crusoe Managed Kubernetes (CMK) GPU clusters, the MOFED (Mellanox OpenFabrics Enterprise Distribution) DaemonSet can fail to install when RDMA storage modules are already loaded in the kernel. This happens when a Crusoe shared volume is mounted on a node before the Network Operator deploys the MOFED pod: the NFS mount loads rpcrdma and its RDMA dependency chain, and the MOFED installer refuses to proceed because it finds these modules already present.
The failure is a race condition: if the NFS mount wins the race against MOFED, the node loses all GPU capacity. The fix is to set UNLOAD_STORAGE_MODULES=true in the NicClusterPolicy. This article explains the symptoms, root cause, fix, and known side-effects based on internal investigation and testing on B200 nodes.
Affected Configurations
- CMK clusters with GPU node pools (H200, B200, or any SXM IB SKU)
- Shared volumes attached to GPU worker nodes
- NVIDIA Network Operator with MOFED DaemonSet
- Kernel: 5.15.0-170-generic (standard CMK worker image)
Symptoms
The MOFED pod logs the following message and enters CrashLoopBackOff:
Storage modules are loaded for current driver, terminating
prior driver reload failure due to UNLOAD_STORAGE_MODULES
not set to "true"
This causes a cascading failure:
- The MOFED pod crash-loops, so the network.nvidia.com/operator.mofed.wait label stays true.
- The SR-IOV device plugin never deploys because it waits on MOFED.
- nvidia.com/hostdev remains at 0 (or null) on the node.
- nvidia.com/gpu disappears from node capacity and allocatable resources.
- The node is effectively dead for GPU workloads.
# MOFED pod status
$ kubectl get pods -n nvidia-network-operator | grep mofed
# hostdev allocation on the node
$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, hostdev: .status.allocatable["nvidia.com/hostdev"]}'
# Which storage modules are loaded on the affected node (via node debug shell)
$ lsmod | egrep 'ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt'
On an affected node, rpcrdma will be loaded. On healthy nodes in the same cluster, it will not be.
Root Cause
The Race Condition
When a shared volume is mounted on a node before the Network Operator is installed and the MOFED pod is up, the mount triggers the kernel to load rpcrdma and its RDMA dependency chain. MOFED needs to install these NFS-related kernel modules itself via DKMS, but because they are already loaded, the installer refuses to proceed.
The rpcrdma module that blocks MOFED is not a stock Ubuntu in-tree module; it is built and shipped by MOFED itself as part of a VAST-customized NFS bundle. The module lives at:
/lib/modules/5.15.0-170-generic/updates/dkms/rpcrdma.ko
When it loads, it logs nfs-bundle: rpcrdma loading in dmesg. The full DKMS directory contains 15 NFS-related modules, all branded with VAST (sunrpc identifies as version 4.0.35-vastdata). linux-modules-extra is not installed on the CMK worker image, confirming these modules come exclusively from a previous MOFED DKMS build. In other words, MOFED is blocked by its own modules.
Module Dependency Chain
When rpcrdma loads, it pulls in a hybrid set of in-tree and DKMS modules:
| Module | Source | Path |
| ib_core | In-tree (Ubuntu) | /kernel/drivers/infiniband/core/ |
| ib_cm | In-tree (Ubuntu) | /kernel/drivers/infiniband/core/ |
| rdma_cm | In-tree (Ubuntu) | /kernel/drivers/infiniband/core/ |
| sunrpc | DKMS / MOFED | /updates/dkms/ |
| rpcrdma | DKMS / MOFED | /updates/dkms/ |
Modules layered on top of ib_core and rdma_cm, such as rpcrdma, are what MOFED's installer considers "storage modules" and blocks on. The MOFED installer checks for these specifically:
storage_modules_loaded=$(lsmod | egrep 'ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt' -c)
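The installer check quoted above can be reproduced by hand on a node. The following is a minimal sketch, not the installer's own code: the function name count_storage_modules is hypothetical, and it matches only the first lsmod column, so its count can be lower than the installer's line count. Zero versus non-zero should still agree, because a module that appears in another module's "Used by" column also has its own row.

```shell
#!/usr/bin/env bash
# Count loaded RDMA storage modules from lsmod-style text on stdin.
# The module list is copied from the installer check quoted in this article;
# count_storage_modules is a hypothetical helper name.
count_storage_modules() {
  # Compare against the module-name column only, so a storage module that is
  # merely listed as a user of sunrpc or ib_core is not counted a second time.
  awk '$1 ~ /^(ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt)$/ { n++ }
       END { print n + 0 }'
}

# Usage from a node debug shell:
#   lsmod | count_storage_modules
# A non-zero result means the MOFED installer would refuse to proceed
# unless UNLOAD_STORAGE_MODULES is set to "true".
```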
What Triggers the Load
Mounting a shared volume over NFS causes sunrpc and subsequently rpcrdma to load. If this happens before the Network Operator has deployed the MOFED pod, MOFED will find these modules already loaded and refuse to install. The boot sequence observed in testing:
| Time (approx.) | Event | Source |
| ~20s | mlx5_core loads (Mellanox NIC driver) | Kernel PCI enumeration |
| ~30s | sunrpc loads (version 4.0.35-vastdata) | NFS mount / DKMS |
| ~1107s | rpcrdma loads | NFS mount dependency chain |
| Later | MOFED DaemonSet pod starts | Kubernetes scheduler |
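To check the ordering on a particular node, the load times can be read back from the kernel log. This is a rough sketch under two assumptions: dmesg prints its default [seconds.microseconds] prefix, and the load message contains the nfs-bundle: rpcrdma loading string reported in this article. The helper name first_seen is hypothetical.

```shell
#!/usr/bin/env bash
# Print the dmesg timestamp (seconds since boot) of the first line on stdin
# that contains the given literal string. Prints nothing if there is no match.
# first_seen is a hypothetical helper name.
first_seen() {
  awk -v pat="$1" '
    index($0, pat) {
      sub(/^\[ */, "")   # drop the opening bracket and padding
      sub(/\].*/, "")    # drop everything from the closing bracket onward
      print
      exit
    }'
}

# Usage from a node debug shell:
#   dmesg | first_seen 'nfs-bundle: rpcrdma loading'
#   dmesg | first_seen 'mlx5_core'
```

Comparing the rpcrdma timestamp with the MOFED pod's start time shows which side won the race on that node.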
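In the boot sequence observed in testing, rpcrdma had finished loading before the MOFED DaemonSet pod started, so the installer found rpcrdma already loaded.
Resolution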
Set UNLOAD_STORAGE_MODULES=true in the NicClusterPolicy ofedDriver spec using the following steps:
- Export the current NicClusterPolicy to a file
  kubectl get nicclusterpolicy nic-cluster-policy -o yaml > nic-cluster-policy.yaml
- Edit the file
  Open nic-cluster-policy.yaml and add the following under spec.ofedDriver.env (create the env key if it doesn't exist):
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.01-0.6.0.0-0
    env:
      - name: UNLOAD_STORAGE_MODULES   # <---- add this
        value: "true"                  # <---- add this
  This tells the MOFED installer to unload the conflicting storage modules before proceeding with driver installation, rather than refusing and exiting.
- Apply the updated file
  kubectl apply -f nic-cluster-policy.yaml
Note: Using a file-based apply (rather than kubectl patch) is the recommended approach as it preserves any existing environment variables you may already have configured under ofedDriver.
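To confirm the variable is present, either in the edited file or in a fresh export taken after applying, a simple text-level check may be enough. The sketch below is not a YAML parser: it assumes the value line follows the name line, as in the snippet in this article, and the function name has_unload_flag is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if the file has an env entry named UNLOAD_STORAGE_MODULES
# whose following value line is true; fail (exit 1) otherwise.
# has_unload_flag is a hypothetical helper name.
has_unload_flag() {
  awk '
    /name:[ \t]*UNLOAD_STORAGE_MODULES/ { seen = 1; next }
    # The first value line after the name decides the result.
    seen && /value:/ { ok = ($0 ~ /value:[ \t]*"?true"?/); exit }
    END { exit (ok ? 0 : 1) }
  ' "$1"
}

# Usage:
#   has_unload_flag nic-cluster-policy.yaml && echo "flag present"
```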
Once the MOFED pod installs successfully, nvidia.com/hostdev will become available and GPU workloads can schedule.
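Recovery can be checked across the whole cluster by listing the nodes that still report no nvidia.com/hostdev. This is a minimal sketch, assuming kubectl's custom-columns output, where a missing field is printed as <none>. The helper name nodes_missing_hostdev is hypothetical.

```shell
#!/usr/bin/env bash
# Read "NAME HOSTDEV" rows on stdin and print the names of nodes whose
# hostdev allocatable is missing, <none>, or 0.
# nodes_missing_hostdev is a hypothetical helper name.
nodes_missing_hostdev() {
  awk 'NF && ($2 == "" || $2 == "<none>" || $2 == "0") { print $1 }'
}

# Usage (the backslash escapes the dot inside the resource name):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,HOSTDEV:.status.allocatable.nvidia\.com/hostdev' \
#     | nodes_missing_hostdev
```

An empty result suggests every node has recovered. Any node still listed is worth inspecting with the MOFED pod status command from the Symptoms section.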
Known Side-Effects
If the MOFED pod restarts or the Network Operator is upgraded after initial installation (e.g., node reboot, pod eviction), it will unload RDMA storage modules, including rpcrdma. Any workloads actively writing to NFS-mounted shared volumes will experience a temporary I/O stall while the HCA driver is reloaded.
Depending on what a workload writes to the shared mount and how it writes it, the workload may fail during this stall.
We recommend pausing workloads that write to shared volumes during network upgrades.
In testing, this stall lasted 27 seconds. No data was lost, no errors were raised, and no application crashes occurred. The NFS client's hard-mount retry mechanism absorbs the transport disruption transparently, but it is not guaranteed to do so for all types of workloads.
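One way to estimate the stall for a given workload pattern is to append a timestamp to a file on the shared volume once per second and then look for the largest gap. This is only a sketch and not necessarily how the 27-second figure was measured. It assumes each append blocks until the write completes, and both the probe path and the helper name max_gap are hypothetical.

```shell
#!/usr/bin/env bash
# Print the largest gap, in seconds, between consecutive epoch timestamps
# read from stdin (one per line). Prints 0 for fewer than two samples.
# max_gap is a hypothetical helper name.
max_gap() {
  awk 'NR > 1 && $1 - prev > max { max = $1 - prev }
       { prev = $1 }
       END { print max + 0 }'
}

# Usage (/mnt/shared/stall-probe.log is a hypothetical path on the shared volume):
#   while true; do date +%s >> /mnt/shared/stall-probe.log; sleep 1; done   # run during the restart
#   max_gap < /mnt/shared/stall-probe.log                                    # run afterwards
```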
Immediate Workaround (Without Applying the Fix)
If you cannot immediately update the NicClusterPolicy:
- Recreate the affected VM: Delete the affected node from the node pool and let CMK provision a replacement. If MOFED wins the race on the new node (which is the common case), it will come up healthy.
- Ensure shared volumes are not mounted before MOFED: If possible, delay shared volume attachment until after the NVIDIA Network Operator pods are running on the node.
The UNLOAD_STORAGE_MODULES=true fix is the recommended permanent solution. If you are still seeing this issue, please reach out to Crusoe Cloud support.