Resolving MOFED Storage Module Race Condition on CMK GPU Nodes

Rishabh Sinha

Last Updated: Mar 31, 2026

Introduction

On Crusoe Managed Kubernetes (CMK) GPU clusters, the MOFED (Mellanox OpenFabrics Enterprise Distribution) DaemonSet can fail to install when RDMA storage modules are already loaded in the kernel. This happens when a Crusoe shared volume is mounted on a node before the Network Operator deploys the MOFED pod: the NFS mount loads rpcrdma and its RDMA dependency chain, and the MOFED installer refuses to proceed because it finds these modules already present.

The failure is a race condition: if the NFS mount wins the race against MOFED, the node loses all GPU capacity. The fix is to set UNLOAD_STORAGE_MODULES=true in the NicClusterPolicy. This article explains the symptoms, root cause, fix, and known side-effects based on internal investigation and testing on B200 nodes.

Affected Configurations

  • CMK clusters with GPU node pools (H200, B200, or any SXM IB SKU)
  • Shared volumes attached to GPU worker nodes
  • NVIDIA Network Operator with MOFED DaemonSet
  • Kernel: 5.15.0-170-generic (standard CMK worker image)

Symptoms

When the MOFED pod starts on an affected node, it exits immediately with this error:
Storage modules are loaded for current driver, terminating
prior driver reload failure due to UNLOAD_STORAGE_MODULES
not set to "true"
The MOFED pod enters CrashLoopBackOff. This causes a cascading failure:
  1. The MOFED pod crash-loops, so the network.nvidia.com/operator.mofed.wait label stays true.
  2. The SR-IOV device plugin never deploys because it waits on MOFED.
  3. nvidia.com/hostdev remains at 0 (or null) on the node.
  4. nvidia.com/gpu disappears from node capacity and allocatable resources.
  5. The node is effectively dead for GPU workloads.
You can confirm the issue by checking:
# MOFED pod status
$ kubectl get pods -n nvidia-network-operator | grep mofed

# hostdev allocation on the node
$ kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, hostdev: .status.allocatable["nvidia.com/hostdev"]}'

# Which storage modules are loaded on the affected node (via node debug shell)
$ lsmod | grep -E 'ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt'
On an affected node, rpcrdma will be loaded. On healthy nodes in the same cluster, it will not be.
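
To scan a whole cluster rather than a single node, the hostdev check can be wrapped in a small filter. The following is a minimal sketch, not a supported tool: the function name flag_nodes_without_hostdev is hypothetical, and it assumes input lines of the form "<node-name> <hostdev-value>", with kubectl printing <none> when the resource is absent.

```shell
# Hypothetical helper: print the names of nodes whose nvidia.com/hostdev
# allocatable value is missing or zero.
# Input (stdin): one "<node-name> <hostdev-value>" pair per line.
flag_nodes_without_hostdev() {
  awk 'NF >= 1 && ($2 == "" || $2 == "0" || $2 == "<none>" || $2 == "null") { print $1 }'
}

# Example usage against a live cluster (not run here):
#   kubectl get nodes --no-headers \
#     -o custom-columns='NAME:.metadata.name,HOSTDEV:.status.allocatable.nvidia\.com/hostdev' \
#     | flag_nodes_without_hostdev
```

A node listed here is only a candidate: confirm it by checking the MOFED pod status and the loaded modules on that node as described above.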

Root Cause

The Race Condition

When a shared volume is mounted on a node before the Network Operator is installed and the MOFED pod is up, the mount triggers the kernel to load rpcrdma and its RDMA dependency chain. MOFED needs to install these NFS-related kernel modules itself via DKMS, but because they are already loaded, the installer refuses to proceed.

The rpcrdma module that blocks MOFED is not a stock Ubuntu in-tree module. It is built and shipped by MOFED itself as part of a VAST-customized NFS bundle. The module lives at:

/lib/modules/5.15.0-170-generic/updates/dkms/rpcrdma.ko
When loaded, it prints nfs-bundle: rpcrdma loading in dmesg. The full DKMS directory contains 15 NFS-related modules, all branded with VAST (sunrpc identifies as version 4.0.35-vastdata). linux-modules-extra is not installed on the CMK worker image, confirming these modules come exclusively from a previous MOFED DKMS build. MOFED is blocking on its own modules.
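
The origin of a module can be checked from its file path. The sketch below is illustrative only: classify_module_path is a hypothetical helper, and it assumes the directory layout described above (DKMS builds under updates/dkms, in-tree modules under kernel).

```shell
# Hypothetical helper: classify a kernel module file path as DKMS-built
# or in-tree, based on the directory it lives in.
classify_module_path() {
  case "$1" in
    */updates/dkms/*) echo "dkms" ;;
    */kernel/*)       echo "in-tree" ;;
    *)                echo "unknown" ;;
  esac
}

# Example usage on the node itself (not run here):
#   classify_module_path "$(modinfo -n rpcrdma)"   # -n prints the module file path
#   modinfo -F version sunrpc                      # version string reported by sunrpc
#   dpkg -l | grep linux-modules-extra             # prints nothing if not installed
```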
 
Module Dependency Chain

When rpcrdma loads, it pulls in a hybrid set of in-tree and DKMS modules:

Module    Source              Path
ib_core   In-tree (Ubuntu)    /kernel/drivers/infiniband/core/
ib_cm     In-tree (Ubuntu)    /kernel/drivers/infiniband/core/
rdma_cm   In-tree (Ubuntu)    /kernel/drivers/infiniband/core/
sunrpc    DKMS / MOFED        /updates/dkms/
rpcrdma   DKMS / MOFED        /updates/dkms/

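Of these, rpcrdma is the module that MOFED's installer classifies as a "storage module" and blocks on. The in-tree ib_core, ib_cm, and rdma_cm modules are part of the dependency chain but do not appear in the installer's check list. The installer checks for the following modules specifically: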
storage_modules_loaded=$(lsmod | egrep 'ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt' -c)
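
To see which of these modules is actually present, rather than only a count, a filter along the following lines can help. This is a sketch and not the installer's own logic: list_blocking_modules is a hypothetical name, and unlike the installer's substring match it compares only the module-name column of lsmod output.

```shell
# Hypothetical helper: list the blocking storage modules found in lsmod
# output read from stdin. Only column 1 (the module name) is compared, so a
# module that merely appears in another module's "Used by" column is ignored.
list_blocking_modules() {
  awk '$1 ~ /^(ib_isert|nvme_rdma|nvmet_rdma|rpcrdma|xprtrdma|ib_srpt)$/ { print $1 }'
}

# Example usage on a node (not run here):
#   lsmod | list_blocking_modules
```

An empty result means none of the listed modules is loaded under its own name, which is the state the installer expects when UNLOAD_STORAGE_MODULES is not set.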
What Triggers the Load

When a shared volume is attached to a node, the NFS mount triggers sunrpc and subsequently rpcrdma to load. If this happens before the Network Operator has deployed the MOFED pod, MOFED will find these modules already loaded and refuse to install. The boot sequence observed in testing:

Time (approx.)   Event                                     Source
~20s             mlx5_core loads (Mellanox NIC driver)     Kernel PCI enumeration
~30s             sunrpc loads (version 4.0.35-vastdata)    NFS mount / DKMS
~1107s           rpcrdma loads                             NFS mount dependency chain
Later            MOFED DaemonSet pod starts                Kubernetes scheduler

Any node where a shared volume is mounted before the Network Operator is ready is vulnerable to this race condition. The timing depends on when the Kubernetes scheduler places the MOFED pod relative to when the NFS mount triggers rpcrdma loading.
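
A similar timeline can be reconstructed on your own node from the kernel log. The sketch below assumes the default dmesg timestamp format ("[seconds.microseconds]"); first_seen_seconds is a hypothetical helper name, and the search strings are taken from the messages described in this article.

```shell
# Hypothetical helper: print the whole-second timestamp of the first dmesg
# line containing a fixed string.
# $1: fixed string to search for; stdin: dmesg output.
first_seen_seconds() {
  grep -F -m1 -- "$1" | sed -E 's/^\[ *([0-9]+)\.[0-9]+\].*/\1/'
}

# Example usage on a node (not run here):
#   dmesg | first_seen_seconds 'nfs-bundle: rpcrdma loading'
#   dmesg | first_seen_seconds 'mlx5_core'
```

If rpcrdma loaded before the MOFED pod's start time, the node lost the race.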
 
Note: Not every node in a cluster will be affected. On healthy nodes, MOFED won the race: it finished installing before the shared volume mount triggered rpcrdma loading.

Resolution

Set UNLOAD_STORAGE_MODULES=true in the NicClusterPolicy ofedDriver spec using the following steps:

  1. Export the current NicClusterPolicy to a file

    kubectl get nicclusterpolicy nic-cluster-policy -o yaml > nic-cluster-policy.yaml
  2. Edit the file

    Open nic-cluster-policy.yaml and add the following under spec.ofedDriver.env (create the env key if it doesn't exist):

    ofedDriver:
      image: doca-driver
      repository: nvcr.io/nvidia/mellanox
      version: 25.01-0.6.0.0-0
      env:
        - name: UNLOAD_STORAGE_MODULES  # <---- add this
          value: "true"                 # <---- add this

    This tells the MOFED installer to unload the conflicting storage modules before proceeding with driver installation, rather than refusing and exiting.

  3. Apply the updated file

    kubectl apply -f nic-cluster-policy.yaml

Note: A file-based apply (rather than kubectl patch) is recommended because it preserves any environment variables already configured under ofedDriver.

 
After you apply this change, the MOFED pod on the affected node restarts, unloads the conflicting modules, and installs its own DKMS drivers. The SR-IOV device plugin then deploys, nvidia.com/hostdev becomes available, and GPU workloads can schedule.
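
To confirm that the setting was actually stored in the policy, it can be read back with a JSONPath query. This is a minimal sketch: unload_flag_is_set is a hypothetical helper, and it assumes the policy is named nic-cluster-policy as in the steps above.

```shell
# Hypothetical helper: succeed only if the NicClusterPolicy reports
# UNLOAD_STORAGE_MODULES="true" under spec.ofedDriver.env.
unload_flag_is_set() {
  [ "$(kubectl get nicclusterpolicy nic-cluster-policy \
      -o jsonpath='{.spec.ofedDriver.env[?(@.name=="UNLOAD_STORAGE_MODULES")].value}')" = "true" ]
}

# Example usage (not run here):
#   unload_flag_is_set && echo "flag set" || echo "flag missing"
#   kubectl get pods -n nvidia-network-operator | grep mofed
```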

Known Side-Effects
  • With this setting, whenever the MOFED pod restarts after initial installation (e.g., node reboot, pod eviction) or the Network Operator is upgraded, MOFED unloads the RDMA storage modules, including rpcrdma.

  • Any workload actively writing to an NFS-mounted shared volume will experience a temporary I/O stall while the HCA driver is reloaded.

  • Depending on what a workload writes to the shared mount and how it writes it, the workload may fail during the stall.

  • Pause workloads that write to shared volumes during network operator upgrades.

  • In testing, the I/O stall lasted 27 seconds. No data was lost, no errors were raised, and no application crashes occurred. The NFS client's hard-mount retry mechanism absorbed the transport disruption transparently, but this is not guaranteed for all types of workloads.
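
One way to measure the stall for your own workload is to append a timestamp to a file on the shared volume once per second during a MOFED restart, then look for the largest gap between consecutive entries. The sketch below is an assumption-laden illustration: max_gap_seconds is a hypothetical helper and /mnt/shared/heartbeat.log is a placeholder path, not a Crusoe default.

```shell
# Hypothetical helper: given one epoch timestamp (seconds) per line on stdin,
# print the largest gap between consecutive timestamps.
max_gap_seconds() {
  awk 'NR > 1 && $1 - prev > max { max = $1 - prev } { prev = $1 } END { print max + 0 }'
}

# Example usage (not run here; the path is a placeholder):
#   while true; do date +%s >> /mnt/shared/heartbeat.log; sleep 1; done   # run during the restart
#   max_gap_seconds < /mnt/shared/heartbeat.log
```

On a hard mount the write blocks for the duration of the stall, so the largest gap approximates how long I/O was suspended.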

Immediate Workaround (Without Applying the Fix)

If you need to unblock a node immediately without modifying the NicClusterPolicy:
  1. Recreate the affected VM: Delete the affected node from the node pool and let CMK provision a replacement. If MOFED wins the race on the new node (which is the common case), the node will come up healthy.

  2. Ensure shared volumes are not mounted before MOFED: If possible, delay shared volume attachment until after the NVIDIA Network Operator pods are running on the node.
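
For the second workaround, the node label mentioned under Symptoms can serve as a readiness signal. The sketch below assumes that the Network Operator sets network.nvidia.com/operator.mofed.wait to "false" once MOFED is ready on the node; mofed_is_ready is a hypothetical helper name.

```shell
# Hypothetical helper: succeed when the node's mofed.wait label reads "false".
# $1: node name.
mofed_is_ready() {
  [ "$(kubectl get node "$1" \
      -o jsonpath='{.metadata.labels.network\.nvidia\.com/operator\.mofed\.wait}')" = "false" ]
}

# Example usage (not run here): block until MOFED is ready, then attach the volume.
#   until mofed_is_ready "$NODE"; do sleep 10; done
```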

These are temporary workarounds. The UNLOAD_STORAGE_MODULES=true fix is the recommended permanent solution. If you are still seeing this issue, please reach out to Crusoe Cloud support.
 
