Last Updated: March 27, 2026
Introduction
This guide explains how to enable nvidia-peermem (legacy RDMA) on Crusoe Managed Kubernetes (CMK) clusters running the NVIDIA GPU Operator. PeerMem enables GPUDirect RDMA, which lets GPUs exchange data directly with network adapters (e.g., Mellanox/ConnectX InfiniBand) without staging it in system memory. This is critical for high-performance distributed workloads.
A common misconception is that setting driver.rdma.enabled=true on the ClusterPolicy is sufficient. In GPU Operator versions that use the NVIDIADriver custom resource (CR), the ClusterPolicy change does not propagate to the driver DaemonSet. The correct target is the NVIDIADriver CR directly.
Prerequisites
- Access to the CMK cluster via kubectl
- The NVIDIA GPU Operator add-on installed on the cluster
- The cluster must have NVIDIADriver CRDs present. Verify with:
$ kubectl get crd | grep -i nvidia
You should see nvidiadrivers.nvidia.com in the output.
- The InfiniBand/RDMA stack should be healthy on the nodes. Verify with:
$ lsmod | grep ib_core
Example output:
ib_core 417792 8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat 69632 12 rdma_cm,ib_ipoib,mlxdevm,mlxfw,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
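The two prerequisite checks above can be wrapped in small helpers for scripting. This is a minimal sketch, not part of any NVIDIA or Crusoe tooling: the function names has_nvidiadriver_crd and has_ib_core are hypothetical, and each reads the relevant command output from standard input so it can be tried without cluster access.

```shell
# Hypothetical helpers; the names are illustrative only.

# Succeeds if the NVIDIADriver CRD appears in `kubectl get crd` output on stdin.
has_nvidiadriver_crd() {
  grep -q '^nvidiadrivers\.nvidia\.com'
}

# Succeeds if ib_core is listed as a loaded module in `lsmod` output on stdin.
# Anchoring at line start avoids matching modules that merely depend on ib_core.
has_ib_core() {
  grep -q '^ib_core[[:space:]]'
}

# Usage (first line from a machine with kubectl access, second on a GPU node):
#   kubectl get crd | has_nvidiadriver_crd && echo "NVIDIADriver CRD present"
#   lsmod | has_ib_core && echo "ib_core loaded"
```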
Step-by-Step Instructions
1. Confirm the NVIDIADriver CR name
The driver DaemonSet is owned by an NVIDIADriver CR, not directly by the ClusterPolicy. First, identify the CR name:
$ kubectl get nvidiadriver -A
Note the name of the CR (e.g., b200). You can also trace ownership from the driver DaemonSet:
# Find the name of the NVIDIA GPU driver DaemonSet
$ kubectl get ds -n nvidia-gpu-operator | grep nvidia-gpu-driver
# Replace <daemonset-name> with the name found in the previous command
$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator -o yaml | grep -A5 ownerReferences
This will show the NVIDIADriver CR name under ownerReferences.
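As an alternative to grepping the YAML, the owner name can be extracted with a JSONPath filter. The sketch below assumes the owner reference has kind NVIDIADriver and that the operator runs in the nvidia-gpu-operator namespace used throughout this guide; the function name driver_cr_owner is hypothetical. If the owner kind differs on your cluster, the command prints nothing.

```shell
# Hypothetical helper: print the name of the NVIDIADriver CR that owns a driver DaemonSet.
# $1 = DaemonSet name. Assumes the owner reference kind is "NVIDIADriver".
driver_cr_owner() {
  kubectl get ds "$1" -n nvidia-gpu-operator \
    -o jsonpath='{.metadata.ownerReferences[?(@.kind=="NVIDIADriver")].name}'
}

# Usage:
#   driver_cr_owner <daemonset-name>
```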
2. Patch the NVIDIADriver CR to enable RDMA
Patch the NVIDIADriver CR (Note: replace b200 with your CR name if different):
$ kubectl patch nvidiadriver b200 \
--type merge \
-p '{"spec":{"rdma":{"enabled":true}}}'
Expected output:
nvidiadriver.nvidia.com/b200 patched
Note: Do not patch ClusterPolicy alone. In clusters using the NVIDIADriver CR, this will not trigger DaemonSet reconciliation.
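Before moving on, the patched field can be read back from the CR to confirm the change was stored. A minimal sketch, assuming the same spec.rdma.enabled path used in the patch; the function name rdma_enabled is hypothetical.

```shell
# Hypothetical helper: print the current value of spec.rdma.enabled on an NVIDIADriver CR.
# $1 = CR name (for example, b200). Prints "true" once the patch has been applied.
rdma_enabled() {
  kubectl get nvidiadriver "$1" -o jsonpath='{.spec.rdma.enabled}'
}

# Usage:
#   [ "$(rdma_enabled b200)" = "true" ] && echo "RDMA is enabled on the CR"
```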
3. Verify the DaemonSet has been reconciled
After patching, confirm that the driver DaemonSet now includes the nvidia-peermem-ctr container:
$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator \
-o jsonpath='{.spec.template.spec.containers[*].name}'
Expected output:
nvidia-driver-ctr nvidia-peermem-ctr
If the DaemonSet update strategy is OnDelete, existing pods will not be restarted automatically. You may need to delete driver pods manually to trigger the update:
# Get pod name
$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-gpu-driver
# Delete pod
$ kubectl delete pod <driver-pod-name> -n nvidia-gpu-operator
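The container check in this step can be scripted rather than read by eye. The sketch below is illustrative only; has_peermem_container is a hypothetical name, and it reads the space-separated container list produced by the jsonpath query from standard input.

```shell
# Hypothetical helper: succeeds if nvidia-peermem-ctr is among the
# space-separated container names read from stdin.
has_peermem_container() {
  tr ' ' '\n' | grep -qx 'nvidia-peermem-ctr'
}

# Usage:
#   kubectl get ds <daemonset-name> -n nvidia-gpu-operator \
#     -o jsonpath='{.spec.template.spec.containers[*].name}' | has_peermem_container \
#     && echo "DaemonSet reconciled"
```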
4. Confirm peermem is loaded successfully
Check the logs of the nvidia-peermem-ctr container in the driver pod:
$ kubectl logs -n nvidia-gpu-operator <driver-pod-name> -c nvidia-peermem-ctr
Expected output:
DRIVER_ARCH is x86_64
...
waiting for mellanox ofed and nvidia drivers to be installed
waiting for mellanox ofed and nvidia drivers to be installed
...
successfully loaded nvidia-peermem module, now waiting for signal
Note: The nvidia-peermem-ctr container may restart once on initial startup due to a module load race condition. This is expected; the container stabilizes after the main driver container finishes loading.
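Because the container can take a while to settle, it may be convenient to poll the logs for the success message instead of checking by hand. The following is a rough sketch under stated assumptions: wait_for_peermem is a hypothetical name, the attempt count and delay defaults are arbitrary, and the match string is the success line shown in the expected output above.

```shell
# Hypothetical helper: poll the nvidia-peermem-ctr logs until the success
# message appears or the attempts run out.
# $1 = driver pod name, $2 = max attempts (default 30), $3 = seconds between attempts (default 10)
wait_for_peermem() {
  local pod="$1" attempts="${2:-30}" delay="${3:-10}" i=0
  while [ "$i" -lt "$attempts" ]; do
    if kubectl logs -n nvidia-gpu-operator "$pod" -c nvidia-peermem-ctr 2>/dev/null \
        | grep -q 'successfully loaded nvidia-peermem module'; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Usage:
#   wait_for_peermem <driver-pod-name> && echo "peermem loaded"
```

Note that if the driver pod is recreated during the wait, its name changes and the function needs to be called again with the new pod name.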
5. Verify on the node
SSH into the node and confirm the module is loaded:
$ lsmod | grep nvidia_peermem
Expected output:
nvidia_peermem 16384 0
nvidia 11526144 77 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs 163840 3 nvidia_peermem,rdma_ucm,mlx5_ib
Optionally, run the following to confirm RDMA devices are visible and properly configured:
# Install (if not already done)
$ sudo apt install ibverbs-utils
# Then run the following command
$ ibv_devinfo -v
A healthy setup will show each HCA with valid gid_table entries and active ports.
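For a quick summary of port health, the ibv_devinfo output can be filtered for ports in the PORT_ACTIVE state. This is a small sketch, not a full health check: count_active_ports is a hypothetical name, it relies on the state: PORT_ACTIVE line that ibv_devinfo prints per port, and it does not inspect gid_table entries.

```shell
# Hypothetical helper: count ports reported as PORT_ACTIVE in
# `ibv_devinfo` output read from stdin.
count_active_ports() {
  grep -c 'state:.*PORT_ACTIVE'
}

# Usage (on the node):
#   ibv_devinfo | count_active_ports
```

Compare the count against the number of HCA ports you expect to be cabled and up on that node.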
Common Issues
Issue: Patching ClusterPolicy has no effect on the DaemonSet
The ClusterPolicy change does not propagate when an NVIDIADriver CR is present. Target the NVIDIADriver CR instead (see Step 2).
Issue: nvidia-peermem-ctr restarts once on startup
This is expected. The container runs reload_nvidia_peermem and may fail on the first attempt if the main driver container hasn't finished loading modules yet. It will stabilize automatically - confirm with kubectl logs.
Issue: DaemonSet not updating after patch
Check the DaemonSet's updateStrategy. If it is OnDelete, pods must be deleted manually to pick up the new spec. Delete the driver pod and allow it to be rescheduled.
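The update strategy can be read directly from the DaemonSet spec. A minimal sketch, assuming the nvidia-gpu-operator namespace used throughout this guide; the function names ds_update_strategy and needs_manual_restart are hypothetical.

```shell
# Hypothetical helper: print the update strategy type of a driver DaemonSet
# (RollingUpdate or OnDelete). $1 = DaemonSet name.
ds_update_strategy() {
  kubectl get ds "$1" -n nvidia-gpu-operator -o jsonpath='{.spec.updateStrategy.type}'
}

# Hypothetical helper: succeeds if driver pods must be deleted manually
# to pick up a new spec.
needs_manual_restart() {
  [ "$(ds_update_strategy "$1")" = "OnDelete" ]
}

# Usage:
#   needs_manual_restart <daemonset-name> && echo "Delete driver pods to apply the update"
```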