How-To Enable NVIDIA PeerMem (Legacy RDMA) on CMK Clusters

Tanaya Atmaram Kambli

Last Updated: March 27, 2026

Introduction

This guide explains how to enable nvidia-peermem (legacy RDMA) on Crusoe Managed Kubernetes (CMK) clusters running the NVIDIA GPU Operator. PeerMem enables GPUDirect RDMA, allowing GPUs to exchange data directly with network adapters (e.g., Mellanox/ConnectX InfiniBand) without going through system memory, which is critical for high-performance distributed workloads.

A common misconception is that setting driver.rdma.enabled=true on the ClusterPolicy is sufficient. In GPU Operator versions that use the NVIDIADriver custom resource (CR), the ClusterPolicy change does not propagate to the driver DaemonSet. The correct target is the NVIDIADriver CR directly.
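
To tell which case applies, you can check whether the ClusterPolicy delegates driver management to the NVIDIADriver CRD. The sketch below is a minimal example and rests on an assumption: that the ClusterPolicy exposes this setting as spec.driver.useNvidiaDriverCRD. Confirm the field name against your installed GPU Operator version. The function name is illustrative.

```shell
# Print the useNvidiaDriverCRD setting of the ClusterPolicy (there is normally only one).
# Assumption: the field is spec.driver.useNvidiaDriverCRD; verify on your operator version.
uses_nvidiadriver_crd() {
  kubectl get clusterpolicy \
    -o jsonpath='{.items[*].spec.driver.useNvidiaDriverCRD}'
}

# Usage:
#   if [ "$(uses_nvidiadriver_crd)" = "true" ]; then
#     echo "Patch the NVIDIADriver CR, not the ClusterPolicy"
#   fi
```

If the command prints nothing, the field is unset or named differently, and the CRD check in the prerequisites is the more reliable signal.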

Prerequisites

  • Access to the CMK cluster via kubectl
  • The NVIDIA GPU Operator add-on installed on the cluster
  • The NVIDIADriver CRDs must be present on the cluster. Verify with:

    $ kubectl get crd | grep -i nvidia

    You should see nvidiadrivers.nvidia.com in the output.

  • The InfiniBand/RDMA stack should be healthy on the nodes. Verify on a node with:

    $ lsmod | grep ib_core

    Example output:

    ib_core               417792  8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
    mlx_compat             69632  12 rdma_cm,ib_ipoib,mlxdevm,mlxfw,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
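
The node-level check can be scripted so that it reports exactly which modules are missing. The following is a minimal sketch: the function name is hypothetical, and the module list in the usage line is taken from the example output above rather than from an official requirement list.

```shell
# Read `lsmod` output on stdin and print each required module (given as
# arguments) that does not appear in the first column.
missing_modules() {
  lsmod_out=$(cat)
  for mod in "$@"; do
    printf '%s\n' "$lsmod_out" \
      | awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' \
      || printf '%s\n' "$mod"
  done
}

# Usage on a node:
#   lsmod | missing_modules ib_core ib_uverbs mlx5_ib mlx5_core
```

Empty output means every listed module is loaded.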

Step-by-Step Instructions

1. Confirm the NVIDIADriver CR name

The driver DaemonSet is owned by an NVIDIADriver CR, not directly by the ClusterPolicy. First, identify the CR name:

$ kubectl get nvidiadriver -A

Note the name of the CR (e.g., b200). You can also trace ownership from the driver DaemonSet:

# Find the name of the NVIDIA GPU driver DaemonSet
$ kubectl get ds -n nvidia-gpu-operator | grep nvidia-gpu-driver

# Replace <daemonset-name> with the name found in the previous command
$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator -o yaml | grep -A5 ownerReferences

This will show the NVIDIADriver CR name under ownerReferences.
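
As an alternative to reading the YAML by eye, a jsonpath filter can pull the owner name out directly. This is a sketch under two assumptions: the ownerReferences entry has kind NVIDIADriver, and the namespace defaults to nvidia-gpu-operator as in the commands above. The function name is illustrative.

```shell
# Print the name of the NVIDIADriver CR that owns the given driver DaemonSet.
# Assumption: the owner reference has kind "NVIDIADriver".
driver_cr_for_ds() {
  ds_name=$1
  namespace=${2:-nvidia-gpu-operator}
  kubectl get ds "$ds_name" -n "$namespace" \
    -o jsonpath='{.metadata.ownerReferences[?(@.kind=="NVIDIADriver")].name}'
}

# Usage:
#   driver_cr_for_ds <daemonset-name>
```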

2. Patch the NVIDIADriver CR to enable RDMA

Patch the NVIDIADriver CR, replacing b200 with your CR name if it differs:

$ kubectl patch nvidiadriver b200 \
  --type merge \
  -p '{"spec":{"rdma":{"enabled":true}}}'

Expected output:

nvidiadriver.nvidia.com/b200 patched

Note: Patching the ClusterPolicy alone is not enough. In clusters that use the NVIDIADriver CR, it does not trigger DaemonSet reconciliation.
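
To confirm that the patch was stored, you can read the same field back from the CR. The sketch below queries spec.rdma.enabled, the path used in the patch; the function name is hypothetical.

```shell
# Succeed only if spec.rdma.enabled reads back as "true" on the given NVIDIADriver CR.
rdma_enabled() {
  cr_name=$1
  [ "$(kubectl get nvidiadriver "$cr_name" -o jsonpath='{.spec.rdma.enabled}')" = "true" ]
}

# Usage:
#   rdma_enabled b200 && echo "RDMA is enabled on b200"
```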

3. Verify the DaemonSet has been reconciled

After patching, confirm that the driver DaemonSet now includes the nvidia-peermem-ctr container:

$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[*].name}'

Expected output:

nvidia-driver-ctr nvidia-peermem-ctr
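
In a script, the container list can be checked for an exact name match instead of being compared by eye. A minimal sketch, with an illustrative function name:

```shell
# Succeed if the space-separated container list in $1 contains the name in $2.
has_container() {
  for name in $1; do
    [ "$name" = "$2" ] && return 0
  done
  return 1
}

# Usage:
#   names=$(kubectl get ds <daemonset-name> -n nvidia-gpu-operator \
#     -o jsonpath='{.spec.template.spec.containers[*].name}')
#   has_container "$names" nvidia-peermem-ctr && echo "peermem container present"
```

Exact matching avoids a false positive from a container whose name merely contains the string.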

If the DaemonSet update strategy is OnDelete, existing pods are not restarted automatically. Delete the driver pods manually to trigger the update:

# Get the driver pod name
$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-gpu-driver

# Delete the pod so it is recreated with the new spec
$ kubectl delete pod <driver-pod-name> -n nvidia-gpu-operator
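
Whether manual deletion is needed depends on the DaemonSet's spec.updateStrategy.type, which can be read directly. The following sketch uses hypothetical function names and assumes the nvidia-gpu-operator namespace from the commands above as the default.

```shell
# Print the update strategy type (RollingUpdate or OnDelete) of a DaemonSet.
ds_update_strategy() {
  ds_name=$1
  namespace=${2:-nvidia-gpu-operator}
  kubectl get ds "$ds_name" -n "$namespace" \
    -o jsonpath='{.spec.updateStrategy.type}'
}

# Succeed only when pods must be deleted by hand to pick up the new spec.
needs_manual_restart() {
  [ "$(ds_update_strategy "$@")" = "OnDelete" ]
}

# Usage:
#   needs_manual_restart <daemonset-name> && echo "Delete driver pods manually"
```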

4. Confirm peermem is loaded successfully

Check the logs of the nvidia-peermem-ctr container in the driver pod:

$ kubectl logs -n nvidia-gpu-operator <driver-pod-name> -c nvidia-peermem-ctr

Expected output:

DRIVER_ARCH is x86_64
...
waiting for mellanox ofed and nvidia drivers to be installed
waiting for mellanox ofed and nvidia drivers to be installed
...
successfully loaded nvidia-peermem module, now waiting for signal

Note: The nvidia-peermem-ctr container may restart once on initial startup due to a module load race condition. This is expected; the container stabilizes after the main driver container finishes loading.
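
For automation, the success line can be matched in the log stream rather than read manually. This is a minimal sketch that keys on the message text shown in the expected output above; the function name is illustrative.

```shell
# Read container logs on stdin; succeed if the module load message is present.
peermem_loaded() {
  grep -q 'successfully loaded nvidia-peermem module'
}

# Usage:
#   kubectl logs -n nvidia-gpu-operator <driver-pod-name> -c nvidia-peermem-ctr \
#     | peermem_loaded && echo "nvidia-peermem is loaded"
```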

5. Verify on the node

SSH into the node and confirm the module is loaded:

$ lsmod | grep nvidia_peermem

Expected output:

nvidia_peermem         16384  0
nvidia              11526144  77 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs             163840  3 nvidia_peermem,rdma_ucm,mlx5_ib

Optionally, run the following to confirm RDMA devices are visible and properly configured:

# Install ibverbs-utils if it is not already present (Debian/Ubuntu)
$ sudo apt install ibverbs-utils

# List RDMA devices with verbose attributes
$ ibv_devinfo -v

A healthy setup will show each HCA with valid gid_table entries and active ports.
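
Because the verbose output is long, a short filter can summarize it. The sketch below counts HCAs and active ports. It assumes the usual ibv_devinfo line layout, where each device starts with an `hca_id:` line and each port reports `state: PORT_ACTIVE (4)` when up. The function name is hypothetical, and the filter does not inspect gid_table entries.

```shell
# Read `ibv_devinfo` output on stdin and print "<HCA count> <active port count>".
ib_summary() {
  awk '
    $1 == "hca_id:"                       { hcas++ }
    $1 == "state:" && $2 == "PORT_ACTIVE" { active++ }
    END { printf "%d %d\n", hcas, active }
  '
}

# Usage on a node:
#   ibv_devinfo | ib_summary
```

If the second number is lower than the number of ports you expect to be cabled, inspect the full output for ports in another state.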

Common Issues

Issue: Patching ClusterPolicy has no effect on the DaemonSet 

The ClusterPolicy change does not propagate when an NVIDIADriver CR is present. Target the NVIDIADriver CR instead (see Step 2).

Issue: nvidia-peermem-ctr restarts once on startup 

This is expected. The container runs reload_nvidia_peermem and may fail on the first attempt if the main driver container hasn't finished loading modules yet. It stabilizes automatically; confirm with kubectl logs (see Step 4).
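
To distinguish a single expected restart from a crash loop, you can read the container's restart count from the pod status. A minimal sketch, with an illustrative function name and the nvidia-gpu-operator namespace assumed as the default:

```shell
# Print the restart count of the nvidia-peermem-ctr container in a driver pod.
peermem_restart_count() {
  pod_name=$1
  namespace=${2:-nvidia-gpu-operator}
  kubectl get pod "$pod_name" -n "$namespace" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="nvidia-peermem-ctr")].restartCount}'
}

# Usage:
#   peermem_restart_count <driver-pod-name>
```

A count that keeps rising after the main driver container is ready suggests a problem beyond the startup race.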

Issue: DaemonSet not updating after patch 

Check the DaemonSet's updateStrategy. If it is OnDelete, pods must be deleted manually to pick up the new spec. Delete the driver pod and allow it to be rescheduled.
