How-To Enable NVIDIA PeerMem (Legacy RDMA) on CMK Clusters

Last Updated: March 27, 2026

Introduction

This guide explains how to enable nvidia-peermem (legacy RDMA) on Crusoe Managed Kubernetes (CMK) clusters running the NVIDIA GPU Operator. The nvidia-peermem module enables GPUDirect RDMA, allowing GPUs to exchange data directly with network adapters (e.g., Mellanox/ConnectX InfiniBand) without going through system memory. This direct path is critical for high-performance distributed workloads.

A common misconception is that setting driver.rdma.enabled=true on the ClusterPolicy is sufficient. In GPU Operator versions that use the NVIDIADriver custom resource (CR), the ClusterPolicy change does not propagate to the driver DaemonSet. The correct target is the NVIDIADriver CR directly.

Prerequisites

Access to the CMK cluster via kubectl
The NVIDIA GPU Operator add-on installed on the cluster
The NVIDIADriver CRD present on the cluster. Verify with:
```
$ kubectl get crd | grep -i nvidia
```
You should see nvidiadrivers.nvidia.com in the output.
The InfiniBand/RDMA stack should be healthy on the nodes. Verify with:
```
$ lsmod | grep ib_core
```

Example output:

ib_core               417792  8 rdma_cm,ib_ipoib,iw_cm,ib_umad,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm
mlx_compat             69632  12 rdma_cm,ib_ipoib,mlxdevm,mlxfw,iw_cm,ib_umad,ib_core,rdma_ucm,ib_uverbs,mlx5_ib,ib_cm,mlx5_core
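
The two checks above can be wrapped in small helper functions for scripting. This is a minimal sketch, not part of any NVIDIA or Crusoe tooling: the function names has_nvidiadriver_crd and ib_core_loaded are hypothetical, and the second function is meant to run on a GPU node rather than on the machine where kubectl runs.

```shell
# Preflight sketch; function names are illustrative only.

# Succeeds if the NVIDIADriver CRD is registered in the cluster.
has_nvidiadriver_crd() {
  kubectl get crd nvidiadrivers.nvidia.com >/dev/null 2>&1
}

# Succeeds if ib_core appears as a loaded module in lsmod output.
# Run this on a GPU node.
ib_core_loaded() {
  lsmod | awk '$1 == "ib_core" { found = 1 } END { exit !found }'
}
```

For example, `has_nvidiadriver_crd && echo "CRD present"` prints a confirmation only when the CRD lookup succeeds.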

Step-by-Step Instructions

1. Confirm the NVIDIADriver CR name

The driver DaemonSet is owned by an NVIDIADriver CR, not directly by the ClusterPolicy. First, identify the CR name:

$ kubectl get nvidiadriver -A

Note the name of the CR (e.g., b200). You can also trace ownership from the driver DaemonSet:

# Find the name of the NVIDIA GPU driver DaemonSet
$ kubectl get ds -n nvidia-gpu-operator | grep nvidia-gpu-driver

# Replace <daemonset-name> with the name found in the previous command
$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator -o yaml | grep -A5 ownerReferences

This will show the NVIDIADriver CR name under ownerReferences.
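
If a single value is easier to script against than grepped YAML, the owner lookup can be sketched with a JSONPath query. The helper name driver_cr_for_ds is hypothetical, and the sketch assumes the DaemonSet has a single owner reference pointing at the NVIDIADriver CR, as described above.

```shell
# Print the name of the first owner reference of a driver DaemonSet.
# Usage: driver_cr_for_ds <daemonset-name> [namespace]
driver_cr_for_ds() {
  local ds="$1"
  local ns="${2:-nvidia-gpu-operator}"
  kubectl get ds "$ds" -n "$ns" \
    -o jsonpath='{.metadata.ownerReferences[0].name}'
}
```

The namespace defaults to nvidia-gpu-operator, matching the commands in this guide; pass a second argument if your installation differs.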

2. Patch the NVIDIADriver CR to enable RDMA

Patch the NVIDIADriver CR, replacing b200 with your CR name if it differs:

$ kubectl patch nvidiadriver b200 \
  --type merge \
  -p '{"spec":{"rdma":{"enabled":true}}}'

Expected output:

nvidiadriver.nvidia.com/b200 patched

Note: Do not patch the ClusterPolicy alone. In clusters using the NVIDIADriver CR, that change will not trigger DaemonSet reconciliation.
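
To confirm that the patch was stored on the CR before moving on, the field can be read back. This is a sketch under the assumption that the flag lives at spec.rdma.enabled, the same path used in the patch above; the helper name rdma_enabled is hypothetical.

```shell
# Succeeds if spec.rdma.enabled is true on the given NVIDIADriver CR.
# Usage: rdma_enabled <cr-name>
rdma_enabled() {
  local value
  value="$(kubectl get nvidiadriver "$1" -o jsonpath='{.spec.rdma.enabled}')" || return 1
  [ "$value" = "true" ]
}
```

For example, `rdma_enabled b200 && echo "RDMA enabled on b200"`.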

3. Verify the DaemonSet has been reconciled

After patching, confirm that the driver DaemonSet now includes the nvidia-peermem-ctr container:

$ kubectl get ds <daemonset-name> -n nvidia-gpu-operator \
  -o jsonpath='{.spec.template.spec.containers[*].name}'

Expected output:

nvidia-driver-ctr nvidia-peermem-ctr

If the DaemonSet update strategy is OnDelete, existing pods are not restarted automatically. Delete the driver pods manually to trigger the update:

# Get the driver pod name
$ kubectl get pods -n nvidia-gpu-operator | grep nvidia-gpu-driver

# Delete the pod so it is recreated with the new spec
$ kubectl delete pod <driver-pod-name> -n nvidia-gpu-operator
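
The two checks in this step (container present, update strategy) can be scripted as follows. This is a minimal sketch with hypothetical helper names, ds_update_strategy and ds_has_peermem; it only reads the DaemonSet and does not delete any pods.

```shell
# Print the DaemonSet update strategy (for example RollingUpdate or OnDelete).
# Usage: ds_update_strategy <daemonset-name> [namespace]
ds_update_strategy() {
  kubectl get ds "$1" -n "${2:-nvidia-gpu-operator}" \
    -o jsonpath='{.spec.updateStrategy.type}'
}

# Succeeds if the DaemonSet pod template lists the nvidia-peermem-ctr container.
# Usage: ds_has_peermem <daemonset-name> [namespace]
ds_has_peermem() {
  kubectl get ds "$1" -n "${2:-nvidia-gpu-operator}" \
    -o jsonpath='{.spec.template.spec.containers[*].name}' \
    | tr ' ' '\n' | grep -qx 'nvidia-peermem-ctr'
}
```

If ds_has_peermem succeeds but ds_update_strategy prints OnDelete, the template is updated and the running pods still need to be deleted manually.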

4. Confirm peermem is loaded successfully

Check the logs of the nvidia-peermem-ctr container in the driver pod:

$ kubectl logs -n nvidia-gpu-operator <driver-pod-name> -c nvidia-peermem-ctr

Expected output:

DRIVER_ARCH is x86_64
...
waiting for mellanox ofed and nvidia drivers to be installed
waiting for mellanox ofed and nvidia drivers to be installed
...
successfully loaded nvidia-peermem module, now waiting for signal

Note: The nvidia-peermem-ctr container may restart once on initial startup due to a module load race condition. This is expected. The container stabilizes after the main driver container finishes loading.
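
Because the success message can take a while to appear, a polling loop may be more convenient than rereading the logs by hand. The sketch below assumes the log line shown above is the success marker; the helper names, the default of 30 attempts, and the 10-second delay are arbitrary illustrative choices.

```shell
# Succeeds if the peermem container logs contain the success message.
# Usage: peermem_loaded <driver-pod-name> [namespace]
peermem_loaded() {
  kubectl logs -n "${2:-nvidia-gpu-operator}" "$1" -c nvidia-peermem-ctr 2>/dev/null \
    | grep -q 'successfully loaded nvidia-peermem module'
}

# Poll until the module is reported as loaded or the attempts run out.
# Usage: wait_for_peermem <driver-pod-name> [attempts] [delay-seconds]
wait_for_peermem() {
  local pod="$1"
  local attempts="${2:-30}"
  local delay="${3:-10}"
  local i=0
  while [ "$i" -lt "$attempts" ]; do
    peermem_loaded "$pod" && return 0
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for_peermem <driver-pod-name> 12 5` checks the logs up to 12 times, pausing 5 seconds after each unsuccessful check.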

5. Verify on the node

SSH into the node and confirm the module is loaded:

$ lsmod | grep nvidia_peermem

Expected output:

nvidia_peermem         16384  0
nvidia              11526144  77 nvidia_uvm,nvidia_peermem,nvidia_modeset
ib_uverbs             163840  3 nvidia_peermem,rdma_ucm,mlx5_ib

Optionally, run the following to confirm RDMA devices are visible and properly configured:

# Install ibverbs-utils if it is not already present (Debian/Ubuntu)
$ sudo apt install ibverbs-utils

# List RDMA devices with verbose attributes
$ ibv_devinfo -v

A healthy setup will show each HCA with valid gid_table entries and active ports.
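
The node-level checks can also be scripted. This is a minimal sketch with hypothetical helper names: peermem_module_loaded reads /proc/modules by default (the optional path argument exists only to make the function easy to exercise), and count_active_ports counts lines containing PORT_ACTIVE in ibv_devinfo output supplied on stdin.

```shell
# Succeeds if nvidia_peermem is listed among the loaded kernel modules.
# Usage: peermem_module_loaded [modules-file]
peermem_module_loaded() {
  awk '$1 == "nvidia_peermem" { found = 1 } END { exit !found }' "${1:-/proc/modules}"
}

# Print the number of lines reporting PORT_ACTIVE in ibv_devinfo output.
# Usage: ibv_devinfo | count_active_ports
count_active_ports() {
  grep -c 'PORT_ACTIVE' || true
}
```

A count of 0 from count_active_ports suggests that no port is active and the fabric configuration should be checked before testing RDMA workloads.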

Common Issues

Issue: Patching ClusterPolicy has no effect on the DaemonSet

The ClusterPolicy change does not propagate when an NVIDIADriver CR is present. Target the NVIDIADriver CR instead (see Step 2).

Issue: nvidia-peermem-ctr restarts once on startup

This is expected. The container runs reload_nvidia_peermem and may fail on the first attempt if the main driver container has not finished loading modules yet. It stabilizes automatically; confirm with kubectl logs (see Step 4).
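
To tell a single expected restart apart from a crash loop, the container's restart count can be read from the pod status. This is a sketch with hypothetical helper names; the threshold of one restart simply mirrors the behavior described above and is not an official limit.

```shell
# Print the restart count of the nvidia-peermem-ctr container in a driver pod.
# Usage: peermem_restart_count <driver-pod-name> [namespace]
peermem_restart_count() {
  kubectl get pod "$1" -n "${2:-nvidia-gpu-operator}" \
    -o jsonpath='{.status.containerStatuses[?(@.name=="nvidia-peermem-ctr")].restartCount}'
}

# Succeeds if the container has restarted at most once.
# Usage: peermem_restarts_expected <driver-pod-name> [namespace]
peermem_restarts_expected() {
  local count
  count="$(peermem_restart_count "$@")" || return 1
  [ "${count:-0}" -le 1 ]
}
```

If peermem_restarts_expected fails, the container is restarting more often than the startup race explains, and its logs are the next place to look.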

Issue: DaemonSet not updating after patch

Check the DaemonSet's updateStrategy. If it is OnDelete, pods must be deleted manually to pick up the new spec. Delete the driver pod and allow it to be rescheduled.

