How-To Validate DeepEP GPU Communication Performance on Crusoe Cloud

Last Updated: March 31, 2026

Introduction

DeepEP is an open-source communication library from DeepSeek that accelerates Expert Parallelism (EP) in Mixture-of-Experts (MoE) models. It provides optimised all-to-all communication kernels for dispatching tokens to expert GPUs and combining results, using NVLink for intra-node traffic and InfiniBand RDMA for inter-node traffic.

This guide walks through installing DeepEP on Crusoe Cloud GPU IB instances, configuring GPU-to-HCA mapping (critical on Crusoe), and running all three benchmark suites: intra-node (NVLink), low-latency (IBGDA), and inter-node (NVLink + RDMA). By the end, you will have validated that your GPU communication fabric is performing at expected levels.

Prerequisites

Two Crusoe Cloud SXM IB instances with 8 GPUs per node (e.g., h200-141gb-sxm-ib.8x)
Image: ubuntu22.04-nvidia-sxm-docker, latest
SSH access to both nodes
Firewall rule allowing TCP port 8361 between the two nodes (used by PyTorch distributed initialisation)
Passwordless SSH between both nodes (for inter-node test only)

Note: This guide was tested on H200 SXM IB nodes. The steps should apply to other Crusoe SXM IB instance types (e.g., H100) with the same image, though expected benchmark numbers will vary by GPU generation.

Expected environment after provisioning:

Component	Expected value
NVIDIA Driver	570.x (R570)
CUDA	12.8
GPUs per node	8× NVIDIA H200 SXM (141 GB HBM3e)
InfiniBand NICs	8× ConnectX-7 per node (mlx5_1 through mlx5_8)
GDRCopy	Pre-installed (gdrdrv module loaded)

Step-by-Step Instructions

Step 1: Prepare the System (Run the following on both nodes.)

Update package lists and install build dependencies:

$ sudo apt update
$ sudo apt install -y cmake ninja-build python3-pip

Verify the pre-installed components:
```
$ nvidia-smi
$ ls /usr/local/cuda/bin/nvcc
$ lsmod | grep gdrdrv
```
You should see 8 GPUs, the CUDA 12.8 toolkit, and the gdrdrv kernel module loaded.
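The same three checks can be scripted so they are easy to repeat on both nodes. The following is a minimal sketch, not part of DeepEP or the Crusoe image. The helper names (count_gpus, module_loaded, run) are hypothetical, and it assumes that nvidia-smi -L prints one line per GPU beginning with "GPU ".

```python
import shutil
import subprocess


def count_gpus(nvidia_smi_list_output: str) -> int:
    """Count GPUs in `nvidia-smi -L` output (assumed: one 'GPU <n>: ...' line each)."""
    return sum(1 for line in nvidia_smi_list_output.splitlines()
               if line.strip().startswith("GPU "))


def module_loaded(lsmod_output: str, name: str) -> bool:
    """Return True if `name` is a module name (first column) in lsmod output."""
    return any(line.split()[0] == name
               for line in lsmod_output.splitlines() if line.strip())


def run(cmd: list[str]) -> str:
    """Run a command and return its stdout, or '' if the binary is not on PATH."""
    if shutil.which(cmd[0]) is None:
        return ""
    return subprocess.run(cmd, capture_output=True, text=True).stdout


if __name__ == "__main__":
    print("GPUs visible:", count_gpus(run(["nvidia-smi", "-L"])), "(expected 8)")
    print("nvcc on PATH:", shutil.which("nvcc"))
    print("gdrdrv loaded:", module_loaded(run(["lsmod"]), "gdrdrv"))
```

If nvcc is reported as None here, that is expected until CUDA has been added to the shell environment in the next part of this step.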

Add CUDA to your shell environment:

$ echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
$ echo 'export PATH="${CUDA_HOME}/bin:${PATH}"' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"' >> ~/.bashrc
$ source ~/.bashrc
$ nvcc --version

Step 2: Configure NVIDIA Driver for IBGDA

DeepEP's low-latency mode uses IBGDA (InfiniBand GPUDirect Async), which requires the GPU to memory-map the InfiniBand NIC's UAR (User Access Region). This needs two driver parameters:

NVreg_EnableStreamMemOPs=1: enables CUDA stream-ordered memory operations for GPU-initiated RDMA.
PeerMappingOverride=1 (passed through NVreg_RegistryDwords): allows peer memory mappings across GPUs for GPUDirect RDMA.

Create the modprobe config. (Run the following on both nodes.)

$ echo 'options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"' | sudo tee /etc/modprobe.d/nvidia.conf

Update initramfs and reboot:
```
$ sudo update-initramfs -u
$ sync
$ sudo reboot
```
Warning: If you are automating this setup with a script, the sync command is critical. There is a known race condition where the system may reboot before update-initramfs finishes writing to disk, resulting in a corrupted initramfs that prevents sshd from starting. When running commands manually (step by step), this is not an issue.
After reboot, verify the parameters took effect:
```
$ cat /proc/driver/nvidia/params | grep -E "EnableStreamMemOPs|PeerMappingOverride"
```
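For scripted setups, the grep above can be turned into a pass/fail check. This is a sketch under one assumption: that /proc/driver/nvidia/params lists one "Key: value" pair per line, with PeerMappingOverride appearing inside the RegistryDwords value. The function names are hypothetical.

```python
from pathlib import Path


def parse_params(text: str) -> dict[str, str]:
    """Parse 'Key: value' lines into a dict, stripping surrounding double quotes."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            params[key.strip()] = value.strip().strip('"')
    return params


def ibgda_ready(params: dict[str, str]) -> bool:
    """True if both settings from the modprobe config are visible to the driver."""
    stream_ops = params.get("EnableStreamMemOPs") == "1"
    peer_override = "PeerMappingOverride=1" in params.get("RegistryDwords", "")
    return stream_ops and peer_override


if __name__ == "__main__":
    path = Path("/proc/driver/nvidia/params")
    if path.exists():
        print("IBGDA driver parameters set:", ibgda_ready(parse_params(path.read_text())))
    else:
        print(f"{path} not found; is the NVIDIA driver loaded?")
```

A False result after a reboot usually means the modprobe config was not picked up, so re-check /etc/modprobe.d/nvidia.conf and the initramfs update.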

Step 3: Install PyTorch and NVSHMEM

First, check your system's CUDA version:
```
$ nvidia-smi | grep "CUDA Version"
```
Install PyTorch with the matching CUDA variant. For example, if nvidia-smi shows CUDA 12.8:
```
$ pip3 install torch numpy --index-url https://download.pytorch.org/whl/cu128
```
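Note: The default pip3 install torch may pull wheels built for a newer CUDA version (e.g., 13.0) that does not match your system's CUDA toolkit. This will cause DeepEP's build to fail with a CUDA version mismatch error. Always pass --index-url with the CUDA variant that matches your system. nvidia-smi reports the highest CUDA version the driver supports; on this image it matches the installed toolkit (12.8), which you can confirm with nvcc --version.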
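The mapping from the reported CUDA version to the wheel index is mechanical: 12.8 becomes cu128. The sketch below derives the index URL from nvidia-smi output. The function names are hypothetical, and it assumes the nvidia-smi header contains the text "CUDA Version: X.Y". PyTorch does not publish wheels for every CUDA version, so treat the printed URL as a candidate to verify rather than a guarantee.

```python
import re
import shutil
import subprocess


def driver_cuda_version(nvidia_smi_output: str):
    """Extract (major, minor) from a 'CUDA Version: X.Y' field, or None if absent."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", nvidia_smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def pytorch_index_url(major: int, minor: int) -> str:
    """Build the PyTorch wheel index URL for a CUDA variant, e.g. 12.8 -> cu128."""
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


if __name__ == "__main__":
    if shutil.which("nvidia-smi"):
        out = subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout
        version = driver_cuda_version(out)
        if version:
            print("pip3 install torch numpy --index-url", pytorch_index_url(*version))
```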
PyTorch automatically pulls nvidia-nvshmem as a dependency, so no manual NVSHMEM source build is required. Verify it was installed and note the path:
```
$ pip3 list | grep nvshmem
$ python3 -c "import nvidia.nvshmem; print(nvidia.nvshmem.__path__)"
```
Set NVSHMEM paths persistently on both nodes:
```
$ cat >> ~/.bashrc << 'EOF'
export NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:${LD_LIBRARY_PATH}"
EOF
$ source ~/.bashrc
```
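The path above is hardcoded for the ubuntu user and Python 3.10. If your layout differs, the directory can be looked up instead. This is a minimal sketch using the standard library's importlib; package_dir is a hypothetical helper name, and it resolves the package without requiring it to define __file__.

```python
import importlib.util


def package_dir(name: str):
    """Return the first on-disk directory of package `name`, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ModuleNotFoundError, ValueError):
        # Raised when a parent package (e.g. 'nvidia') is not installed.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


if __name__ == "__main__":
    path = package_dir("nvidia.nvshmem")
    print(f"export NVSHMEM_DIR={path}" if path else "nvidia.nvshmem is not installed")
```

The printed export line can be pasted into ~/.bashrc in place of the hardcoded path.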

Step 4: Install DeepEP

Clone the repository:
```
$ cd ~
$ git clone https://github.com/deepseek-ai/DeepEP.git
```
The pip-installed NVSHMEM package ships a versioned libnvshmem_host.so.<version>, but DeepEP's build expects the unversioned libnvshmem_host.so. Check which version is present and create a symlink:
```
$ ls ${NVSHMEM_DIR}/lib/libnvshmem_host.so*
$ ln -s ${NVSHMEM_DIR}/lib/libnvshmem_host.so.<version> ${NVSHMEM_DIR}/lib/libnvshmem_host.so
```
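When automating the install, the version suffix is not known in advance. The sketch below finds the versioned library and creates the unversioned symlink next to it. link_unversioned is a hypothetical helper; it assumes a single versioned file is present and, if there are several, picks the lexicographically last one.

```python
import os
from pathlib import Path


def link_unversioned(lib_dir, base="libnvshmem_host.so"):
    """Create <base> as a symlink to <base>.<version> in lib_dir.

    Returns the link path, or None if no versioned library was found.
    An existing file or symlink named <base> is left untouched.
    """
    lib_dir = Path(lib_dir)
    link = lib_dir / base
    if link.exists() or link.is_symlink():
        return link
    versioned = sorted(p for p in lib_dir.glob(base + ".*") if p.is_file())
    if not versioned:
        return None
    # Relative target, so the link survives if the directory is moved.
    link.symlink_to(versioned[-1].name)
    return link


if __name__ == "__main__":
    nvshmem_dir = os.environ.get("NVSHMEM_DIR")
    if nvshmem_dir:
        print("symlink:", link_unversioned(Path(nvshmem_dir) / "lib"))
    else:
        print("NVSHMEM_DIR is not set")
```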

Install DeepEP:

$ cd ~/DeepEP
$ NVSHMEM_DIR=${NVSHMEM_DIR} pip3 install --no-build-isolation -e .

Note: The --no-build-isolation flag is required. Without it, pip creates an isolated build environment that cannot find your installed PyTorch, and the build fails with 'No module named torch'.

Step 5: Configure GPU-to-HCA Mapping (Critical for Crusoe)

Crusoe's cloud-hypervisor does not fully virtualise the PCI tree, so NVSHMEM cannot auto-detect which GPU is closest to which InfiniBand NIC. Without explicit mapping, all GPU subprocesses default to the same NIC and inter-node RDMA bandwidth drops to ~7 GB/s instead of the expected ~65 GB/s.

Why per-process mapping is required: DeepEP's inter-node test spawns 8 GPU subprocesses per node via torch.multiprocessing.spawn. Each subprocess initialises its own NVSHMEM context with only 2 PEs (itself and its counterpart on the other node). Global environment variables like NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST cannot distinguish between these subprocesses: every local PE is PE 0 in its own context, so they all map to the first NIC. The fix is to set NVSHMEM_HCA_LIST per-subprocess to a single NIC before NVSHMEM initialises.

Apply the following patch on both nodes. This inserts one line before the Buffer() constructor in test_internode.py:
```
$ cd ~/DeepEP
$ sed -i 's/    buffer = deep_ep.Buffer(group,/    os.environ["NVSHMEM_HCA_LIST"] = f"mlx5_{local_rank + 1}:1"\n    buffer = deep_ep.Buffer(group,/' tests/test_internode.py
```
The effect is that GPU 0 uses mlx5_1, GPU 1 uses mlx5_2, ..., GPU 7 uses mlx5_8. Each NVSHMEM context sees only its assigned NIC, eliminating the single-NIC bottleneck. The same pattern should be applied in any application code that creates a deep_ep.Buffer on Crusoe: set NVSHMEM_HCA_LIST to the correct NIC for the local rank before the Buffer is constructed.
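In application code, the same mapping can be wrapped in a small helper. The sketch below mirrors the sed patch above. The helper names are hypothetical, and the mlx5_1 through mlx5_8 naming and port 1 are taken from the environment described in this guide, so adjust them if your NIC layout differs.

```python
import os


def hca_for_local_rank(local_rank: int, port: int = 1) -> str:
    """Map local GPU rank 0..7 to its NIC: rank 0 -> mlx5_1:1, ..., rank 7 -> mlx5_8:1."""
    if not 0 <= local_rank < 8:
        raise ValueError(f"local_rank must be in 0..7, got {local_rank}")
    return f"mlx5_{local_rank + 1}:{port}"


def pin_hca(local_rank: int) -> str:
    """Set NVSHMEM_HCA_LIST for this process. Call before deep_ep.Buffer is constructed."""
    hca = hca_for_local_rank(local_rank)
    os.environ["NVSHMEM_HCA_LIST"] = hca
    return hca


if __name__ == "__main__":
    for rank in range(8):
        print(f"GPU {rank} -> {hca_for_local_rank(rank)}")
```

Call pin_hca(local_rank) inside each spawned subprocess, not in the parent: the variable has to differ per process, and it has to be set before NVSHMEM initialises.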

Step 6: Run the Intra-Node Test (NVLink Only)

This test validates NVLink communication between 8 GPUs on a single node. It runs correctness checks for BF16 and FP8 dispatch/combine kernels, then measures bandwidth.
```
$ python3 ~/DeepEP/tests/test_intranode.py
```
All correctness tests should show passed. The tuning phase will sweep NVLink chunk sizes and report the best bandwidth for FP8 dispatch, BF16 dispatch, and combine.

Step 7: Run the Low-Latency Test (IBGDA)

This test exercises the low-latency IBGDA kernels on a single node. These kernels are designed for inference, where per-token latency matters: the GPU posts RDMA work requests directly to the InfiniBand HCA, bypassing the CPU entirely.
```
$ python3 ~/DeepEP/tests/test_low_latency.py
```
All correctness tests should show passed. The tuning phase will report bandwidth for dispatch and combine operations. This test runs on a single node and does not require the per-subprocess HCA mapping from Step 5.

Step 8: Run the Inter-Node Test (NVLink + InfiniBand RDMA)

This test exercises the full two-node communication stack: NVLink for intra-node traffic and InfiniBand RDMA for inter-node traffic, running simultaneously.

Prepare SSH access between nodes:

# On each node:
$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Exchange public keys manually:
# On node 1, copy the output of:
$ cat ~/.ssh/id_rsa.pub

# Then on node 2, append it:
$ echo "<node1-pubkey>" >> ~/.ssh/authorized_keys

# Repeat in the other direction.

Verify passwordless SSH works in both directions:

# From node 1:
$ ssh <NODE2_IP> hostname

# From node 2:
$ ssh <NODE1_IP> hostname

Load HPC-X (provides mpirun):

$ source /opt/hpcx/hpcx-init.sh
$ hpcx_load

Run the test (replace <NODE1_IP> and <NODE2_IP> with your node IPs):

$ mpirun --allow-run-as-root -np 2 \
    --host <NODE1_IP>:1,<NODE2_IP>:1 \
    -x MASTER_ADDR=<NODE1_IP> \
    -x MASTER_PORT=8361 \
    -x WORLD_SIZE=2 \
    -x CUDA_HOME=/usr/local/cuda \
    -x PATH \
    -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem/lib:/home/ubuntu/.local/lib/python3.10/site-packages/torch/lib \
    -x NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem \
    --mca btl_tcp_if_include eth0 \
    --mca btl self,tcp \
    --mca pml ob1 \
    bash -c 'RANK=$OMPI_COMM_WORLD_RANK python3 /home/ubuntu/DeepEP/tests/test_internode.py'

Note: The GPU-to-HCA mapping is handled inside test_internode.py (patched in Step 5), not via mpirun -x flags. Global NVSHMEM mapping variables like NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST do not work here because each GPU subprocess creates its own 2-PE NVSHMEM context, and all local PEs would map to the same NIC. See Step 5 for details.
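If you launch the test from a script, substituting the node IPs by hand is error-prone. The sketch below assembles the same mpirun invocation as an argument list. build_mpirun is a hypothetical helper, and the paths assume the ubuntu user and Python 3.10 layout used throughout this guide.

```python
import shlex


def build_mpirun(node1_ip, node2_ip, home="/home/ubuntu", master_port=8361):
    """Return the mpirun argument list for the two-node test_internode.py run."""
    site = f"{home}/.local/lib/python3.10/site-packages"
    nvshmem_dir = f"{site}/nvidia/nvshmem"
    ld_path = ":".join(["/usr/local/cuda/lib64", f"{nvshmem_dir}/lib", f"{site}/torch/lib"])
    return [
        "mpirun", "--allow-run-as-root", "-np", "2",
        "--host", f"{node1_ip}:1,{node2_ip}:1",
        "-x", f"MASTER_ADDR={node1_ip}",
        "-x", f"MASTER_PORT={master_port}",
        "-x", "WORLD_SIZE=2",
        "-x", "CUDA_HOME=/usr/local/cuda",
        "-x", "PATH",
        "-x", f"LD_LIBRARY_PATH={ld_path}",
        "-x", f"NVSHMEM_DIR={nvshmem_dir}",
        "--mca", "btl_tcp_if_include", "eth0",
        "--mca", "btl", "self,tcp",
        "--mca", "pml", "ob1",
        "bash", "-c",
        f"RANK=$OMPI_COMM_WORLD_RANK python3 {home}/DeepEP/tests/test_internode.py",
    ]


if __name__ == "__main__":
    # Placeholders are printed as-is; pass real IPs to get a runnable command.
    print(shlex.join(build_mpirun("<NODE1_IP>", "<NODE2_IP>")))
```

Passing the list to subprocess.run avoids shell quoting problems around the bash -c argument. Run it from a shell where HPC-X has already been loaded so that mpirun is on PATH.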

Troubleshooting

pip3: command not found

Fresh Crusoe images may not include pip. Install with sudo apt install -y python3-pip.

apt install returns 404 errors

Run sudo apt update first to refresh package lists.

NVSHMEM_DIR detection fails

If nvidia.nvshmem.__file__ returns None, manually set export NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem.

'cannot find -l:libnvshmem_host.so' during DeepEP build

The pip package only ships the versioned library. Create the symlink as shown in Step 4.

'No module named torch' during pip install -e .

Use --no-build-isolation to let the build see your existing PyTorch installation.

Inter-node RDMA bandwidth stuck at ~7 GB/s instead of ~65 GB/s

All GPU subprocesses are likely routing through a single NIC. Verify by checking per-NIC transmit counters during a test run:
```
$ for i in 1 2 3 4 5 6 7 8; do echo -n "mlx5_$i: "; cat /sys/class/infiniband/mlx5_$i/ports/1/counters/port_xmit_data; done
```
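The counters above are cumulative, so a single reading can be misleading. A sketch that samples them twice and compares the deltas is shown here. The function names and the 90% threshold are assumptions for illustration. port_xmit_data counts in units of 4 bytes, so values are multiplied by 4 to get bytes.

```python
import time
from pathlib import Path

NICS = [f"mlx5_{i}" for i in range(1, 9)]


def read_xmit_bytes(root="/sys/class/infiniband", nics=NICS):
    """Read port_xmit_data for each NIC and convert from 4-byte units to bytes."""
    counters = {}
    for nic in nics:
        path = Path(root) / nic / "ports" / "1" / "counters" / "port_xmit_data"
        counters[nic] = int(path.read_text()) * 4
    return counters


def xmit_deltas(before, after):
    """Bytes transmitted per NIC between two samples."""
    return {nic: after[nic] - before[nic] for nic in before}


def looks_single_nic(deltas, threshold=0.9):
    """True if one NIC carried at least `threshold` of all transmitted bytes."""
    total = sum(deltas.values())
    return total > 0 and max(deltas.values()) / total >= threshold


if __name__ == "__main__":
    if Path("/sys/class/infiniband/mlx5_1").exists():
        before = read_xmit_bytes()
        time.sleep(5)  # sample while the inter-node test is running
        deltas = xmit_deltas(before, read_xmit_bytes())
        for nic, sent in deltas.items():
            print(f"{nic}: {sent / 1e9:.2f} GB in 5 s")
        print("single-NIC bottleneck suspected:", looks_single_nic(deltas))
```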
If only one NIC shows significantly higher counters, the per-subprocess NVSHMEM_HCA_LIST patch from Step 5 is not applied. Global env vars (NVSHMEM_HCA_PE_MAPPING, NVSHMEM_HCA_LIST) do not fix this; the mapping must be set per-GPU-subprocess before deep_ep.Buffer() is constructed.

'cudaErrorNotPermitted' (error 800) on cudaHostRegister

The NVIDIA driver does not support the cudaHostRegisterIoMemory flag needed for IBGDA. Verify that /etc/modprobe.d/nvidia.conf is correctly configured and the node has been rebooted. If using a container with a CUDA version newer than the host driver (e.g., CUDA 13 container on R570 host), this mismatch can also cause the error.

SSH breaks after scripted setup / sshd fails to start after reboot

The initramfs was not written to disk before the reboot. Always run sudo update-initramfs -u && sync before sudo reboot in scripts.

Inter-node test hangs

Check that port 8361 is open in your Crusoe firewall rules between both nodes. Verify passwordless SSH works in both directions.
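A quick TCP probe from node 2 to node 1 can help tell a firewall problem apart from a test that simply has not started yet. The sketch below uses only the standard library. probe_tcp is a hypothetical helper, and the interpretation of its results is a rule of thumb rather than a guarantee.

```python
import socket
import sys


def probe_tcp(host, port=8361, timeout=3.0):
    """Try a TCP connection and classify the outcome.

    'open'    -> something is listening and reachable
    'refused' -> the host is reachable but nothing is listening on the port
    'timeout' -> no reply; often a firewall dropping the packets
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except (socket.timeout, TimeoutError):
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    if len(sys.argv) > 1:
        print(f"{sys.argv[1]}:8361 ->", probe_tcp(sys.argv[1]))
    else:
        print("usage: python3 probe.py <NODE1_IP>")
```

Port 8361 only has a listener while the test is running on the MASTER_ADDR node, so a "refused" result outside a run suggests the network path is open, while "timeout" points back at the firewall rule.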

Additional Resources

Related to

how-to performance deepseek DeepEP
