Last Updated: March 31, 2026
Introduction
Prerequisites
- Two Crusoe Cloud SXM IB instances with 8 GPUs per node (e.g., h200-141gb-sxm-ib.8x)
- Image: ubuntu22.04-nvidia-sxm-docker, latest
- SSH access to both nodes
- Firewall rule allowing TCP port 8361 between the two nodes (used by PyTorch distributed initialisation)
- Passwordless SSH between both nodes (for inter-node test only)
| Component | Version |
|---|---|
| NVIDIA Driver | 570.x (R570) |
| CUDA | 12.8 |
| GPUs per node | 8× NVIDIA H200 SXM (141 GB HBM3e) |
| InfiniBand NICs | 8× ConnectX-7 per node (mlx5_1 through mlx5_8) |
| GDRCopy | Pre-installed (gdrdrv module loaded) |
Step-by-Step Instructions
Step 1: Prepare the System (Run the following on both nodes.)
- Update package lists and install build dependencies:
$ sudo apt update
$ sudo apt install -y cmake ninja-build python3-pip
- Verify the pre-installed components:
$ nvidia-smi
$ ls /usr/local/cuda/bin/nvcc
$ lsmod | grep gdrdrv
You should see 8 GPUs, the CUDA 12.8 toolkit, and the gdrdrv kernel module loaded.
- Add CUDA to your shell environment:
$ echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
$ echo 'export PATH="${CUDA_HOME}/bin:${PATH}"' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"' >> ~/.bashrc
$ source ~/.bashrc
$ nvcc --version
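The verification in this step can also be scripted when preparing several nodes. The following is a minimal Python sketch, not part of the Crusoe image or DeepEP: the function names are hypothetical, and it assumes that `nvidia-smi -L` prints one `GPU <n>: ...` line per device and that `lsmod` lists module names in its first column.

```python
import subprocess


def count_gpus(nvidia_smi_l_output: str) -> int:
    """Count devices in `nvidia-smi -L` output (one 'GPU <n>: ...' line each)."""
    return sum(1 for line in nvidia_smi_l_output.splitlines()
               if line.startswith("GPU "))


def module_loaded(lsmod_output: str, name: str) -> bool:
    """Return True if `name` appears in the first column of `lsmod` output."""
    rows = lsmod_output.splitlines()[1:]  # skip the "Module Size Used by" header
    return any(row.split()[0] == name for row in rows if row.strip())


def check_node(expected_gpus: int = 8) -> list:
    """Run both commands on this node; return a list of problems (empty = OK)."""
    problems = []
    gpus = subprocess.run(["nvidia-smi", "-L"],
                          capture_output=True, text=True).stdout
    found = count_gpus(gpus)
    if found != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {found}")
    mods = subprocess.run(["lsmod"], capture_output=True, text=True).stdout
    if not module_loaded(mods, "gdrdrv"):
        problems.append("gdrdrv kernel module is not loaded")
    return problems
```

Calling `check_node()` on each node and printing the returned list gives a quick pass/fail summary before moving on to the driver configuration.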
Step 2: Configure NVIDIA Driver for IBGDA
IBGDA needs two NVIDIA driver settings: NVreg_EnableStreamMemOPs=1, which enables CUDA stream-ordered memory operations for GPU-initiated RDMA, and PeerMappingOverride=1, which allows peer memory mappings across GPUs for GPUDirect RDMA.
- Create the modprobe config. (Run the following on both nodes.)
$ echo 'options nvidia NVreg_EnableStreamMemOPs=1 NVreg_RegistryDwords="PeerMappingOverride=1;"' | sudo tee /etc/modprobe.d/nvidia.conf
- Update initramfs and reboot:
$ sudo update-initramfs -u
$ sync
$ sudo reboot
Warning: If you are automating this setup with a script, the sync command is critical. There is a known race condition where the system may reboot before update-initramfs finishes writing to disk, resulting in a corrupted initramfs that prevents sshd from starting. When running commands manually (step by step), this is not an issue.
- After reboot, verify the parameters took effect:
$ cat /proc/driver/nvidia/params | grep -E "EnableStreamMemOPs|PeerMappingOverride"
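For scripted setups, the post-reboot check can be turned into a pass/fail test. This is a rough sketch under the assumption that /proc/driver/nvidia/params contains one `Name: value` pair per line, with PeerMappingOverride appearing inside the quoted RegistryDwords value. The helper names are illustrative only.

```python
def parse_nvidia_params(text: str) -> dict:
    """Parse 'Name: value' lines into a dict, stripping surrounding quotes."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines without a colon
            params[key.strip()] = value.strip().strip('"')
    return params


def ibgda_params_ok(params: dict) -> bool:
    """True when both settings from /etc/modprobe.d/nvidia.conf are active."""
    stream_ops = params.get("EnableStreamMemOPs") == "1"
    peer_map = "PeerMappingOverride=1" in params.get("RegistryDwords", "")
    return stream_ops and peer_map


def check_params_file(path: str = "/proc/driver/nvidia/params") -> bool:
    """Read the live driver parameters and report whether IBGDA is configured."""
    with open(path) as f:
        return ibgda_params_ok(parse_nvidia_params(f.read()))
```

If `check_params_file()` returns False after the reboot, the modprobe config was most likely not picked up by the initramfs.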
Step 3: Install PyTorch and NVSHMEM
- First, check your system's CUDA version:
$ nvidia-smi | grep "CUDA Version"
- Install PyTorch with the matching CUDA variant. For example, if nvidia-smi shows CUDA 12.8:
$ pip3 install torch numpy --index-url https://download.pytorch.org/whl/cu128
Note: The default pip3 install torch may pull a newer CUDA version (e.g., 13.0) that does not match your system's CUDA toolkit. This will cause DeepEP's build to fail with a CUDA version mismatch error. Always use --index-url to match the version shown by nvidia-smi.
- PyTorch automatically pulls nvidia-nvshmem as a dependency, so no manual NVSHMEM source build is required. Verify it was installed and note the path:
$ pip3 list | grep nvshmem
$ python3 -c "import nvidia.nvshmem; print(nvidia.nvshmem.__path__)"
- Set NVSHMEM paths persistently on both nodes:
$ cat >> ~/.bashrc << 'EOF'
export NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem
export LD_LIBRARY_PATH="${NVSHMEM_DIR}/lib:${LD_LIBRARY_PATH}"
EOF
$ source ~/.bashrc
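The NVSHMEM_DIR path above is hard-coded for the ubuntu user and Python 3.10. If your layout differs, the directory can be looked up instead. The sketch below is one possible approach, using the standard importlib machinery; `package_dir` is a hypothetical helper name, and it reads the package's search locations rather than `__file__`, which can be None for namespace-style packages.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_dir(name: str) -> Optional[Path]:
    """Return the on-disk directory of an importable package, or None."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. "nvidia") is not installed.
        return None
    if spec is None or not spec.submodule_search_locations:
        return None  # not found, or a plain module rather than a package
    return Path(next(iter(spec.submodule_search_locations)))


def nvshmem_dir() -> Optional[Path]:
    """Directory to export as NVSHMEM_DIR, if the pip package is installed."""
    return package_dir("nvidia.nvshmem")
```

Printing `nvshmem_dir()` on each node gives the value to place in the export line, and a None result indicates the nvidia-nvshmem package is not visible to that Python interpreter.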
Step 4: Install DeepEP
- Clone the repository:
$ cd ~
$ git clone https://github.com/deepseek-ai/DeepEP.git
- The pip-installed NVSHMEM package ships a versioned libnvshmem_host.so.<version>, but DeepEP's build expects the unversioned libnvshmem_host.so. Check which version is present and create a symlink:
$ ls ${NVSHMEM_DIR}/lib/libnvshmem_host.so*
$ ln -s ${NVSHMEM_DIR}/lib/libnvshmem_host.so.<version> ${NVSHMEM_DIR}/lib/libnvshmem_host.so
- Install DeepEP:
$ cd ~/DeepEP
$ NVSHMEM_DIR=${NVSHMEM_DIR} pip3 install --no-build-isolation -e .
Note: The --no-build-isolation flag is required. Without it, pip creates an isolated build environment that cannot find your installed PyTorch, and the build fails with 'No module named torch'.
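When the setup is automated, the symlink step earlier in this section is awkward because the version suffix is not known in advance. The following sketch shows one way to discover the versioned file and link it. The function name is hypothetical, and picking the lexicographically last match is an assumption that only matters if more than one versioned file is present.

```python
from pathlib import Path


def link_unversioned(lib_dir, stem: str = "libnvshmem_host.so") -> Path:
    """Create <stem> as a symlink to <stem>.<version> inside lib_dir."""
    lib_dir = Path(lib_dir)
    target = lib_dir / stem
    if target.exists() or target.is_symlink():
        return target  # already present, leave it untouched
    versioned = sorted(lib_dir.glob(stem + ".*"))
    if not versioned:
        raise FileNotFoundError(f"no {stem}.<version> found in {lib_dir}")
    # Relative link, so it stays valid if the site-packages tree is moved.
    target.symlink_to(versioned[-1].name)
    return target
```

Passing `Path(os.environ["NVSHMEM_DIR"]) / "lib"` as `lib_dir` mirrors the ls and ln -s commands, and the call is safe to repeat.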
Step 5: Configure GPU-to-HCA Mapping (Critical for Crusoe)
Why per-process mapping is required: DeepEP's inter-node test spawns 8 GPU subprocesses per node via torch.multiprocessing.spawn. Each subprocess initialises its own NVSHMEM context with only 2 PEs (itself and its counterpart on the other node). Global environment variables such as NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST cannot distinguish between these subprocesses: every local PE is PE 0 in its own context, so they all map to the first NIC. The fix is to set NVSHMEM_HCA_LIST per subprocess to a single NIC before NVSHMEM initialises.
- Apply the following patch on both nodes. It inserts one line before the Buffer() constructor in test_internode.py, reusing that line's indentation:
$ cd ~/DeepEP
$ sed -i -E 's/^([[:space:]]*)buffer = deep_ep\.Buffer\(group,/\1os.environ["NVSHMEM_HCA_LIST"] = f"mlx5_{local_rank + 1}:1"\n\1buffer = deep_ep.Buffer(group,/' tests/test_internode.py
The effect is that GPU 0 uses mlx5_1, GPU 1 uses mlx5_2, ..., GPU 7 uses mlx5_8. Each NVSHMEM context sees only its assigned NIC, eliminating the single-NIC bottleneck. The same pattern should be applied in any application code that creates a deep_ep.Buffer on Crusoe: set NVSHMEM_HCA_LIST to the correct NIC for the local rank before the Buffer is constructed.
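For application code, the same mapping can be wrapped in a small helper. This is an illustrative sketch only: `hca_for_local_rank` and `pin_nvshmem_hca` are made-up names, and the defaults encode the layout described in this guide (NICs named mlx5_1 through mlx5_8, port 1), which may not hold on other instance types.

```python
import os


def hca_for_local_rank(local_rank: int, first_index: int = 1,
                       port: int = 1, prefix: str = "mlx5_") -> str:
    """Map a local GPU rank to its NIC, e.g. rank 0 -> 'mlx5_1:1'."""
    if local_rank < 0:
        raise ValueError("local_rank must be non-negative")
    return f"{prefix}{local_rank + first_index}:{port}"


def pin_nvshmem_hca(local_rank: int, env=os.environ) -> str:
    """Set NVSHMEM_HCA_LIST for this process; call before creating the Buffer."""
    value = hca_for_local_rank(local_rank)
    env["NVSHMEM_HCA_LIST"] = value
    return value
```

In a spawned worker, `pin_nvshmem_hca(local_rank)` would be called as the first statement, ahead of the `deep_ep.Buffer(...)` construction, so that NVSHMEM reads the variable when it initialises.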
Step 6: Run the Intra-Node Test (NVLink Only)
- This test validates NVLink communication between 8 GPUs on a single node. It runs correctness checks for BF16 and FP8 dispatch/combine kernels, then measures bandwidth.
$ python3 ~/DeepEP/tests/test_intranode.py
All correctness tests should show passed. The tuning phase will sweep NVLink chunk sizes and report the best bandwidth for FP8 dispatch, BF16 dispatch, and combine.
Step 7: Run the Low-Latency Test (IBGDA)
- This test exercises the low-latency IBGDA kernels on a single node. These kernels are designed for inference, where per-token latency matters: the GPU posts RDMA work requests directly to the InfiniBand HCA, bypassing the CPU entirely.
$ python3 ~/DeepEP/tests/test_low_latency.py
All correctness tests should show passed. The tuning phase will report bandwidth for dispatch and combine operations. This test runs on a single node and does not require the per-subprocess HCA mapping from Step 5.
Step 8: Run the Inter-Node Test (NVLink + InfiniBand RDMA)
- This test exercises the full two-node communication stack: NVLink for intra-node traffic and InfiniBand RDMA for inter-node traffic, running simultaneously.
- Prepare SSH access between nodes:
# On each node:
$ ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# Exchange public keys manually:
# On node 1, copy the output of:
$ cat ~/.ssh/id_rsa.pub
# Then on node 2, append it:
$ echo "<node1-pubkey>" >> ~/.ssh/authorized_keys
# Repeat in the other direction.
- Verify passwordless SSH works in both directions:
# From node 1:
$ ssh <NODE2_IP> hostname
# From node 2:
$ ssh <NODE1_IP> hostname
- Load HPC-X (provides mpirun):
$ source /opt/hpcx/hpcx-init.sh
$ hpcx_load
- Run the test (replace <NODE1_IP> and <NODE2_IP> with your node IPs):
$ mpirun --allow-run-as-root -np 2 \
    --host <NODE1_IP>:1,<NODE2_IP>:1 \
    -x MASTER_ADDR=<NODE1_IP> \
    -x MASTER_PORT=8361 \
    -x WORLD_SIZE=2 \
    -x CUDA_HOME=/usr/local/cuda \
    -x PATH \
    -x LD_LIBRARY_PATH=/usr/local/cuda/lib64:/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem/lib:/home/ubuntu/.local/lib/python3.10/site-packages/torch/lib \
    -x NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem \
    --mca btl_tcp_if_include eth0 \
    --mca btl self,tcp \
    --mca pml ob1 \
    bash -c 'RANK=$OMPI_COMM_WORLD_RANK python3 /home/ubuntu/DeepEP/tests/test_internode.py'
Note: The GPU-to-HCA mapping is handled inside test_internode.py (patched in Step 5), not via mpirun -x flags. Global NVSHMEM mapping variables like NVSHMEM_HCA_PE_MAPPING or NVSHMEM_HCA_LIST do not work here because each GPU subprocess creates its own 2-PE NVSHMEM context, and all local PEs would map to the same NIC. See Step 5 for details.
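Because the run depends on TCP port 8361 being reachable between the nodes, a quick probe from node 2 towards node 1 can save a hung launch. The sketch below uses only the Python standard library. The function name and the interpretation of each result are assumptions: nothing listens on 8361 until the test starts, so "refused" usually means packets reached the host, while "timeout" more often points to a firewall silently dropping traffic.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Try a TCP connect and classify the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # something is listening and accepted
    except ConnectionRefusedError:
        return "refused"           # host reachable, no listener on the port
    except (socket.timeout, TimeoutError):
        return "timeout"           # no answer, possibly filtered
    except OSError:
        return "unreachable"       # routing or name resolution problem
```

For example, `probe_tcp("<NODE1_IP>", 8361)` run on node 2 (with the placeholder replaced by the real address) gives a first hint before consulting the firewall rules.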
Troubleshooting
- pip3: command not found
  Fresh Crusoe images may not include pip. Install with sudo apt install -y python3-pip.
- apt install returns 404 errors
  Run sudo apt update first to refresh package lists.
- NVSHMEM_DIR detection fails
  If nvidia.nvshmem.__file__ returns None, manually set export NVSHMEM_DIR=/home/ubuntu/.local/lib/python3.10/site-packages/nvidia/nvshmem.
- 'cannot find -l:libnvshmem_host.so' during DeepEP build
  The pip package only ships the versioned library. Create the symlink as shown in Step 4.
- 'No module named torch' during pip install -e .
  Use --no-build-isolation to let the build see your existing PyTorch installation.
- Inter-node RDMA bandwidth stuck at ~7 GB/s instead of ~65 GB/s
  All GPU subprocesses are likely routing through a single NIC. Verify by checking per-NIC transmit counters during a test run:
  $ for i in 1 2 3 4 5 6 7 8; do echo -n "mlx5_$i: "; cat /sys/class/infiniband/mlx5_$i/ports/1/counters/port_xmit_data; done
  If only one NIC shows significantly higher counters, the per-subprocess NVSHMEM_HCA_LIST patch from Step 5 is not applied. Global env vars (NVSHMEM_HCA_PE_MAPPING, NVSHMEM_HCA_LIST) do not fix this; the mapping must be set per GPU subprocess before deep_ep.Buffer() is constructed.
- 'cudaErrorNotPermitted' (error 800) on cudaHostRegister
  The NVIDIA driver does not support the cudaHostRegisterIoMemory flag needed for IBGDA. Verify that /etc/modprobe.d/nvidia.conf is correctly configured and the node has been rebooted. If using a container with a CUDA version newer than the host driver (e.g., CUDA 13 container on R570 host), this mismatch can also cause the error.
- SSH breaks after scripted setup / sshd fails to start after reboot
  The initramfs was not written to disk before the reboot. Always run sudo update-initramfs -u && sync before sudo reboot in scripts.
- Inter-node test hangs
  Check that port 8361 is open in your Crusoe firewall rules between both nodes. Verify passwordless SSH works in both directions.
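For the single-NIC bandwidth problem above, comparing counters by eye is error-prone because the values are cumulative. The following sketch takes two snapshots and reports whether one NIC carried most of the traffic in between. The helper names and the 50% threshold are arbitrary choices for illustration, and the sysfs layout is the one used by the shell loop in that entry.

```python
from pathlib import Path


def read_xmit_counters(nics, root="/sys/class/infiniband") -> dict:
    """Read port_xmit_data (port 1) for each NIC name into a dict."""
    counters = {}
    for nic in nics:
        path = Path(root) / nic / "ports" / "1" / "counters" / "port_xmit_data"
        counters[nic] = int(path.read_text())
    return counters


def xmit_delta(before: dict, after: dict) -> dict:
    """Traffic sent per NIC between two snapshots (same counter units)."""
    return {nic: after[nic] - before[nic] for nic in before}


def dominant_nic(delta: dict, share: float = 0.5):
    """Return the NIC that sent more than `share` of all traffic, else None."""
    total = sum(delta.values())
    if total <= 0:
        return None
    nic, value = max(delta.items(), key=lambda kv: kv[1])
    return nic if value / total > share else None
```

A typical use would be to snapshot `read_xmit_counters([f"mlx5_{i}" for i in range(1, 9)])` before and after a test run; a non-None result from `dominant_nic(xmit_delta(before, after))` suggests the per-subprocess mapping is not in effect.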