Introduction
Crusoe Managed Slurm provides managed HPC cluster orchestration on Crusoe Cloud, backed by GPU-optimized infrastructure with topology-aware scheduling, shared storage, and automatic hardware remediation via AutoClusters. Container workloads on Slurm workers are launched through Pyxis, which integrates srun --container-image=... directly with the scheduler.
Before running real distributed training on a new cluster, it's worth validating the InfiniBand fabric end-to-end. The standard test is an NCCL all-reduce across multiple GPU nodes, which exercises both intra-node NVLink and inter-node IB and reports a single aggregate bus bandwidth number.
This article walks through that test on Managed Slurm. The procedure works for any IB-enabled GPU instance type on Crusoe — H100, H200, or A100 80GB SXM IB — with a single environment variable change.
Prerequisites
- Crusoe Managed Slurm Cluster With at Least 2 IB-Enabled GPU Nodes (H100, H200, or A100 80GB SXM IB)
- Kubeconfig Access to the Underlying CMK Cluster
- Access to the Login Pod (via SSH)
- GPU Operator and Network Operator Enabled on CMK (Default for IB-Enabled Node Pools)
- sinfo Confirms gpu:<model>:8 on Each Worker Node
Instructions
Step 1: Verify GPU Visibility in Slurm
Before running anything, confirm Slurm sees the expected number of GPUs on every worker. From the login pod:
sinfo -o "%N %c %m %G"
You should see one line per worker with the GRES column showing gpu:h100:8, gpu:h200:8, or gpu:a100:8 depending on instance type. If GRES shows (null), the GPU Operator or Slurm-side GRES wiring has not converged yet — resolve that before proceeding.
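The GRES check can also be scripted rather than read by eye. The following is a minimal sketch, assuming the two-column output of sinfo -h -o "%N %G" (-h suppresses the header line). The function name check_gres is hypothetical and not part of Slurm.

```shell
#!/bin/bash
# check_gres: hypothetical helper, not a Slurm command.
# Reads "NODELIST GRES" lines on stdin (as printed by: sinfo -h -o "%N %G"),
# prints every entry whose GRES column does not advertise exactly 8 GPUs,
# and returns non-zero if any such entry is found.
check_gres() {
  awk '
    $2 !~ /^gpu:[^:]+:8($|[^0-9])/ {
      print $1 " has unexpected GRES: " $2
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Usage on the login pod (assumes sinfo is on PATH):
#   sinfo -h -o "%N %G" | check_gres && echo "all workers report 8 GPUs"
```

A (null) GRES value fails the pattern, so the helper reports the same condition this step asks you to resolve before proceeding.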
Step 2: Smoke-Test Pyxis and Single-GPU Allocation
Verify Slurm can hand you a GPU through Pyxis with a one-shot srun from the login pod:
srun --nodes=1 --gres=gpu:1 \
--container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
nvidia-smi -L
The first run pulls the nccl-tests container image onto the worker (2–5 minutes); subsequent runs start in seconds. You should see at least one GPU listed.
ℹ️ Note: nvidia-smi may list more GPUs than you requested because NVML reports all device files mounted into the container. Actual GPU isolation for CUDA workloads (including NCCL) is enforced through CUDA_VISIBLE_DEVICES, which Slurm sets correctly based on --gres.
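To check the isolation that actually applies to CUDA, you can inspect CUDA_VISIBLE_DEVICES inside an allocation instead of relying on nvidia-smi. The sketch below assumes the variable holds a comma-separated device list. The helper name visible_gpu_count is hypothetical.

```shell
#!/bin/bash
# visible_gpu_count: hypothetical helper.
# Counts the comma-separated entries in a CUDA_VISIBLE_DEVICES-style string.
# An empty string yields 0.
visible_gpu_count() {
  printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Usage on the login pod (assumes the allocation exports CUDA_VISIBLE_DEVICES):
#   srun --nodes=1 --gres=gpu:1 bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
# Pass the printed value to visible_gpu_count; for --gres=gpu:1 the
# expectation is a single entry.
```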
Step 3: Create the NCCL Sbatch Script
Save the following as /home/nccl-test.sh on the login pod. /home is the shared PVC visible across login and worker pods, so the file is accessible from everywhere.
#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --exclusive
#SBATCH --output=/home/nccl-runs/nccl-%j-%N.out
#SBATCH --error=/home/nccl-runs/nccl-%j-%N.err
#SBATCH --time=00:30:00
mkdir -p /home/nccl-runs
export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
export NCCL_IB_HCA=^mlx5_0:1
export NCCL_IB_MERGE_VFS=0
export NCCL_ALGO=NVLSTree
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8
srun --mpi=pmix \
--container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
--container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo \
/opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
Make it executable and confirm the contents:
chmod +x /home/nccl-test.sh
cat /home/nccl-test.sh
ℹ️ Note: For H200 and A100 SXM IB instance types, replace the NCCL_TOPO_FILE path with the matching topology XML in /etc/crusoe/nccl_topo/. The directory is mounted into worker pods automatically; running ls /etc/crusoe/nccl_topo/ from any slurmd pod will list the available files.
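Selecting the topology file can be sketched as a small lookup. This assumes the other XML files follow the same naming pattern as h100-80gb-sxm-ib-cloud-hypervisor.xml, that is, the filename begins with the GPU model. That pattern is an assumption, so confirm it against the actual directory listing. The helper name find_topo_file is hypothetical.

```shell
#!/bin/bash
# find_topo_file: hypothetical helper.
# Prints the path of the first XML file in a directory whose name starts with
# the given GPU model prefix (case-insensitive); returns 1 if none matches.
# Assumption: topology filenames begin with the model name, as the H100 file does.
find_topo_file() {
  local model="$1"
  local dir="${2:-/etc/crusoe/nccl_topo}"
  local match
  match=$(ls "$dir" 2>/dev/null | grep -i "^${model}.*\.xml$" | head -n 1)
  [ -n "$match" ] || return 1
  echo "${dir%/}/${match}"
}

# Usage inside the sbatch script (the model prefix is supplied by you):
#   export NCCL_TOPO_FILE="$(find_topo_file h200)"
```

If the function returns nothing, fall back to listing the directory and setting NCCL_TOPO_FILE by hand.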
What the key flags do:
- --nodes=2 --ntasks-per-node=8 --gres=gpu:8: 16 total MPI ranks, one per GPU, spread across 2 worker nodes.
- --cpus-per-task=22: 176 host vCPUs divided across 8 tasks gives each rank a fair share of CPU resources for NCCL transport threads.
- --exclusive: prevents another job from co-scheduling onto the same node and contending for the IB HCAs.
- --mpi=pmix: Slurm's native MPI bootstrap (replaces mpirun). PMIx ships in the Crusoe slurmd container image.
- --container-image=...: Pyxis pulls and launches this container on each worker. The nccl-tests image already has NCCL, CUDA, and OpenMPI built in.
- --container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo: exposes Crusoe's host-rendered NCCL topology XML into the container so NCCL reads the correct NVLink/IB layout.
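The sizing arithmetic behind the first two flags can be written out as follows. This is a minimal sketch using integer shell arithmetic. The function names are hypothetical, and the 176-vCPU figure is the one quoted above for this instance type.

```shell
#!/bin/bash
# Hypothetical helpers for the sizing arithmetic; integer math only.

# total_ranks NODES TASKS_PER_NODE -> number of MPI ranks in the job
total_ranks() {
  echo $(( $1 * $2 ))
}

# cpus_per_task HOST_VCPUS TASKS_PER_NODE -> vCPUs per rank (rounded down)
cpus_per_task() {
  echo $(( $1 / $2 ))
}

total_ranks 2 8       # 2 nodes x 8 tasks per node
cpus_per_task 176 8   # 176 host vCPUs shared by 8 tasks
```

When scaling the test to more nodes, only --nodes changes; the per-node values stay the same as long as each worker still has 8 GPUs.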
Step 4: Submit and Monitor
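mkdir -p /home/nccl-runs   # Slurm opens the --output/--error files before the script runs, so the directory must already exist
sbatch /home/nccl-test.sh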
You'll see Submitted batch job <id>. Track it:
squeue
tail -f /home/nccl-runs/nccl-<id>-*.out
The benchmark takes 1–3 minutes on a warm container (longer on first run while the image pulls).
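If you want to script submission and monitoring, the job ID can be captured from sbatch's confirmation line. The sketch below assumes the "Submitted batch job <id>" format shown in this step. The helper name job_id_from_sbatch is hypothetical.

```shell
#!/bin/bash
# job_id_from_sbatch: hypothetical helper.
# Extracts the job ID (fourth field) from sbatch's
# "Submitted batch job <id>" line read on stdin.
job_id_from_sbatch() {
  awk '/^Submitted batch job/ { print $4 }'
}

# Usage on the login pod (not executed here; assumes sbatch and squeue on PATH):
#   jobid=$(sbatch /home/nccl-test.sh | job_id_from_sbatch)
#   # Poll until the job no longer appears in the queue.
#   while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do sleep 10; done
#   cat /home/nccl-runs/nccl-"$jobid"-*.out
```

sbatch also accepts --parsable, which prints the job ID without the surrounding text and makes the awk step unnecessary.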
Example
A real run on a 2-node H100 80GB SXM IB cluster (16 GPUs total) using the sbatch script above:
# Using devices
# Rank 0 ... on slurm-cluster-1-nodeset-1-1 device 0 NVIDIA H100 80GB HBM3
# Rank 1 ... on slurm-cluster-1-nodeset-1-1 device 1 NVIDIA H100 80GB HBM3
# ...
# Rank 15 ... on slurm-cluster-1-nodeset-1-0 device 7 NVIDIA H100 80GB HBM3
#
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
1048576 262144 float sum -1 121.1 8.66 16.24 0 82.18 12.76 23.92 0
8388608 2097152 float sum -1 153.5 54.64 102.46 0 151.0 55.56 104.17 0
67108864 16777216 float sum -1 474.2 141.51 265.34 0 473.5 141.72 265.72 0
268435456 67108864 float sum -1 1292.4 207.71 389.46 0 1305.7 205.59 385.48 0
1073741824 268435456 float sum -1 4513.5 237.89 446.05 0 4529.5 237.06 444.48 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 224.86
The 224.86 GB/s avg bus bandwidth confirms the IB fabric is delivering expected H100 SXM performance for collective operations. This cluster is ready for distributed training workloads such as multi-node FSDP, DeepSpeed, or NeMo runs.
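The summary line can be checked from a script, and the busbw column can be cross-checked against algbw. For all-reduce, nccl-tests reports busbw = algbw × 2(n−1)/n, where n is the number of ranks. The following is a minimal sketch. The function names are hypothetical, and the pass threshold is supplied by the caller rather than taken from this article.

```shell
#!/bin/bash
# avg_busbw: hypothetical helper. Prints the number from the
# "# Avg bus bandwidth : <n>" summary line of an nccl-tests log on stdin.
avg_busbw() {
  awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2 }'
}

# busbw_meets MEASURED THRESHOLD: succeeds if MEASURED >= THRESHOLD.
# The threshold is the caller's choice; this sketch does not define
# what counts as acceptable for a given instance type.
busbw_meets() {
  awk -v got="$1" -v want="$2" 'BEGIN { exit !(got + 0 >= want + 0) }'
}

# allreduce_busbw ALGBW NRANKS: converts algorithm bandwidth to bus bandwidth
# for all-reduce using busbw = algbw * 2 * (n - 1) / n.
allreduce_busbw() {
  awk -v algbw="$1" -v n="$2" 'BEGIN { printf "%.2f\n", algbw * 2 * (n - 1) / n }'
}

# Usage (paths follow the sbatch script in this article):
#   measured=$(avg_busbw < /home/nccl-runs/nccl-<id>-<node>.out)
#   busbw_meets "$measured" "<your threshold>" && echo "fabric OK"
```

With 16 ranks the factor is 2 × 15 / 16 = 1.875, so the 1 GiB row's algbw of 237.89 GB/s corresponds to roughly 446.04 GB/s, which agrees with the reported 446.05 to within rounding.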