
How to Run NCCL Tests Using Crusoe Managed Slurm

Sagar Lulla

Introduction 

Crusoe Managed Slurm provides managed HPC cluster orchestration on Crusoe Cloud, backed by GPU-optimized infrastructure with topology-aware scheduling, shared storage, and automatic hardware remediation via AutoClusters. Container workloads on Slurm workers are launched through Pyxis, which integrates srun --container-image=... directly with the scheduler.

Before running real distributed training on a new cluster, it's worth validating the InfiniBand fabric end-to-end. The standard test is an NCCL all-reduce across multiple GPU nodes, which exercises both intra-node NVLink and inter-node IB and reports a single aggregate bus bandwidth number.

This article walks through that test on Managed Slurm. The procedure works for any IB-enabled GPU instance type on Crusoe — H100, H200, or A100 80GB SXM IB — with a single environment variable change.

Prerequisites

  • Crusoe Managed Slurm Cluster With at Least 2 IB-Enabled GPU Nodes (H100, H200, or A100 80GB SXM IB)
  • Kubeconfig Access to the Underlying CMK Cluster
  • Access to the Login Pod (via SSH)
  • GPU Operator and Network Operator Enabled on CMK (Default for IB-Enabled Node Pools)
  • sinfo Confirms gpu:<model>:8 on Each Worker Node

Instructions

Step 1: Verify GPU Visibility in Slurm

Before running anything, confirm Slurm sees the expected number of GPUs on every worker. From the login pod:

sinfo -N -o "%N %c %m %G"

You should see one line per worker with the GRES column showing gpu:h100:8, gpu:h200:8, or gpu:a100:8, depending on instance type. If GRES shows (null), the GPU Operator or Slurm-side GRES wiring has not converged yet; resolve that before proceeding.
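On a larger cluster it can be easier to let a small script check the GRES column than to read it by eye. The following is a minimal sketch, not part of Crusoe's tooling: check_gres is a hypothetical helper name, and it assumes the four-column "%N %c %m %G" format shown above, with one node per line.

```shell
#!/bin/bash
# check_gres EXPECTED_COUNT   (hypothetical helper, not a Slurm command)
# Reads "node cpus memory gres" lines on stdin and prints every node whose
# GRES column does not advertise EXPECTED_COUNT GPUs.
# Returns non-zero if at least one node fails the check.
check_gres() {
    local expected="$1" bad=0 node cpus mem gres
    while read -r node cpus mem gres; do
        if [ -z "$node" ]; then
            continue
        fi
        # Accept gpu:<model>:<count>, optionally followed by a suffix such as (S:0-1)
        if [[ "$gres" =~ gpu:[^:,]+:([0-9]+) ]] && [ "${BASH_REMATCH[1]}" -eq "$expected" ]; then
            continue
        fi
        echo "BAD $node gres=$gres"
        bad=1
    done
    return "$bad"
}

# Usage on the login pod (-h drops the header row, -N prints one line per node):
#   sinfo -h -N -o "%N %c %m %G" | check_gres 8
```

A silent run with exit status 0 means every listed node reports the expected GPU count; any BAD line names a node to investigate before continuing.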

Step 2: Smoke-Test Pyxis and Single-GPU Allocation

Verify Slurm can hand you a GPU through Pyxis with a one-shot srun from the login pod:

srun --nodes=1 --gres=gpu:1 \
     --container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
     nvidia-smi -L

The first run pulls the nccl-tests container image onto the worker (2–5 minutes); subsequent runs start in seconds. You should see at least one GPU listed.

ℹ️ Note: nvidia-smi may list more GPUs than you requested because NVML reports all device files mounted into the container. Actual GPU isolation for CUDA workloads (including NCCL) is enforced through CUDA_VISIBLE_DEVICES, which Slurm sets correctly based on --gres.
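To check the isolation described in the note, you can count the entries in CUDA_VISIBLE_DEVICES instead of relying on nvidia-smi. This is a small sketch that assumes Slurm exports the variable into the job environment as stated above; count_visible_gpus is a hypothetical helper name.

```shell
#!/bin/bash
# count_visible_gpus LIST   (hypothetical helper)
# Prints how many devices a CUDA_VISIBLE_DEVICES-style comma-separated list names.
count_visible_gpus() {
    local list="$1"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    local IFS=','
    # Split the list on commas into an array and report its length
    local -a devices=($list)
    echo "${#devices[@]}"
}

# Usage inside an allocation (assumes bash is available where the step runs):
#   srun --nodes=1 --gres=gpu:1 bash -c 'echo "$CUDA_VISIBLE_DEVICES"'
# then pass the printed value to count_visible_gpus and compare it with --gres.
```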

Step 3: Create the NCCL Sbatch Script

Save the following as /home/nccl-test.sh on the login pod. /home is the shared PVC visible across login and worker pods, so the file is accessible from everywhere.

#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --exclusive
#SBATCH --output=/home/nccl-runs/nccl-%j-%N.out
#SBATCH --error=/home/nccl-runs/nccl-%j-%N.err
#SBATCH --time=00:30:00

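# Slurm opens the --output and --error files before this script starts, so
# /home/nccl-runs must already exist at submission time; this mkdir is only a safeguard.
mkdir -p /home/nccl-runs

# Crusoe-provided topology XML describing the NVLink/IB layout of the instance type
export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
# A leading ^ means "exclude": keep NCCL off port 1 of mlx5_0
export NCCL_IB_HCA=^mlx5_0:1
export NCCL_IB_MERGE_VFS=0
# Force the NVLS tree algorithm for collectives
export NCCL_ALGO=NVLSTree
# Socket transport tuning: threads per connection and sockets per thread
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8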

srun --mpi=pmix \
     --container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
     --container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo \
     /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

Make it executable and confirm the contents:

chmod +x /home/nccl-test.sh
cat /home/nccl-test.sh

ℹ️ Note: For H200 and A100 SXM IB instance types, replace the NCCL_TOPO_FILE path with the matching topology XML in /etc/crusoe/nccl_topo/. The directory is mounted into worker pods automatically; running ls /etc/crusoe/nccl_topo/ from any slurmd pod will list the available files.
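Because the exact file names for the other instance types are not listed here, one cautious option is to select the file by prefix and fail loudly when the match is ambiguous. The sketch below assumes the other topology files begin with the lowercase model name, as the H100 file does; that naming pattern and the helper name pick_topo_file are assumptions, so confirm them against the ls output first.

```shell
#!/bin/bash
# pick_topo_file DIR MODEL   (hypothetical helper)
# Prints the single XML file in DIR whose name starts with MODEL (for example "h200").
# Fails if zero or several files match, so the choice is never a silent guess.
pick_topo_file() {
    local dir="$1" model="$2" f
    local -a matches=()
    for f in "$dir"/"$model"*.xml; do
        # An unmatched glob stays literal, so keep only paths that really exist
        if [ -e "$f" ]; then
            matches+=("$f")
        fi
    done
    if [ "${#matches[@]}" -ne 1 ]; then
        echo "expected exactly one topology file for '$model' in $dir, found ${#matches[@]}" >&2
        return 1
    fi
    echo "${matches[0]}"
}

# Usage in the sbatch script, replacing the hard-coded export:
#   NCCL_TOPO_FILE="$(pick_topo_file /etc/crusoe/nccl_topo h200)" || exit 1
#   export NCCL_TOPO_FILE
```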

What the key flags do:

  • --nodes=2 --ntasks-per-node=8 --gres=gpu:8 — 16 total MPI ranks, one per GPU, spread across 2 worker nodes.
  • --cpus-per-task=22 — 176 host vCPUs divided across 8 tasks gives each rank a fair share of CPU resources for NCCL transport threads.
  • --exclusive — prevents another job from co-scheduling onto the same node and contending for the IB HCAs.
  • --mpi=pmix — Slurm's native MPI bootstrap (replaces mpirun). PMIx ships in the Crusoe slurmd container image.
  • --container-image=... — Pyxis pulls and launches this container on each worker. The nccl-tests image already has NCCL, CUDA, and OpenMPI built in.
  • --container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo — exposes Crusoe's host-rendered NCCL topology XML into the container so NCCL reads the correct NVLink/IB layout.
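The rank and CPU figures in the list above come from simple arithmetic, which is worth redoing when you change the node count or instance type. This sketch uses the 176-vCPU figure quoted above; other instance types may expose a different count, and the function names are illustrative only.

```shell
#!/bin/bash
# ranks_total NODES TASKS_PER_NODE -> total MPI ranks in the job
ranks_total() { echo $(( $1 * $2 )); }

# cpus_per_task VCPUS TASKS_PER_NODE -> whole vCPUs available to each rank
cpus_per_task() { echo $(( $1 / $2 )); }

ranks_total 2 8        # prints 16
cpus_per_task 176 8    # prints 22
```

Scaling the test to four nodes, for example, means changing only --nodes; ranks_total 4 8 gives 32 ranks while --cpus-per-task stays at 22.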

Step 4: Submit and Monitor

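Create the output directory first. Slurm opens the --output and --error files before the script body runs, so the job fails if /home/nccl-runs does not exist yet:

mkdir -p /home/nccl-runs
sbatch /home/nccl-test.sh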

You'll see Submitted batch job <id>. Track it:

squeue
tail -f /home/nccl-runs/nccl-<id>-*.out

The benchmark takes 1–3 minutes on a warm container (longer on first run while the image pulls).
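If you submit the test from a script, capturing the job ID saves you from copying it by hand. A minimal sketch, assuming sbatch prints the "Submitted batch job <id>" line shown above; parse_job_id is a hypothetical helper name.

```shell
#!/bin/bash
# parse_job_id   (hypothetical helper)
# Reads sbatch's confirmation line on stdin and prints only the numeric job ID.
parse_job_id() {
    sed -n 's/^Submitted batch job \([0-9][0-9]*\).*$/\1/p'
}

# Usage on the login pod:
#   job_id=$(sbatch /home/nccl-test.sh | parse_job_id)
#   squeue -j "$job_id"
#   tail -f /home/nccl-runs/nccl-"$job_id"-*.out
```

sbatch --parsable is an alternative that prints the job ID without the surrounding text.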

Example

A real run on a 2-node H100 80GB SXM IB cluster (16 GPUs total) using the sbatch script above:

# Using devices
#  Rank  0 ... on slurm-cluster-1-nodeset-1-1 device 0  NVIDIA H100 80GB HBM3
#  Rank  1 ... on slurm-cluster-1-nodeset-1-1 device 1  NVIDIA H100 80GB HBM3
#  ...
#  Rank 15 ... on slurm-cluster-1-nodeset-1-0 device 7  NVIDIA H100 80GB HBM3
#
#                                                    out-of-place                       in-place
#       size      count   type  redop  root     time   algbw   busbw  #wrong     time   algbw   busbw  #wrong
#        (B) (elements)                         (us)  (GB/s)  (GB/s)             (us)  (GB/s)  (GB/s)
     1048576     262144  float    sum    -1    121.1    8.66   16.24       0    82.18   12.76   23.92       0
     8388608    2097152  float    sum    -1    153.5   54.64  102.46       0    151.0   55.56  104.17       0
    67108864   16777216  float    sum    -1    474.2  141.51  265.34       0    473.5  141.72  265.72       0
   268435456   67108864  float    sum    -1   1292.4  207.71  389.46       0   1305.7  205.59  385.48       0
  1073741824  268435456  float    sum    -1   4513.5  237.89  446.05       0   4529.5  237.06  444.48       0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 224.86

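The 224.86 GB/s average bus bandwidth is taken over every message size in the sweep, so the small messages pull it down; at 1 GiB the bus bandwidth reaches about 446 GB/s. Every #wrong column is 0, so the results also passed the data correctness check. Together these figures confirm the IB fabric is delivering expected H100 SXM performance for collective operations. This cluster is ready for distributed training workloads such as multi-node FSDP, DeepSpeed, or NeMo runs.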
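The busbw column is derived from algbw. For all-reduce, nccl-tests defines bus bandwidth as algorithm bandwidth multiplied by 2(n-1)/n, where n is the number of ranks (16 here, so the factor is 1.875). The sketch below reproduces that conversion and pulls the summary value out of a log file; both function names are illustrative, and the log format is assumed to match the output shown above.

```shell
#!/bin/bash
# busbw_allreduce ALGBW_GBPS NRANKS
# Converts all-reduce algorithm bandwidth to bus bandwidth: busbw = algbw * 2*(n-1)/n
busbw_allreduce() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# avg_busbw FILE
# Prints the value from the "# Avg bus bandwidth" summary line of an nccl-tests log.
avg_busbw() {
    awk -F: '/^# Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$1"
}

# 1 GiB row of the example: algbw 237.89 GB/s across 16 ranks
busbw_allreduce 237.89 16    # prints 446.04

# Usage against a finished run:
#   avg_busbw /home/nccl-runs/nccl-<id>-<node>.out
```

The computed 446.04 differs from the 446.05 in the table only because the table's algbw value is itself rounded to two decimals.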
