How-To Run NCCL Tests Using Crusoe Managed Slurm

Introduction

Crusoe Managed Slurm provides managed HPC cluster orchestration on Crusoe Cloud, backed by GPU-optimized infrastructure with topology-aware scheduling, shared storage, and automatic hardware remediation via AutoClusters. Container workloads on Slurm workers are launched through Pyxis, a Slurm plugin that adds container options such as --container-image to srun.

Before running real distributed training on a new cluster, it's worth validating the InfiniBand fabric end-to-end. The standard test is an NCCL all-reduce across multiple GPU nodes, which exercises both intra-node NVLink and inter-node IB and reports a single aggregate bus bandwidth number.

This article walks through that test on Managed Slurm. The procedure works for any IB-enabled GPU instance type on Crusoe (H100, H200, or A100 80GB SXM IB); only the NCCL_TOPO_FILE environment variable changes between them.

Prerequisites

Crusoe Managed Slurm Cluster With at Least 2 IB-Enabled GPU Nodes (H100, H200, or A100 80GB SXM IB)
Kubeconfig Access to the Underlying CMK Cluster
Access to the Login Pod (via SSH)
GPU Operator and Network Operator Enabled on CMK (Default for IB-Enabled Node Pools)
sinfo Confirms gpu:<model>:8 on Each Worker Node

Instructions

Step 1: Verify GPU Visibility in Slurm

Before running anything, confirm Slurm sees the expected number of GPUs on every worker. From the login pod:

sinfo -o "%N %c %m %G"

You should see one line per worker with the GRES column showing gpu:h100:8, gpu:h200:8, or gpu:a100:8 depending on instance type. If GRES shows (null), the GPU Operator or Slurm-side GRES wiring has not converged yet — resolve that before proceeding.
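The check above can also be scripted. The following is a minimal sketch, not part of the Crusoe tooling: check_gres is a hypothetical helper that reads "node gres" pairs and flags any node whose GRES is missing or does not advertise the expected GPU count. The node names in the sample input are made up for illustration.

```shell
#!/bin/bash
# check_gres (hypothetical helper): read "<node> <gres>" pairs on stdin and
# report nodes whose GRES is missing or has the wrong GPU count.
check_gres() {
    local expected="${1:-8}"
    awk -v want="$expected" '
        {
            # Accept gpu:<model>:<count>, tolerating a trailing suffix such as "(S:0-1)".
            if (match($2, /^gpu:[^:]+:[0-9]+/)) {
                n = split(substr($2, RSTART, RLENGTH), parts, ":")
                if (parts[n] + 0 == want + 0) { print $1 " OK " parts[n]; next }
                print $1 " BAD count=" parts[n]; bad = 1; next
            }
            print $1 " BAD gres=" $2; bad = 1
        }
        END { exit bad + 0 }
    '
}

# Sample input with made-up node names. On a live cluster, feed it from:
#   sinfo -h -N -o "%N %G" | check_gres 8
printf '%s\n' \
    "worker-0 gpu:h100:8" \
    "worker-1 (null)" | check_gres 8 || echo "GRES check failed"
```

The function exits non-zero when any node fails the check, so it can gate a larger validation script.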

Step 2: Smoke-Test Pyxis and Single-GPU Allocation

Verify Slurm can hand you a GPU through Pyxis with a one-shot srun from the login pod:

srun --nodes=1 --gres=gpu:1 \
     --container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
     nvidia-smi -L

The first run pulls the nccl-tests container image onto the worker (2–5 minutes); subsequent runs start in seconds. You should see at least one GPU listed.

ℹ️ Note: nvidia-smi may list more GPUs than you requested because NVML reports all device files mounted into the container. Actual GPU isolation for CUDA workloads (including NCCL) is enforced through CUDA_VISIBLE_DEVICES, which Slurm sets correctly based on --gres.
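To see the isolation described in the note rather than the NVML view, you can print CUDA_VISIBLE_DEVICES inside the allocation and count its entries. The sketch below assumes the container image provides bash; count_visible_gpus is a hypothetical helper, not something shipped with Slurm or Pyxis.

```shell
#!/bin/bash
# count_visible_gpus (hypothetical helper): count the comma-separated entries
# in a CUDA_VISIBLE_DEVICES-style string. An empty string means zero devices.
count_visible_gpus() {
    local list="$1"
    local -a ids=()
    if [ -z "$list" ]; then
        echo 0
        return 0
    fi
    IFS=',' read -r -a ids <<< "$list"
    echo "${#ids[@]}"
}

# On the login pod, the value could be captured like this (assumes bash
# exists in the image):
#   visible=$(srun --nodes=1 --gres=gpu:1 \
#       --container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
#       bash -c 'echo "$CUDA_VISIBLE_DEVICES"')
#   count_visible_gpus "$visible"
count_visible_gpus "0"
```

With --gres=gpu:1 the count should be 1 even if nvidia-smi -L lists all eight devices.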

Step 3: Create the NCCL Sbatch Script

Save the following as /home/nccl-test.sh on the login pod. /home is the shared PVC visible across login and worker pods, so the file is accessible from everywhere.

#!/bin/bash
#SBATCH --job-name=nccl-allreduce
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=22
#SBATCH --exclusive
#SBATCH --output=/home/nccl-runs/nccl-%j-%N.out
#SBATCH --error=/home/nccl-runs/nccl-%j-%N.err
#SBATCH --time=00:30:00

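# Slurm opens the --output and --error files before this script runs and does
# not create missing directories, so /home/nccl-runs must exist before you run
# sbatch. The mkdir below is a harmless no-op once the directory exists.
mkdir -p /home/nccl-runs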

export NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
export NCCL_IB_HCA=^mlx5_0:1
export NCCL_IB_MERGE_VFS=0
export NCCL_ALGO=NVLSTree
export NCCL_SOCKET_NTHREADS=4
export NCCL_NSOCKS_PERTHREAD=8

srun --mpi=pmix \
     --container-image=public.ecr.aws/hpc-cloud/nccl-tests:latest \
     --container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo \
     /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

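Create the output directory, make the script executable, and confirm the contents. The directory must exist before submission because Slurm does not create missing --output or --error directories:

mkdir -p /home/nccl-runs
chmod +x /home/nccl-test.sh
cat /home/nccl-test.sh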

ℹ️ Note: For H200 and A100 SXM IB instance types, replace the NCCL_TOPO_FILE path with the matching topology XML in /etc/crusoe/nccl_topo/. The directory is mounted into worker pods automatically; running ls /etc/crusoe/nccl_topo/ from any slurmd pod will list the available files.
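Picking the topology file can be sketched as a small lookup. This assumes the other topology files follow the same naming pattern as the H100 file (a model prefix followed by a hyphen and ending in .xml), which this article does not confirm; list the directory first to verify. find_topo_file is a hypothetical helper.

```shell
#!/bin/bash
# find_topo_file (hypothetical helper): print the first "<model>-*.xml" file
# found in a topology directory. The naming pattern is an assumption based on
# the H100 file name used in this article.
find_topo_file() {
    local dir="$1" model="$2" f
    for f in "$dir"/"$model"-*.xml; do
        if [ -e "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    echo "no topology file for '$model' in $dir" >&2
    return 1
}

# Intended use inside the sbatch script, on a worker where the directory is mounted:
#   export NCCL_TOPO_FILE=$(find_topo_file /etc/crusoe/nccl_topo h200)
find_topo_file /etc/crusoe/nccl_topo h100 \
    || echo "directory not available here; run this on a slurmd pod"
```

If several files match the same model prefix, the helper returns the first in glob order, so check the result before relying on it.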

What the key flags do:

--nodes=2 --ntasks-per-node=8 --gres=gpu:8 — 16 total MPI ranks, one per GPU, spread across 2 worker nodes.
--cpus-per-task=22 — 176 host vCPUs divided across 8 tasks gives each rank a fair share of CPU resources for NCCL transport threads.
--exclusive — prevents another job from co-scheduling onto the same node and contending for the IB HCAs.
--mpi=pmix — Slurm's native MPI bootstrap (replaces mpirun). PMIx ships in the Crusoe slurmd container image.
--container-image=... — Pyxis pulls and launches this container on each worker. The nccl-tests image already has NCCL, CUDA, and OpenMPI built in.
--container-mounts=/etc/crusoe/nccl_topo:/etc/crusoe/nccl_topo — exposes Crusoe's host-rendered NCCL topology XML into the container so NCCL reads the correct NVLink/IB layout.
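The rank and CPU arithmetic behind these flags can be written out as a short sketch. plan_layout is a hypothetical helper; the 176 vCPU figure is the one quoted above, and other instance types may differ.

```shell
#!/bin/bash
# plan_layout (hypothetical helper): derive the total rank count and the
# --cpus-per-task value from the node count, GPUs per node, and vCPUs per node.
plan_layout() {
    local nodes="$1" gpus_per_node="$2" vcpus_per_node="$3"
    local ranks=$(( nodes * gpus_per_node ))                   # one rank per GPU
    local cpus_per_task=$(( vcpus_per_node / gpus_per_node ))  # integer division
    echo "ranks=${ranks} cpus_per_task=${cpus_per_task}"
}

plan_layout 2 8 176   # prints: ranks=16 cpus_per_task=22
```

Scaling the test to more nodes changes only the first argument, which maps to --nodes in the sbatch script.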

Step 4: Submit and Monitor

sbatch /home/nccl-test.sh

You'll see Submitted batch job <id>. Track it:

squeue
tail -f /home/nccl-runs/nccl-<id>-*.out

The benchmark takes 1–3 minutes on a warm container (longer on first run while the image pulls).
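To avoid copying the job ID by hand, the submission message can be parsed. This is a minimal sketch with a hypothetical helper name, job_id_from_sbatch; sbatch also offers a --parsable option that prints the job ID directly.

```shell
#!/bin/bash
# job_id_from_sbatch (hypothetical helper): extract the numeric ID from a
# "Submitted batch job <id>" line on stdin. Exits non-zero if no such line.
job_id_from_sbatch() {
    awk '/^Submitted batch job [0-9]+/ { print $4; found = 1; exit }
         END { if (!found) exit 1 }'
}

# Intended use on the login pod:
#   jobid=$(sbatch /home/nccl-test.sh | job_id_from_sbatch)
#   tail -f /home/nccl-runs/nccl-"$jobid"-*.out
echo "Submitted batch job 12345" | job_id_from_sbatch   # prints: 12345
```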

Example

A real run on a 2-node H100 80GB SXM IB cluster (16 GPUs total) using the sbatch script above:

# Using devices
#  Rank  0 ... on slurm-cluster-1-nodeset-1-1 device 0  NVIDIA H100 80GB HBM3
#  Rank  1 ... on slurm-cluster-1-nodeset-1-1 device 1  NVIDIA H100 80GB HBM3
#  ...
#  Rank 15 ... on slurm-cluster-1-nodeset-1-0 device 7  NVIDIA H100 80GB HBM3
#
#       size      count  type  redop  root   time   algbw   busbw  #wrong   time   algbw   busbw  #wrong
     1048576     262144 float  sum    -1    121.1    8.66   16.24      0    82.18  12.76   23.92      0
     8388608    2097152 float  sum    -1    153.5   54.64  102.46      0    151.0  55.56  104.17      0
    67108864   16777216 float  sum    -1    474.2  141.51  265.34      0    473.5 141.72  265.72      0
   268435456   67108864 float  sum    -1   1292.4  207.71  389.46      0   1305.7 205.59  385.48      0
  1073741824  268435456 float  sum    -1   4513.5  237.89  446.05      0   4529.5 237.06  444.48      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 224.86

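The 224.86 GB/s average bus bandwidth confirms the IB fabric is delivering expected H100 SXM performance for collective operations. The average covers every message size in the 1 MiB to 1 GiB sweep (the table above shows a subset of rows), so it sits well below the roughly 446 GB/s reached at the largest size. Time columns are in microseconds and bandwidth columns in GB/s; each row reports out-of-place results first, then in-place. This cluster is ready for distributed training workloads such as multi-node FSDP, DeepSpeed, or NeMo runs.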
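Reading the result can be automated along these lines. The sketch below uses hypothetical helper names and an illustrative 200 GB/s pass threshold that is not a Crusoe specification; pick a threshold suited to your instance type. It also shows how nccl-tests derives all-reduce bus bandwidth from algorithm bandwidth: busbw = algbw * 2 * (n - 1) / n for n ranks.

```shell
#!/bin/bash
# avg_busbw: print the number from a "# Avg bus bandwidth : <value>" line,
# reading from the given files or from stdin. Exits non-zero if absent.
avg_busbw() {
    awk -F: '/Avg bus bandwidth/ { gsub(/[[:space:]]/, "", $2); print $2; found = 1 }
             END { exit !found }' "$@"
}

# busbw_ok: succeed when value >= minimum (floating-point comparison via awk).
busbw_ok() {
    awk -v v="$1" -v m="$2" 'BEGIN { exit !(v + 0 >= m + 0) }'
}

# allreduce_busbw: busbw = algbw * 2 * (n - 1) / n for an all-reduce on n ranks.
allreduce_busbw() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# Demo using the summary line from the run above; 200 is an illustrative threshold.
value=$(printf '%s\n' "# Avg bus bandwidth    : 224.86" | avg_busbw)
if busbw_ok "$value" 200; then
    echo "PASS avg busbw ${value} GB/s"
else
    echo "FAIL avg busbw ${value} GB/s"
fi

# First table row: algbw 8.66 GB/s on 16 ranks.
allreduce_busbw 8.66 16   # prints: 16.24
```

On a cluster, avg_busbw can be pointed at the job's output file, for example avg_busbw /home/nccl-runs/nccl-<id>-*.out.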
