Introduction
Crusoe Cloud MI300X VMs are equipped with high-performance NVIDIA Mellanox InfiniBand networking. RCCL (ROCm Collective Communications Library) is AMD's equivalent of NVIDIA's NCCL — it handles collective communications (all-reduce, broadcast, etc.) across GPUs and nodes, and is the standard tool for validating IB fabric health before scaling distributed training workloads.
Running an RCCL all-reduce test confirms that your InfiniBand interconnect is functioning correctly and performing at expected bandwidth. This is particularly useful before launching large multi-node training jobs where a degraded fabric will silently bottleneck throughput.
Crusoe strongly recommends using the ubuntu-rocm:latest image (currently 22.04-6.2), which ships with ROCm pre-installed and includes Crusoe-specific RCCL topology files at /etc/crusoe/rccl_topo/ and /etc/rccl.conf. These topology files are required for RCCL to correctly map GPU and IB topology on a VM — without them, auto-detection fails and performance will be degraded.
Prerequisites
- Access to a Crusoe Cloud Project With Appropriate Permissions
- One or More InfiniBand-Supported VMs Running ubuntu-rocm:latest
- SSH Access to All VMs (If Applicable)
Instructions
- Install Dependencies
$ sudo apt-get update
$ sudo apt-get install -y git cmake libcap-dev
- Set Up Build Directories
$ export INSTALL_DIR=$HOME/ompi_for_gpu
$ export BUILD_DIR=/tmp/ompi_for_gpu_build
$ mkdir -p $BUILD_DIR
- Build and Install UCX With ROCm Support
UCX is used internally by OpenMPI for transport. It must be built with ROCm support enabled.
$ export UCX_DIR=$INSTALL_DIR/ucx
$ cd $BUILD_DIR
$ git clone https://github.com/openucx/ucx.git -b v1.15.x
$ cd ucx
$ ./autogen.sh
$ mkdir build && cd build
$ ../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
$ make -j $(nproc)
$ make -j $(nproc) install
- Build and Install OpenMPI With ROCm Support
OpenMPI enables collective communications across multiple nodes and must also be built with ROCm support enabled.
$ export OMPI_DIR=$INSTALL_DIR/ompi
$ cd $BUILD_DIR
$ git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
$ cd ompi
$ ./autogen.pl
$ mkdir build && cd build
$ ../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
$ make -j $(nproc)
$ make install
- Update Environment Variables
$ export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
$ export PATH=$OMPI_DIR/bin:$PATH
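Before moving on, it can be worth confirming that both builds picked up ROCm. The following is a minimal sketch, assuming that `ucx_info -v` and `ompi_info` each echo the configure options they were built with; the helper names `built_with_rocm` and `report_rocm_support` are hypothetical and not part of either tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads build-info text on stdin and succeeds if it
# mentions the --with-rocm configure flag used in the steps above.
built_with_rocm() {
    grep -q -e '--with-rocm'
}

# Hypothetical helper: run an info command and report whether its output
# shows a ROCm-enabled build. Skips quietly if the tool is not installed.
report_rocm_support() {
    local name="$1"; shift
    if ! command -v "$1" >/dev/null 2>&1; then
        echo "$name: $1 not found"
        return 0
    fi
    if "$@" 2>/dev/null | built_with_rocm; then
        echo "$name: configured with ROCm"
    else
        echo "$name: no --with-rocm flag found"
    fi
}

# UCX_DIR falls back to the install location used in this guide.
report_rocm_support UCX "${UCX_DIR:-$HOME/ompi_for_gpu/ucx}/bin/ucx_info" -v
report_rocm_support OpenMPI ompi_info
```

If either line reports that the flag is missing, re-running the corresponding configure step is the likely fix.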
- Build and Install RCCL
$ cd $HOME
$ git clone https://github.com/ROCm/rccl.git
$ cd rccl
$ ./install.sh -d
ℹ️ Note: The linking step takes approximately 5 minutes to complete.
- Enable P2P GPU Transport
$ export HSA_FORCE_FINE_GRAIN_PCIE=1
- Clone and Build RCCL Tests
$ cd $HOME
$ git clone https://github.com/ROCm/rccl-tests.git
$ cd rccl-tests
$ make MPI=1 MPI_HOME=$HOME/ompi_for_gpu/ompi HIP_HOME=/opt/rocm RCCL_HOME=$HOME/rccl
- Run the All-Reduce Test
For a single-node test (exercises collective communication across the 8 local GPUs):
$ $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 8
For a multi-node test, first create a hostfile using the private IP addresses of your VMs:
$ cat << EOF > $HOME/hostfile
<node_ip1> slots=8
<node_ip2> slots=8
EOF
Then run across nodes:
$ $OMPI_DIR/bin/mpirun -np 16 -hostfile $HOME/hostfile \
    -x PATH -x LD_LIBRARY_PATH -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
    $HOME/rccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 1
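The `-np 16` value above is the total number of slots in the hostfile (2 nodes with 8 GPUs each, one rank per GPU because of `-g 1`). For larger clusters, the hostfile and rank count can be derived rather than typed by hand. This is a rough sketch: the function names `make_hostfile` and `total_slots` are hypothetical, and the IP addresses are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print one hostfile line per node IP.
# First argument is the slot count per node (8 GPUs per node in this guide).
make_hostfile() {
    local slots="$1"; shift
    local ip
    for ip in "$@"; do
        printf '%s slots=%s\n' "$ip" "$slots"
    done
}

# Hypothetical helper: sum the slots= values of a hostfile read on stdin.
# The result is the value to pass to mpirun -np when running one rank per GPU.
total_slots() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^slots=/) { split($i, kv, "="); n += kv[2] }
    } END { print n + 0 }'
}

# Placeholder addresses; replace with the private IPs of your VMs.
make_hostfile 8 10.0.0.1 10.0.0.2 | total_slots
```

In practice, the output of `make_hostfile` would be redirected to `$HOME/hostfile` and the result of `total_slots < $HOME/hostfile` passed to `-np`.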
- Verify Topology Configuration
Unlike bare-metal, a VM cannot auto-detect PCI topology. The ubuntu-rocm:latest image handles this by shipping the topology file /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml and the configuration file /etc/rccl.conf. Before running any tests, verify that both files exist on your VM:
$ ls /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml
$ ls /etc/rccl.conf
No additional topology configuration is needed if using the ubuntu-rocm:latest curated image.
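When checking several VMs, the same verification can be scripted so that a missing file produces a clear message and a non-zero exit status. The sketch below uses a hypothetical helper named `check_files`; only the two paths come from this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report every path that is not a regular file and
# return non-zero if at least one is missing.
check_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

if check_files /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml /etc/rccl.conf; then
    echo "topology files present"
else
    echo "topology files missing; RCCL performance may be degraded"
fi
```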
Example
A healthy test will show zero errors and bus bandwidth scaling with message size. Below is an example of expected output from a single-node test across 8 MI300X GPUs. Focus on the #wrong column (all zeros = no errors) and the busbw column at large message sizes (128 MB+) where peak bandwidth should reach approximately 314 GB/s:
$ $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 8
[1782340907.303489] [mi300x-test:211747:0] parser.c:2036 UCX WARN unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1782340907.303489] [mi300x-test:211747:0] parser.c:2036 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 8 minBytes 8 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop_deprecated:40b1b17
# Using devices
# Rank 0 Group 0 Pid 211747 on mi300x-test device 0 [0002:00:01] AMD Instinct MI300X
# Rank 1 Group 0 Pid 211747 on mi300x-test device 1 [0002:00:02] AMD Instinct MI300X
# Rank 2 Group 0 Pid 211747 on mi300x-test device 2 [0002:00:03] AMD Instinct MI300X
# Rank 3 Group 0 Pid 211747 on mi300x-test device 3 [0002:00:04] AMD Instinct MI300X
# Rank 4 Group 0 Pid 211747 on mi300x-test device 4 [0003:00:01] AMD Instinct MI300X
# Rank 5 Group 0 Pid 211747 on mi300x-test device 5 [0003:00:02] AMD Instinct MI300X
# Rank 6 Group 0 Pid 211747 on mi300x-test device 6 [0003:00:03] AMD Instinct MI300X
# Rank 7 Group 0 Pid 211747 on mi300x-test device 7 [0003:00:04] AMD Instinct MI300X
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 62.11 0.00 0.00 0 58.03 0.00 0.00 0
16 4 float sum -1 58.57 0.00 0.00 0 62.26 0.00 0.00 0
32 8 float sum -1 60.79 0.00 0.00 0 62.18 0.00 0.00 0
64 16 float sum -1 62.08 0.00 0.00 0 65.97 0.00 0.00 0
128 32 float sum -1 61.15 0.00 0.00 0 61.49 0.00 0.00 0
256 64 float sum -1 60.89 0.00 0.01 0 61.66 0.00 0.01 0
512 128 float sum -1 60.93 0.01 0.01 0 61.93 0.01 0.01 0
1024 256 float sum -1 61.94 0.02 0.03 0 66.42 0.02 0.03 0
2048 512 float sum -1 60.90 0.03 0.06 0 59.13 0.03 0.06 0
4096 1024 float sum -1 60.11 0.07 0.12 0 61.27 0.07 0.12 0
8192 2048 float sum -1 60.81 0.13 0.24 0 60.91 0.13 0.24 0
16384 4096 float sum -1 66.11 0.25 0.43 0 61.99 0.26 0.46 0
32768 8192 float sum -1 63.01 0.52 0.91 0 62.55 0.52 0.92 0
65536 16384 float sum -1 62.68 1.05 1.83 0 62.07 1.06 1.85 0
131072 32768 float sum -1 63.92 2.05 3.59 0 62.78 2.09 3.65 0
262144 65536 float sum -1 74.87 3.50 6.13 0 69.43 3.78 6.61 0
524288 131072 float sum -1 78.32 6.69 11.71 0 82.54 6.35 11.12 0
1048576 262144 float sum -1 99.16 10.57 18.50 0 95.92 10.93 19.13 0
2097152 524288 float sum -1 96.20 21.80 38.15 0 97.61 21.49 37.60 0
4194304 1048576 float sum -1 101.1 41.50 72.63 0 97.04 43.22 75.64 0
8388608 2097152 float sum -1 112.1 74.82 130.93 0 114.6 73.23 128.15 0
16777216 4194304 float sum -1 173.5 96.72 169.26 0 187.2 89.64 156.87 0
33554432 8388608 float sum -1 260.5 128.83 225.45 0 273.1 122.86 215.00 0
67108864 16777216 float sum -1 442.8 151.55 265.21 0 456.6 146.97 257.20 0
134217728 33554432 float sum -1 801.3 167.50 293.13 0 808.7 165.97 290.45 0
268435456 67108864 float sum -1 1544.5 173.80 304.15 0 1568.4 171.15 299.51 0
536870912 134217728 float sum -1 3039.8 176.62 309.08 0 3046.2 176.25 308.43 0
1073741824 268435456 float sum -1 6004.3 178.83 312.95 0 6009.7 178.67 312.67 0
2147483648 536870912 float sum -1 11960 179.56 314.23 0 11977 179.30 313.77 0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 84.7975
#
# Collective test concluded: all_reduce_perf
Peak bus bandwidth observed was 314.23 GB/s on a single MI300X node. The Avg bus bandwidth figure is computed across all message sizes, including small, latency-bound messages that show near-zero bandwidth, so focus on large message sizes (128 MB+) as the meaningful performance indicator.
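Reading the table by eye gets tedious across many runs. The sketch below summarizes saved output by reporting the peak busbw at or above a size threshold and the total of the #wrong columns. It assumes the 13-column row layout shown above; the function names `summarize_rccl` and `allreduce_busbw` are hypothetical. The second helper applies the all-reduce relation busbw = algbw × 2(n−1)/n for n ranks, as described in the nccl-tests performance notes, which reproduces the 314.23 GB/s figure from the reported 179.56 GB/s algbw on 8 GPUs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: summarize all_reduce_perf output read on stdin.
# Assumed row layout (13 fields):
#   size count type redop root | time algbw busbw #wrong | time algbw busbw #wrong
# Optional argument: minimum message size in bytes (default 134217728, the 128 MB row above).
summarize_rccl() {
    local min_bytes="${1:-134217728}"
    awk -v min="$min_bytes" '
        $1 ~ /^[0-9]+$/ && NF == 13 {
            wrong += $9 + $13                 # out-of-place + in-place error counts
            if ($1 + 0 >= min + 0) {
                if ($8 + 0 > peak)  peak = $8 + 0    # out-of-place busbw
                if ($12 + 0 > peak) peak = $12 + 0   # in-place busbw
            }
        }
        END { printf "peak_busbw_GBps=%.2f wrong=%d\n", peak, wrong }
    '
}

# Hypothetical helper: all-reduce bus bandwidth from algorithm bandwidth and rank count.
allreduce_busbw() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}

# Two rows copied from the example output above.
summarize_rccl <<'EOF'
1073741824 268435456 float sum -1 6004.3 178.83 312.95 0 6009.7 178.67 312.67 0
2147483648 536870912 float sum -1 11960 179.56 314.23 0 11977 179.30 313.77 0
EOF
allreduce_busbw 179.56 8
```

To use it on a real run, redirect the test output to a file and pipe that file into `summarize_rccl`; a non-zero `wrong` value or a peak well below the expected figure is a signal to investigate.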