
How-To: Validate InfiniBand Performance with the RCCL All-Reduce Test

Irman Mashiana

Introduction

Crusoe Cloud MI300X VMs are equipped with high-performance NVIDIA Mellanox InfiniBand networking. RCCL (ROCm Communication Collectives Library) is AMD's counterpart to NVIDIA's NCCL: it implements collective operations (all-reduce, broadcast, etc.) across GPUs and nodes, and its test suite is a standard tool for validating IB fabric health before scaling distributed training workloads.

Running an RCCL all-reduce test confirms that your InfiniBand interconnect is functioning correctly and performing at expected bandwidth. This is particularly useful before launching large multi-node training jobs where a degraded fabric will silently bottleneck throughput.

Crusoe strongly recommends using the ubuntu-rocm:latest image (currently 22.04-6.2), which ships with ROCm pre-installed and includes Crusoe-specific RCCL topology files at /etc/crusoe/rccl_topo/ and /etc/rccl.conf. These topology files are required for RCCL to correctly map GPU and IB topology on a VM — without them, auto-detection fails and performance will be degraded.

Prerequisites

  • Access to a Crusoe Cloud Project With Appropriate Permissions
  • One or More InfiniBand-Supported VMs Running ubuntu-rocm:latest
  • SSH Access to All VMs (If Applicable)

Instructions

  1. Install Dependencies
    • $ sudo apt-get update
      $ sudo apt-get install -y git cmake libcap-dev
  2. Set Up Build Directories
    • $ export INSTALL_DIR=$HOME/ompi_for_gpu
      $ export BUILD_DIR=/tmp/ompi_for_gpu_build
      $ mkdir -p $BUILD_DIR
  3. Build and Install UCX With ROCm Support
    • UCX is used internally by OpenMPI for transport. It must be built with ROCm support enabled.

      $ export UCX_DIR=$INSTALL_DIR/ucx
      $ cd $BUILD_DIR
      $ git clone https://github.com/openucx/ucx.git -b v1.15.x
      $ cd ucx
      $ ./autogen.sh
      $ mkdir build && cd build
      $ ../configure --prefix=$UCX_DIR --with-rocm=/opt/rocm
      $ make -j $(nproc)
      $ make -j $(nproc) install
  4. Build and Install OpenMPI With ROCm Support
    • OpenMPI enables collective communications across multiple nodes and must also be built with ROCm support enabled.

      $ export OMPI_DIR=$INSTALL_DIR/ompi
      $ cd $BUILD_DIR
      $ git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
      $ cd ompi
      $ ./autogen.pl
      $ mkdir build && cd build
      $ ../configure --prefix=$OMPI_DIR --with-ucx=$UCX_DIR --with-rocm=/opt/rocm
      $ make -j $(nproc)
      $ make install
  5. Update Environment Variables
    • $ export LD_LIBRARY_PATH=$OMPI_DIR/lib:$UCX_DIR/lib:/opt/rocm/lib
      $ export PATH=$OMPI_DIR/bin:$PATH
  6. Build and Install RCCL
    • $ cd $HOME
      $ git clone https://github.com/ROCm/rccl.git
      $ cd rccl
      $ ./install.sh -d
    • ℹ️ Note: The linking step takes approximately 5 minutes to complete.

  7. Enable P2P GPU Transport
    • $ export HSA_FORCE_FINE_GRAIN_PCIE=1
  8. Clone and Build RCCL Tests
    • $ cd $HOME
      $ git clone https://github.com/ROCm/rccl-tests.git
      $ cd rccl-tests
      $ make MPI=1 MPI_HOME=$HOME/ompi_for_gpu/ompi HIP_HOME=/opt/rocm/bin/hipcc RCCL_HOME=$HOME/rccl
  9. Run the All-Reduce Test
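    • For a single-node test (exercises GPU-to-GPU communication across the 8 local GPUs; this traffic stays inside the node, so use the multi-node test below to exercise the InfiniBand fabric itself):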

      $ $HOME/rccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 8
    • For a multi-node test, first create a hostfile using the private IP addresses of your VMs:

      $ cat << EOF > $HOME/hostfile
      <node_ip1> slots=8
      <node_ip2> slots=8
      EOF
    • Then run across nodes:

      $ $OMPI_DIR/bin/mpirun -np 16 -hostfile $HOME/hostfile \
        -x PATH -x LD_LIBRARY_PATH -x HSA_FORCE_FINE_GRAIN_PCIE=1 \
        $HOME/rccl-tests/build/all_reduce_perf -b 128M -e 2G -f 2 -g 1
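
In the multi-node command above, -np 16 equals the total number of slots in the hostfile (2 nodes × 8 GPUs, one rank per GPU). A minimal sketch for deriving that value from the file instead of hard-coding it, assuming every host line carries an explicit slots=N entry. The name count_slots is a hypothetical helper, not part of OpenMPI, and the 192.0.2.x addresses are documentation placeholders rather than real VM IPs:

```shell
#!/usr/bin/env bash
# count_slots (hypothetical helper): sum the slots=N entries in a hostfile.
count_slots() {
  awk '
    NF == 0 || $1 ~ /^#/ { next }        # skip blank lines and comments
    {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^slots=[0-9]+$/) {     # pick up the slots=N token
          split($i, kv, "=")
          total += kv[2]
        }
    }
    END { print total + 0 }
  ' "$1"
}

# Demo with a throwaway hostfile (placeholder addresses).
demo_hostfile=$(mktemp)
printf '%s\n' '192.0.2.10 slots=8' '192.0.2.11 slots=8' > "$demo_hostfile"
NP=$(count_slots "$demo_hostfile")
echo "total ranks: $NP"                  # prints: total ranks: 16
rm -f "$demo_hostfile"
```

With a real hostfile, the result could be passed straight to mpirun, for example NP=$(count_slots $HOME/hostfile) followed by mpirun -np "$NP".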

Verify Topology Configuration

Unlike a bare-metal host, a VM cannot auto-detect PCI topology. The ubuntu-rocm:latest image handles this by shipping the topology file /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml together with /etc/rccl.conf. Before running any tests, verify that both files exist on your VM:

$ ls /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml
$ ls /etc/rccl.conf

No additional topology configuration is needed if using the ubuntu-rocm:latest curated image.
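
When checking several VMs, the two ls commands can be wrapped in a small pre-flight check that returns a non-zero status if either file is missing. This is only a sketch: check_required_files is a hypothetical helper name, and the two paths are the ones listed in this article.

```shell
#!/usr/bin/env bash
# check_required_files (hypothetical helper): report each path as ok or MISSING
# and return 1 if any path is not readable.
check_required_files() {
  missing=0
  for f in "$@"; do
    if [ -r "$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      missing=1
    fi
  done
  return "$missing"
}

# Paths taken from this article; on other images they may not exist.
check_required_files \
  /etc/crusoe/rccl_topo/mi300x-192gb-ib.xml \
  /etc/rccl.conf \
  || echo "Topology files not found: is this VM running ubuntu-rocm:latest?"
```

The non-zero return value makes it easy to stop a wrapper script before a test is launched on a VM that lacks the topology files.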

Example

A healthy test will show zero errors and bus bandwidth scaling with message size. Below is an example of expected output from a single-node test across 8 MI300X GPUs. Focus on the #wrong column (all zeros = no errors) and the busbw column at large message sizes (128 MB+) where peak bandwidth should reach approximately 314 GB/s:

$HOME/rccl-tests/build/all_reduce_perf -b 8 -e 2G -f 2 -g 8
[1782340907.303489] [mi300x-test:211747:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_DIR (maybe: UCX_TLS?)
[1782340907.303489] [mi300x-test:211747:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
# Collective test starting: all_reduce_perf
# nThread 1 nGpus 8 minBytes 8 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
rccl-tests: Version develop_deprecated:40b1b17
# Using devices
#  Rank  0 Group  0 Pid 211747 on mi300x-test device  0 [0002:00:01] AMD Instinct MI300X
#  Rank  1 Group  0 Pid 211747 on mi300x-test device  1 [0002:00:02] AMD Instinct MI300X
#  Rank  2 Group  0 Pid 211747 on mi300x-test device  2 [0002:00:03] AMD Instinct MI300X
#  Rank  3 Group  0 Pid 211747 on mi300x-test device  3 [0002:00:04] AMD Instinct MI300X
#  Rank  4 Group  0 Pid 211747 on mi300x-test device  4 [0003:00:01] AMD Instinct MI300X
#  Rank  5 Group  0 Pid 211747 on mi300x-test device  5 [0003:00:02] AMD Instinct MI300X
#  Rank  6 Group  0 Pid 211747 on mi300x-test device  6 [0003:00:03] AMD Instinct MI300X
#  Rank  7 Group  0 Pid 211747 on mi300x-test device  7 [0003:00:04] AMD Instinct MI300X
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     sum      -1    62.11    0.00    0.00      0    58.03    0.00    0.00      0
          16             4     float     sum      -1    58.57    0.00    0.00      0    62.26    0.00    0.00      0
          32             8     float     sum      -1    60.79    0.00    0.00      0    62.18    0.00    0.00      0
          64            16     float     sum      -1    62.08    0.00    0.00      0    65.97    0.00    0.00      0
         128            32     float     sum      -1    61.15    0.00    0.00      0    61.49    0.00    0.00      0
         256            64     float     sum      -1    60.89    0.00    0.01      0    61.66    0.00    0.01      0
         512           128     float     sum      -1    60.93    0.01    0.01      0    61.93    0.01    0.01      0
        1024           256     float     sum      -1    61.94    0.02    0.03      0    66.42    0.02    0.03      0
        2048           512     float     sum      -1    60.90    0.03    0.06      0    59.13    0.03    0.06      0
        4096          1024     float     sum      -1    60.11    0.07    0.12      0    61.27    0.07    0.12      0
        8192          2048     float     sum      -1    60.81    0.13    0.24      0    60.91    0.13    0.24      0
       16384          4096     float     sum      -1    66.11    0.25    0.43      0    61.99    0.26    0.46      0
       32768          8192     float     sum      -1    63.01    0.52    0.91      0    62.55    0.52    0.92      0
       65536         16384     float     sum      -1    62.68    1.05    1.83      0    62.07    1.06    1.85      0
      131072         32768     float     sum      -1    63.92    2.05    3.59      0    62.78    2.09    3.65      0
      262144         65536     float     sum      -1    74.87    3.50    6.13      0    69.43    3.78    6.61      0
      524288        131072     float     sum      -1    78.32    6.69   11.71      0    82.54    6.35   11.12      0
     1048576        262144     float     sum      -1    99.16   10.57   18.50      0    95.92   10.93   19.13      0
     2097152        524288     float     sum      -1    96.20   21.80   38.15      0    97.61   21.49   37.60      0
     4194304       1048576     float     sum      -1    101.1   41.50   72.63      0    97.04   43.22   75.64      0
     8388608       2097152     float     sum      -1    112.1   74.82  130.93      0    114.6   73.23  128.15      0
    16777216       4194304     float     sum      -1    173.5   96.72  169.26      0    187.2   89.64  156.87      0
    33554432       8388608     float     sum      -1    260.5  128.83  225.45      0    273.1  122.86  215.00      0
    67108864      16777216     float     sum      -1    442.8  151.55  265.21      0    456.6  146.97  257.20      0
   134217728      33554432     float     sum      -1    801.3  167.50  293.13      0    808.7  165.97  290.45      0
   268435456      67108864     float     sum      -1   1544.5  173.80  304.15      0   1568.4  171.15  299.51      0
   536870912     134217728     float     sum      -1   3039.8  176.62  309.08      0   3046.2  176.25  308.43      0
  1073741824     268435456     float     sum      -1   6004.3  178.83  312.95      0   6009.7  178.67  312.67      0
  2147483648     536870912     float     sum      -1    11960  179.56  314.23      0    11977  179.30  313.77      0
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 84.7975
#
# Collective test concluded: all_reduce_perf
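
The algbw and busbw columns can be reproduced from the size and time columns. For all-reduce, the rccl-tests and nccl-tests performance notes define algbw as bytes divided by time and busbw as algbw × 2(n−1)/n, where n is the number of ranks. A short sketch that redoes this arithmetic for the last row of the table (busbw_allreduce is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# busbw_allreduce (hypothetical helper): recompute algbw and busbw for one row.
#   $1 = message size in bytes, $2 = time in microseconds, $3 = number of ranks
busbw_allreduce() {
  awk -v bytes="$1" -v us="$2" -v n="$3" 'BEGIN {
    algbw = bytes / us / 1000            # bytes per microsecond / 1000 = GB/s
    busbw = algbw * 2 * (n - 1) / n      # all-reduce correction factor
    printf "algbw=%.2f busbw=%.2f\n", algbw, busbw
  }'
}

# Last out-of-place row above: 2147483648 bytes in 11960 us across 8 GPUs.
busbw_allreduce 2147483648 11960 8       # prints: algbw=179.56 busbw=314.22
```

The result differs from the table's 314.23 only in the last digit, because the time shown in the table is rounded.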

Peak bus bandwidth observed was 314.23 GB/s on a single MI300X node. The Avg bus bandwidth figure (84.80 GB/s here) is averaged over all message sizes, including small, latency-bound messages that show near-zero bandwidth, so treat the large message sizes (128 MB+) as the meaningful performance indicator.
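
Both checks described above (all #wrong values are zero, and peak busbw at large sizes) can be automated against a saved log. The sketch below assumes the 13-column layout shown in this article, with busbw in columns 8 and 12 and #wrong in columns 9 and 13; summarize_rccl is a hypothetical helper, not a tool shipped with rccl-tests, and other rccl-tests versions may print a different layout.

```shell
#!/usr/bin/env bash
# summarize_rccl (hypothetical helper): scan all_reduce_perf output, count rows
# with a non-zero #wrong value, and report the peak busbw seen.
# Returns non-zero if no data rows were found or any row reported errors.
summarize_rccl() {
  awk '
    $1 ~ /^[0-9]+$/ && NF == 13 {          # data rows only; skips # and log lines
      rows++
      if ($9 != "0" || $13 != "0") bad++   # out-of-place / in-place #wrong
      if ($8 + 0 > peak)  peak = $8 + 0    # out-of-place busbw
      if ($12 + 0 > peak) peak = $12 + 0   # in-place busbw
    }
    END {
      printf "rows=%d wrong_rows=%d peak_busbw=%.2f\n", rows, bad, peak
      exit (rows == 0 || bad > 0)
    }
  ' "$1"
}

# Demo on two rows copied from the example output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# Using devices
   134217728      33554432     float     sum      -1    801.3  167.50  293.13      0    808.7  165.97  290.45      0
  2147483648     536870912     float     sum      -1    11960  179.56  314.23      0    11977  179.30  313.77      0
EOF
summarize_rccl "$sample"                   # prints: rows=2 wrong_rows=0 peak_busbw=314.23
rm -f "$sample"
```

To use it on a real run, the test output could be saved with tee (for example by appending | tee $HOME/all_reduce.log to the test command) and the log path passed to summarize_rccl.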
