How to Run NCCL Tests Using SLURM

Randall Gee

Background

To generate load on InfiniBand and measure bandwidth performance, you can run an NCCL test using our SLURM solution.

 

Prerequisites

Step by Step

1. Create a file named terraform.tfvars and fill in the required Terraform variables. To add more SLURM users, add an entry with their SSH public key to slurm_users.

# common configuration
# fill in data
location = "us-east1-a" #us-southcentral1-a
project_id = " "
ssh_public_key_path = "~/.ssh/id_ed25519.pub"
# crusoe networking vpc-subnets list
vpc_subnet_id = "80cf6356-da22-42c8-8925-d5fb1e3e55ad"
# slurm-compute-node configuration
slurm_compute_node_type = "h100-80gb-sxm-ib.8x" # "a100-80gb-sxm-ib.8x"
slurm_compute_node_ib_network_id = " "
slurm_compute_node_count = 2

# slurm users configuration
slurm_users = [{
  name       = "user1"
  uid        = 1001
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ user1@crusoe.ai"
}, {
  name       = "user2"
  uid        = 1002
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ user2@crusoe.ai"
}]
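
Before applying, it can help to confirm that no placeholder values were left blank. The following is a minimal sketch, not part of the Crusoe tooling: the helper name unset_variables is hypothetical, and the regular expression only recognizes simple one-line name = "value" assignments rather than full HCL syntax.

```python
import re

# Matches simple one-line `name = "value"` assignments; this is not a full HCL parser.
ASSIGNMENT = re.compile(r'^\s*(\w+)\s*=\s*"([^"]*)"')

def unset_variables(tfvars_text):
    """Return the names whose quoted value is empty or whitespace only."""
    missing = []
    for line in tfvars_text.splitlines():
        match = ASSIGNMENT.match(line)
        if match and not match.group(2).strip():
            missing.append(match.group(1))
    return missing

# A shortened copy of the template above, with the two blank placeholders.
sample = '''
location = "us-east1-a"
project_id = " "
slurm_compute_node_ib_network_id = " "
slurm_compute_node_count = 2
'''

print(unset_variables(sample))
```

To check a real file, pass the contents of terraform.tfvars instead of the sample string, for example with pathlib.Path("terraform.tfvars").read_text().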

2. After filling in the variables, run the following commands to initialize Terraform, preview the changes, and apply them:

terraform init

terraform plan

terraform apply

3. The SLURM deployment takes some time to complete. After it finishes, SSH into the login node.

4. From the login node, you can run NCCL tests to validate InfiniBand performance.

Make sure the following settings are in your /etc/nccl.conf file:

NCCL_SOCKET_NTHREADS=4
NCCL_NSOCKS_PERTHREAD=8
NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
NCCL_IB_MERGE_VFS=0
NCCL_IB_HCA=^mlx5_0:1
# NCCL_DEBUG=WARN # for more verbose logging
NCCL_ALGO=NVLSTree
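
One way to confirm these settings is a small script that compares the file against the values listed above. This is a sketch under stated assumptions: the function names are hypothetical, the expected values are copied from this article (the topology file shown is for H100 nodes), and the parser only handles plain KEY=VALUE lines and # comments.

```python
from pathlib import Path

# Values copied from the list above; adjust NCCL_TOPO_FILE for other node types.
EXPECTED = {
    "NCCL_SOCKET_NTHREADS": "4",
    "NCCL_NSOCKS_PERTHREAD": "8",
    "NCCL_TOPO_FILE": "/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml",
    "NCCL_IB_MERGE_VFS": "0",
    "NCCL_IB_HCA": "^mlx5_0:1",
    "NCCL_ALGO": "NVLSTree",
}

def parse_nccl_conf(text):
    """Return KEY=VALUE pairs, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings

def find_mismatches(text, expected=EXPECTED):
    """Map each missing or differing key to (expected value, found value)."""
    found = parse_nccl_conf(text)
    return {k: (v, found.get(k)) for k, v in expected.items() if found.get(k) != v}

if __name__ == "__main__":
    conf = Path("/etc/nccl.conf")
    text = conf.read_text() if conf.exists() else ""
    for key, (want, got) in find_mismatches(text).items():
        print(f"{key}: expected {want!r}, found {got!r}")
```

The script prints nothing when every listed setting is present with the expected value.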

5. Then run an NCCL test across your compute nodes.

srun -N 2 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1

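-N 2 is the number of compute nodes to run on; set it to the number of compute nodes in your SLURM cluster (slurm_compute_node_count in terraform.tfvars). In the all_reduce_perf arguments, -b and -e set the smallest and largest message sizes, -f 2 doubles the size at each step, and -g 1 assigns one GPU to each task.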
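
To run the same test on a different number of nodes, only the -N value needs to change. The sketch below rebuilds the command from a node count; build_srun_command is a hypothetical helper, and its defaults (8 GPUs per node, 22 CPUs per task) are taken from the command above and assume the 8-GPU node types used in this article.

```python
import shlex

def build_srun_command(nodes, gpus_per_node=8, cpus_per_task=22,
                       min_bytes="1M", max_bytes="1G"):
    """Return the srun argument list for all_reduce_perf on `nodes` nodes."""
    return [
        "srun",
        "-N", str(nodes),
        f"--ntasks-per-node={gpus_per_node}",  # one task (rank) per GPU
        f"--gres=gpu:{gpus_per_node}",
        f"--cpus-per-task={cpus_per_task}",
        "--mpi=pmix",
        "/opt/nccl-tests/build/all_reduce_perf",
        "-b", min_bytes, "-e", max_bytes,
        "-f", "2",   # multiply the message size by 2 at each step
        "-g", "1",   # one GPU per task
    ]

nodes = 2
command = build_srun_command(nodes)
print(shlex.join(command))
# With one GPU per task, the total rank count is nodes * tasks per node.
print("total ranks:", nodes * 8)
```

For two nodes this gives 16 ranks, which matches Rank 0 through Rank 15 in the sample output. On the login node, the list can be passed to subprocess.run(command, check=True) to launch the test.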

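Below is sample output from a 2-node test. Its header shows minBytes 8 and maxBytes 2147483648, so this run used a wider size range (-b 8 -e 2G) than the command above.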

# nThread 1 nGpus 1 minBytes 8 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid 133881 on slurm-compute-node-0 device  0 [0x00] NVIDIA H100 80GB HBM3
#  Rank  1 Group  0 Pid 133882 on slurm-compute-node-0 device  1 [0x00] NVIDIA H100 80GB HBM3
#  Rank  2 Group  0 Pid 133883 on slurm-compute-node-0 device  2 [0x00] NVIDIA H100 80GB HBM3
#  Rank  3 Group  0 Pid 133884 on slurm-compute-node-0 device  3 [0x00] NVIDIA H100 80GB HBM3
#  Rank  4 Group  0 Pid 133885 on slurm-compute-node-0 device  4 [0x00] NVIDIA H100 80GB HBM3
#  Rank  5 Group  0 Pid 133886 on slurm-compute-node-0 device  5 [0x00] NVIDIA H100 80GB HBM3
#  Rank  6 Group  0 Pid 133887 on slurm-compute-node-0 device  6 [0x00] NVIDIA H100 80GB HBM3
#  Rank  7 Group  0 Pid 133888 on slurm-compute-node-0 device  7 [0x00] NVIDIA H100 80GB HBM3
#  Rank  8 Group  0 Pid 132251 on slurm-compute-node-1 device  0 [0x00] NVIDIA H100 80GB HBM3
#  Rank  9 Group  0 Pid 132252 on slurm-compute-node-1 device  1 [0x00] NVIDIA H100 80GB HBM3
#  Rank 10 Group  0 Pid 132253 on slurm-compute-node-1 device  2 [0x00] NVIDIA H100 80GB HBM3
#  Rank 11 Group  0 Pid 132254 on slurm-compute-node-1 device  3 [0x00] NVIDIA H100 80GB HBM3
#  Rank 12 Group  0 Pid 132255 on slurm-compute-node-1 device  4 [0x00] NVIDIA H100 80GB HBM3
#  Rank 13 Group  0 Pid 132256 on slurm-compute-node-1 device  5 [0x00] NVIDIA H100 80GB HBM3
#  Rank 14 Group  0 Pid 132257 on slurm-compute-node-1 device  6 [0x00] NVIDIA H100 80GB HBM3
#  Rank 15 Group  0 Pid 132258 on slurm-compute-node-1 device  7 [0x00] NVIDIA H100 80GB HBM3
#
#                                                              out-of-place                       in-place        
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       

           8             2     float     sum      -1    55.86    0.00    0.00      0    24.21    0.00    0.00      0
          16             4     float     sum      -1    24.31    0.00    0.00      0    24.26    0.00    0.00      0
          32             8     float     sum      -1    24.58    0.00    0.00      0    26.04    0.00    0.00      0
          64            16     float     sum      -1    25.00    0.00    0.00      0    24.88    0.00    0.00      0
         128            32     float     sum      -1    25.23    0.01    0.01      0    25.12    0.01    0.01      0
         256            64     float     sum      -1    33.39    0.01    0.01      0    25.71    0.01    0.02      0
         512           128     float     sum      -1    27.83    0.02    0.03      0    25.75    0.02    0.04      0
        1024           256     float     sum      -1    26.41    0.04    0.07      0    26.25    0.04    0.07      0
        2048           512     float     sum      -1    27.57    0.07    0.14      0    27.51    0.07    0.14      0
        4096          1024     float     sum      -1    29.41    0.14    0.26      0    29.21    0.14    0.26      0
        8192          2048     float     sum      -1    30.29    0.27    0.51      0    29.88    0.27    0.51      0
       16384          4096     float     sum      -1    31.62    0.52    0.97      0    30.95    0.53    0.99      0
       32768          8192     float     sum      -1    32.69    1.00    1.88      0    32.14    1.02    1.91      0
       65536         16384     float     sum      -1    33.06    1.98    3.72      0    32.55    2.01    3.78      0
      131072         32768     float     sum      -1    38.15    3.44    6.44      0    37.38    3.51    6.57      0
      262144         65536     float     sum      -1    46.09    5.69   10.66      0    45.92    5.71   10.70      0
      524288        131072     float     sum      -1    52.46    9.99   18.74      0    52.46    9.99   18.74      0
     1048576        262144     float     sum      -1    107.8    9.72   18.23      0    74.05   14.16   26.55      0
     2097152        524288     float     sum      -1    79.29   26.45   49.59      0    78.91   26.57   49.83      0
     4194304       1048576     float     sum      -1    112.4   37.31   69.95      0    97.04   43.22   81.04      0
     8388608       2097152     float     sum      -1    136.9   61.27  114.88      0    136.6   61.43  115.18      0
    16777216       4194304     float     sum      -1    185.5   90.44  169.58      0    184.7   90.82  170.28      0
    33554432       8388608     float     sum      -1    266.5  125.91  236.08      0    266.2  126.04  236.33      0
    67108864      16777216     float     sum      -1    478.0  140.39  263.23      0    475.8  141.05  264.47      0
   134217728      33554432     float     sum      -1    752.2  178.44  334.57      0    750.3  178.88  335.39      0
   268435456      67108864     float     sum      -1   1290.6  207.99  389.98      0   1291.7  207.81  389.64      0
   536870912     134217728     float     sum      -1   2371.1  226.42  424.54      0   2378.2  225.75  423.28      0
  1073741824     268435456     float     sum      -1   4511.2  238.02  446.28      0   4508.2  238.17  446.57      0
  2147483648     536870912     float     sum      -1   8753.1  245.34  460.01      0   8742.8  245.63  460.55      0

# Out of bounds values : 0 OK
# Avg bus bandwidth    : 104.539
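
To read these numbers, it helps to know how the two bandwidth columns relate. As documented for nccl-tests, algbw is the message size divided by the elapsed time, and for an all-reduce busbw is algbw multiplied by 2(n-1)/n, where n is the number of ranks (16 here). The sketch below recomputes both from the size and time columns of the last two rows; the function names are hypothetical, and the parser assumes the 13-column row layout shown above.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for an all-reduce."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

def parse_rows(output):
    """Yield (size_bytes, out-of-place time_us, out-of-place busbw) per data row."""
    for line in output.splitlines():
        fields = line.split()
        # Data rows have 13 columns and start with the integer message size.
        if len(fields) == 13 and fields[0].isdigit():
            yield int(fields[0]), float(fields[5]), float(fields[7])

# The last two data rows and the summary line from the output above.
sample = """\
   1073741824     268435456     float     sum      -1   4511.2  238.02  446.28      0   4508.2  238.17  446.57      0
   2147483648     536870912     float     sum      -1   8753.1  245.34  460.01      0   8742.8  245.63  460.55      0
# Avg bus bandwidth    : 104.539
"""

for size, time_us, reported in parse_rows(sample):
    algbw, busbw = allreduce_bandwidth(size, time_us, n_ranks=16)
    print(f"{size:>11} B  algbw {algbw:7.2f}  busbw {busbw:7.2f}  (reported {reported:.2f})")
```

The average bus bandwidth in the summary line is taken across all message sizes, so the many small-message rows pull it well below the large-message figures. When judging InfiniBand throughput, the busbw values for the largest sizes are likely the more useful ones to compare.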

 

Additional Resources

Nvidia Environment Variables

GitHub SLURM

Getting Started with Terraform
