Background
To generate load on InfiniBand and measure its bandwidth, you can run an NCCL test using our SLURM solution.
Prerequisites
- Clone our SLURM solution from GitHub
- Install Terraform
Step by Step
1. Create a file named terraform.tfvars and fill in the required variables. To add more SLURM users, add an entry with each user's SSH public key to slurm_users.
# common configuration
# fill in data
location = "us-east1-a" #us-southcentral1-a
project_id = " "
ssh_public_key_path = "~/.ssh/id_ed25519.pub"
# crusoe networking vpc-subnets list
vpc_subnet_id = "80cf6356-da22-42c8-8925-d5fb1e3e55ad"
# slurm-compute-node configuration
slurm_compute_node_type = "h100-80gb-sxm-ib.8x" # "a100-80gb-sxm-ib.8x"
slurm_compute_node_ib_network_id = " "
slurm_compute_node_count = 2
# slurm users configuration
slurm_users = [{
name = "user1"
uid = 1001
ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ user1@crusoe.ai"
}, {
name = "user2"
uid = 1002
ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIjPRr0iVR4mgzJy0ehnM5hWX4O86hM1bVTgdi5g3nkZ user2@crusoe.ai"
}]
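Before applying, it can help to sanity-check the user entries. The sketch below is a minimal illustration, not part of the SLURM solution: it assumes that user names and UIDs should be unique (as is usual for Linux accounts) and that each ssh_pubkey has at least a key type and key body. The find_user_problems helper and the Python copy of slurm_users are hypothetical, and the key bodies are placeholders.

```python
from collections import Counter

# Hypothetical Python mirror of the slurm_users list in terraform.tfvars.
# The key bodies are placeholders, not real keys.
slurm_users = [
    {"name": "user1", "uid": 1001, "ssh_pubkey": "ssh-ed25519 PLACEHOLDERKEY1 user1@crusoe.ai"},
    {"name": "user2", "uid": 1002, "ssh_pubkey": "ssh-ed25519 PLACEHOLDERKEY2 user2@crusoe.ai"},
]


def find_user_problems(users):
    """Return a list of human-readable problems found in the user entries."""
    problems = []
    # Assumption: names and UIDs must not repeat across entries.
    for field in ("name", "uid"):
        counts = Counter(user[field] for user in users)
        for value, count in counts.items():
            if count > 1:
                problems.append(f"duplicate {field}: {value}")
    # Assumption: a public key line has at least "<type> <base64 body>".
    for user in users:
        if len(user["ssh_pubkey"].split()) < 2:
            problems.append(f"{user['name']}: ssh_pubkey looks incomplete")
    return problems


if __name__ == "__main__":
    for problem in find_user_problems(slurm_users):
        print(problem)
```

An empty result means the entries pass these basic checks; it does not guarantee that the deployment will accept them.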
2. After filling in the variables, run the following commands to initialize Terraform and apply the changes:
terraform init
terraform plan
terraform apply
3. The deployment takes some time to complete. Once it finishes, SSH into the login node.
4. From the login node, you can run NCCL tests to validate InfiniBand performance.
Make sure the following settings are in your /etc/nccl.conf file:
NCCL_SOCKET_NTHREADS=4
NCCL_NSOCKS_PERTHREAD=8
NCCL_TOPO_FILE=/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
NCCL_IB_MERGE_VFS=0
NCCL_IB_HCA=^mlx5_0:1
# NCCL_DEBUG=WARN # for more verbose logging
NCCL_ALGO=NVLSTree
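One way to confirm these settings is to parse the file and compare it with the expected values. The following is a rough sketch under a few assumptions: the file uses plain KEY=VALUE lines, lines starting with # are comments, and the expected values are exactly those listed above. The function names parse_nccl_conf and missing_or_different are hypothetical.

```python
from pathlib import Path

# Expected settings, copied from the list above.
EXPECTED = {
    "NCCL_SOCKET_NTHREADS": "4",
    "NCCL_NSOCKS_PERTHREAD": "8",
    "NCCL_TOPO_FILE": "/etc/crusoe/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml",
    "NCCL_IB_MERGE_VFS": "0",
    "NCCL_IB_HCA": "^mlx5_0:1",
    "NCCL_ALGO": "NVLSTree",
}


def parse_nccl_conf(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def missing_or_different(text, expected=EXPECTED):
    """Map each expected key whose value is absent or different to the actual value (None if absent)."""
    actual = parse_nccl_conf(text)
    return {key: actual.get(key) for key, value in expected.items() if actual.get(key) != value}


if __name__ == "__main__":
    conf_path = Path("/etc/nccl.conf")
    if conf_path.exists():
        print(missing_or_different(conf_path.read_text()))
```

An empty dictionary means every expected setting is present with the listed value. The commented-out NCCL_DEBUG line is ignored by this check.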
5. Then run an NCCL test across your compute nodes:
srun -N 2 --ntasks-per-node=8 --gres=gpu:8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
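-N 2 sets the number of compute nodes to use; set it to the number of compute nodes in your SLURM cluster.
Below is sample output for a 2-node test. As its header shows, this run covered message sizes from 8 bytes to 2 GiB (minBytes 8, maxBytes 2147483648), a wider range than the -b 1M -e 1G in the command above.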
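To adapt the command to a different cluster size, the arithmetic can be sketched as follows. This is an illustrative helper, not part of the SLURM solution: build_srun_command and total_ranks are hypothetical names, and the defaults (8 GPUs per node, 22 CPUs per task) are simply the values from the command above. With one task per GPU and -g 1 (one GPU per task), the number of NCCL ranks is nodes times tasks per node.

```python
import shlex


def total_ranks(nodes, tasks_per_node=8, gpus_per_task=1):
    """Number of NCCL ranks: one rank per GPU across all nodes."""
    return nodes * tasks_per_node * gpus_per_task


def build_srun_command(nodes, gpus_per_node=8, cpus_per_task=22):
    """Rebuild the srun command from the step above for a given node count."""
    args = [
        "srun",
        "-N", str(nodes),                       # number of compute nodes
        f"--ntasks-per-node={gpus_per_node}",   # one task per GPU
        f"--gres=gpu:{gpus_per_node}",          # GPUs requested per node
        f"--cpus-per-task={cpus_per_task}",
        "--mpi=pmix",
        "/opt/nccl-tests/build/all_reduce_perf",
        "-b", "1M",   # smallest message size
        "-e", "1G",   # largest message size
        "-f", "2",    # multiply the size by 2 at each step
        "-g", "1",    # GPUs per task
    ]
    return shlex.join(args)


if __name__ == "__main__":
    print(build_srun_command(2))
    print(total_ranks(2))
```

For two nodes this reproduces the command above and gives 16 ranks.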
# nThread 1 nGpus 1 minBytes 8 maxBytes 2147483648 step: 2(factor) warmup iters: 5 iters: 100 agg iters: 1 validation: 1 graph: 0
#
# Using devices
# Rank 0 Group 0 Pid 133881 on slurm-compute-node-0 device 0 [0x00] NVIDIA H100 80GB HBM3
# Rank 1 Group 0 Pid 133882 on slurm-compute-node-0 device 1 [0x00] NVIDIA H100 80GB HBM3
# Rank 2 Group 0 Pid 133883 on slurm-compute-node-0 device 2 [0x00] NVIDIA H100 80GB HBM3
# Rank 3 Group 0 Pid 133884 on slurm-compute-node-0 device 3 [0x00] NVIDIA H100 80GB HBM3
# Rank 4 Group 0 Pid 133885 on slurm-compute-node-0 device 4 [0x00] NVIDIA H100 80GB HBM3
# Rank 5 Group 0 Pid 133886 on slurm-compute-node-0 device 5 [0x00] NVIDIA H100 80GB HBM3
# Rank 6 Group 0 Pid 133887 on slurm-compute-node-0 device 6 [0x00] NVIDIA H100 80GB HBM3
# Rank 7 Group 0 Pid 133888 on slurm-compute-node-0 device 7 [0x00] NVIDIA H100 80GB HBM3
# Rank 8 Group 0 Pid 132251 on slurm-compute-node-1 device 0 [0x00] NVIDIA H100 80GB HBM3
# Rank 9 Group 0 Pid 132252 on slurm-compute-node-1 device 1 [0x00] NVIDIA H100 80GB HBM3
# Rank 10 Group 0 Pid 132253 on slurm-compute-node-1 device 2 [0x00] NVIDIA H100 80GB HBM3
# Rank 11 Group 0 Pid 132254 on slurm-compute-node-1 device 3 [0x00] NVIDIA H100 80GB HBM3
# Rank 12 Group 0 Pid 132255 on slurm-compute-node-1 device 4 [0x00] NVIDIA H100 80GB HBM3
# Rank 13 Group 0 Pid 132256 on slurm-compute-node-1 device 5 [0x00] NVIDIA H100 80GB HBM3
# Rank 14 Group 0 Pid 132257 on slurm-compute-node-1 device 6 [0x00] NVIDIA H100 80GB HBM3
# Rank 15 Group 0 Pid 132258 on slurm-compute-node-1 device 7 [0x00] NVIDIA H100 80GB HBM3
#
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
8 2 float sum -1 55.86 0.00 0.00 0 24.21 0.00 0.00 0
16 4 float sum -1 24.31 0.00 0.00 0 24.26 0.00 0.00 0
32 8 float sum -1 24.58 0.00 0.00 0 26.04 0.00 0.00 0
64 16 float sum -1 25.00 0.00 0.00 0 24.88 0.00 0.00 0
128 32 float sum -1 25.23 0.01 0.01 0 25.12 0.01 0.01 0
256 64 float sum -1 33.39 0.01 0.01 0 25.71 0.01 0.02 0
512 128 float sum -1 27.83 0.02 0.03 0 25.75 0.02 0.04 0
1024 256 float sum -1 26.41 0.04 0.07 0 26.25 0.04 0.07 0
2048 512 float sum -1 27.57 0.07 0.14 0 27.51 0.07 0.14 0
4096 1024 float sum -1 29.41 0.14 0.26 0 29.21 0.14 0.26 0
8192 2048 float sum -1 30.29 0.27 0.51 0 29.88 0.27 0.51 0
16384 4096 float sum -1 31.62 0.52 0.97 0 30.95 0.53 0.99 0
32768 8192 float sum -1 32.69 1.00 1.88 0 32.14 1.02 1.91 0
65536 16384 float sum -1 33.06 1.98 3.72 0 32.55 2.01 3.78 0
131072 32768 float sum -1 38.15 3.44 6.44 0 37.38 3.51 6.57 0
262144 65536 float sum -1 46.09 5.69 10.66 0 45.92 5.71 10.70 0
524288 131072 float sum -1 52.46 9.99 18.74 0 52.46 9.99 18.74 0
1048576 262144 float sum -1 107.8 9.72 18.23 0 74.05 14.16 26.55 0
2097152 524288 float sum -1 79.29 26.45 49.59 0 78.91 26.57 49.83 0
4194304 1048576 float sum -1 112.4 37.31 69.95 0 97.04 43.22 81.04 0
8388608 2097152 float sum -1 136.9 61.27 114.88 0 136.6 61.43 115.18 0
16777216 4194304 float sum -1 185.5 90.44 169.58 0 184.7 90.82 170.28 0
33554432 8388608 float sum -1 266.5 125.91 236.08 0 266.2 126.04 236.33 0
67108864 16777216 float sum -1 478.0 140.39 263.23 0 475.8 141.05 264.47 0
134217728 33554432 float sum -1 752.2 178.44 334.57 0 750.3 178.88 335.39 0
268435456 67108864 float sum -1 1290.6 207.99 389.98 0 1291.7 207.81 389.64 0
536870912 134217728 float sum -1 2371.1 226.42 424.54 0 2378.2 225.75 423.28 0
1073741824 268435456 float sum -1 4511.2 238.02 446.28 0 4508.2 238.17 446.57 0
2147483648 536870912 float sum -1 8753.1 245.34 460.01 0 8742.8 245.63 460.55 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 104.539
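The average bus bandwidth on the last line is taken over every message size, including the small, latency-bound ones, so the figures for the largest sizes are usually more informative about InfiniBand throughput. The sketch below shows one way to pull the peak bus bandwidth out of a saved copy of the output and to check it against the all-reduce relation used by nccl-tests, busbw = algbw * 2 * (n - 1) / n for n ranks. The helper names are hypothetical, and the parser assumes the 13-column row layout shown above.

```python
# Two rows copied from the sample output above, used as demo input.
SAMPLE_OUTPUT = """\
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
1073741824 268435456 float sum -1 4511.2 238.02 446.28 0 4508.2 238.17 446.57 0
2147483648 536870912 float sum -1 8753.1 245.34 460.01 0 8742.8 245.63 460.55 0
# Out of bounds values : 0 OK
"""


def parse_rows(output):
    """Extract size and bandwidth columns from all_reduce_perf result rows."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Result rows have 13 columns; header and summary lines start with '#'.
        if len(fields) != 13 or line.lstrip().startswith("#"):
            continue
        rows.append({
            "size": int(fields[0]),
            "out_of_place_algbw": float(fields[6]),
            "out_of_place_busbw": float(fields[7]),
            "in_place_algbw": float(fields[10]),
            "in_place_busbw": float(fields[11]),
        })
    return rows


def peak_busbw(rows):
    """Highest bus bandwidth (GB/s) across out-of-place and in-place runs."""
    return max(max(r["out_of_place_busbw"], r["in_place_busbw"]) for r in rows)


def allreduce_busbw(algbw, ranks):
    """Bus bandwidth implied by an all-reduce algorithm bandwidth on `ranks` GPUs."""
    return algbw * 2 * (ranks - 1) / ranks


if __name__ == "__main__":
    rows = parse_rows(SAMPLE_OUTPUT)
    print("peak busbw (GB/s):", peak_busbw(rows))
    print("busbw from algbw (GB/s):", allreduce_busbw(rows[-1]["out_of_place_algbw"], 16))
```

With 16 ranks the factor is 1.875, so the 245.34 GB/s algorithm bandwidth in the last row corresponds to roughly 460 GB/s of bus bandwidth, in line with the reported value.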