Introduction
When InfiniBand (IB)-enabled GPU nodes—such as H100, H200, or A100 SXM—are launched within a Crusoe Managed Kubernetes (CMK) node pool, you may need to verify that the InfiniBand networking is delivering the performance you expect. A straightforward and reliable way to do this is by running an NVIDIA Collective Communications Library (NCCL) All-Reduce test. This benchmark is designed to exercise the high-speed interconnects between nodes, helping you confirm that the IB fabric is operating correctly and ready to support your distributed workloads at the expected efficiency.
Prerequisites
Before starting, ensure you have the following:
- Access to a Crusoe Cloud project with appropriate permissions
- Kubeconfig to access your CMK cluster
- Access to Crusoe CLI or Console
- A CMK cluster running image version 1.30.8-cmk.23 or later
- The NVIDIA GPU Operator and Network Operator, either CMK-enabled or manually installed
Step-by-Step Instructions
Once at least two InfiniBand-capable CMK nodes are running, you can use NCCL to validate the performance of the IB networking stack. To do so, follow these steps:
1. Deploy the Kubeflow MPI Operator. The operator makes it easy to run allreduce-style distributed training on Kubernetes. Use the following command to deploy it:
Note: Feel free to use a later version if needed. For more information, see the Kubeflow MPI Operator documentation.
# kubectl apply --server-side -f https://raw.githubusercontent.com/kubeflow/mpi-operator/v0.6.0/deploy/v2beta1/mpi-operator.yaml
2. Run a sample all-reduce NCCL test with the following config:
# kubectl create namespace nccl-test
# kubectl -n nccl-test create -f nccl-test-crusoe.yaml
----
# cat nccl-test-crusoe.yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nccl-tests-gdr-16
spec:
  slotsPerWorker: 8
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          restartPolicy: OnFailure
          initContainers:
            - image: crusoecloud/nccl-tests:h100-23.10-py3
              imagePullPolicy: Always
              name: init
              command: ["sh", "-c", "sleep 5"]
          volumes:
            - name: nccl-topo
              hostPath:
                path: /etc/crusoe/nccl_topo
                type: Directory
          containers:
            - image: crusoecloud/nccl-tests:h100-23.10-py3
              imagePullPolicy: Always
              name: nccl-test-launcher
              volumeMounts:
                - name: nccl-topo
                  mountPath: /opt/nccl_topo
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              env:
                - name: NCCL_TOPO_FILE
                  value: /opt/nccl_topo/h100-80gb-sxm-ib-cloud-hypervisor.xml
                - name: UCX_RNDV_SCHEME
                  value: "get_zcopy" # UCX memory setting
                - name: UCX_TLS
                  value: "self,sm,cuda_copy" # UCX memory setting
              command:
                - /opt/hpcx/ompi/bin/mpirun
                - --allow-run-as-root
                - --tag-output
                - -np
                - "16" # Total number of processes (8 GPUs per node, 2 nodes = 16 total if using 2 Worker replicas)
                - -bind-to
                - none
                - -map-by
                - slot
                - -mca
                - coll_hcoll_enable
                - "0"
                - -x
                - NCCL_IB_PCI_RELAXED_ORDERING=1
                - -x
                - NCCL_IB_SPLIT_DATA_ON_QPS=0
                - -x
                - NCCL_IB_QPS_PER_CONNECTION=2
                - -x
                - NCCL_IB_MERGE_VFS=0
                - -x
                - NCCL_IB_HCA=^mlx5_0:1
                - -x
                - NCCL_IBEXT_DISABLE=1
                - -x
                - NCCL_TOPO_FILE
                - -x
                - PATH
                - -x
                - LD_LIBRARY_PATH
                - -x
                - NCCL_DEBUG=TRACE
                - -x
                - NCCL_ALGO=NVLSTree
                - /opt/nccl-tests/build/all_reduce_perf
                - -b
                - "8"
                - -e
                - "2G"
                - -f
                - "2"
                - -t
                - "1"
                - -g
                - "1"
                - -c
                - "1"
                - -n
                - "100"
    Worker:
      replicas: 2 # Specify how many worker nodes you have running in the Instances tab
      template:
        spec:
          restartPolicy: OnFailure
          runtimeClassName: nvidia
          volumes:
            - name: dshm
              emptyDir:
                medium: Memory
                sizeLimit: 64Gi
            - name: nccl-topo
              hostPath:
                path: /etc/crusoe/nccl_topo
                type: Directory
          containers:
            - image: crusoecloud/nccl-tests:h100-23.10-py3
              imagePullPolicy: Always
              name: nccl-worker
              securityContext:
                capabilities:
                  add: ["IPC_LOCK"]
              env:
                - name: NCCL_DEBUG
                  value: TRACE
                - name: UCX_RNDV_SCHEME
                  value: "get_zcopy" # UCX memory setting
                - name: UCX_TLS
                  value: "self,sm,cuda_copy" # UCX memory setting
              volumeMounts:
                - mountPath: /dev/shm
                  name: dshm
                - name: nccl-topo
                  mountPath: /opt/nccl_topo
              resources:
                limits:
                  nvidia.com/gpu: 8 # 8 GPUs per node
                  nvidia.com/hostdev: 8
                  memory: 128000Mi
                requests:
                  nvidia.com/gpu: 8 # 8 GPUs per node
                  nvidia.com/hostdev: 8
                  memory: 128000Mi
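Before submitting the job, it is worth sanity-checking the rank math in the manifest: the `-np` value passed to `mpirun` should equal `slotsPerWorker` multiplied by the number of `Worker` replicas, and a mismatch can leave ranks unscheduled. A minimal sketch of that check (the variable names are illustrative, with values taken from the manifest above):

```python
# Sanity-check the MPIJob rank math: -np should equal
# slotsPerWorker x Worker replicas (values from the manifest above).
slots_per_worker = 8   # spec.slotsPerWorker (one slot per GPU)
worker_replicas = 2    # spec.mpiReplicaSpecs.Worker.replicas
np_arg = 16            # value passed to mpirun via -np

total_slots = slots_per_worker * worker_replicas
assert np_arg == total_slots, (
    f"-np {np_arg} != {slots_per_worker} slots x "
    f"{worker_replicas} workers = {total_slots}"
)
print(f"OK: {np_arg} ranks across {worker_replicas} nodes")
```

If you scale the node pool (for example, to 4 workers), update both `Worker.replicas` and the `-np` argument together.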
3. Validate that the bus bandwidth reaches the expected GB/s performance by querying the nccl-tests-gdr-16-launcher pod.
# kubectl get pods | grep nccl-tests-gdr-16-launcher
# kubectl logs nccl-tests-gdr-16-launcher-<>
Example test results:
[1,0]<stdout>:# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
[1,0]<stdout>:# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
[1,3]<stdout>:nccl-tests-gdr-16-worker-0:51:108 [3] NCCL INFO comm 0x555f1bf42e00 rank 3 nranks 16 cudaDev 3 nvmlDev 3 busId 200040 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,4]<stdout>:nccl-tests-gdr-16-worker-0:52:110 [4] NCCL INFO comm 0x564b4f2908b0 rank 4 nranks 16 cudaDev 4 nvmlDev 4 busId 300010 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,2]<stdout>:nccl-tests-gdr-16-worker-0:50:111 [2] NCCL INFO comm 0x558c2133d1b0 rank 2 nranks 16 cudaDev 2 nvmlDev 2 busId 200030 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,1]<stdout>:nccl-tests-gdr-16-worker-0:49:109 [1] NCCL INFO comm 0x563da3d5d560 rank 1 nranks 16 cudaDev 1 nvmlDev 1 busId 200020 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,9]<stdout>:nccl-tests-gdr-16-worker-1:43:105 [1] NCCL INFO comm 0x559a9f467260 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 200020 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,13]<stdout>:nccl-tests-gdr-16-worker-1:47:102 [5] NCCL INFO comm 0x55f98734eed0 rank 13 nranks 16 cudaDev 5 nvmlDev 5 busId 300020 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,15]<stdout>:nccl-tests-gdr-16-worker-1:49:103 [7] NCCL INFO comm 0x55bde359eca0 rank 15 nranks 16 cudaDev 7 nvmlDev 7 busId 300040 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,11]<stdout>:nccl-tests-gdr-16-worker-1:45:98 [3] NCCL INFO comm 0x558d6cc9b890 rank 11 nranks 16 cudaDev 3 nvmlDev 3 busId 200040 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,5]<stdout>:nccl-tests-gdr-16-worker-0:53:107 [5] NCCL INFO comm 0x562d74f516d0 rank 5 nranks 16 cudaDev 5 nvmlDev 5 busId 300020 commId 0x2ec93f92e7198bc6 - Init COMPLETE
[1,9]<stdout>:nccl-tests-gdr-16-worker-1:43:178 [1] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,13]<stdout>:nccl-tests-gdr-16-worker-1:47:184 [5] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,4]<stdout>:nccl-tests-gdr-16-worker-0:52:191 [4] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,15]<stdout>:nccl-tests-gdr-16-worker-1:49:185 [7] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,2]<stdout>:nccl-tests-gdr-16-worker-0:50:186 [2] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,5]<stdout>:nccl-tests-gdr-16-worker-0:53:190 [5] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,1]<stdout>:nccl-tests-gdr-16-worker-0:49:187 [1] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,6]<stdout>:nccl-tests-gdr-16-worker-0:54:188 [6] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,3]<stdout>:nccl-tests-gdr-16-worker-0:51:192 [3] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,7]<stdout>:nccl-tests-gdr-16-worker-0:56:189 [7] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,12]<stdout>:nccl-tests-gdr-16-worker-1:46:181 [4] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,0]<stdout>:nccl-tests-gdr-16-worker-0:48:185 [0] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,10]<stdout>:nccl-tests-gdr-16-worker-1:44:182 [2] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,8]<stdout>:nccl-tests-gdr-16-worker-1:42:179 [0] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,11]<stdout>:nccl-tests-gdr-16-worker-1:45:180 [3] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,14]<stdout>:nccl-tests-gdr-16-worker-1:48:183 [6] NCCL INFO NCCL_IB_SPLIT_DATA_ON_QPS set by environment to 0.
[1,0]<stdout>: 8 2 float sum -1 67.82 0.00 0.00 0 26.24 0.00 0.00 0
[1,0]<stdout>: 16 4 float sum -1 26.27 0.00 0.00 0 25.90 0.00 0.00 0
[1,0]<stdout>: 32 8 float sum -1 27.35 0.00 0.00 0 26.39 0.00 0.00 0
[1,0]<stdout>: 64 16 float sum -1 26.39 0.00 0.00 0 26.82 0.00 0.00 0
[1,0]<stdout>: 128 32 float sum -1 27.47 0.00 0.01 0 26.65 0.00 0.01 0
[1,0]<stdout>: 256 64 float sum -1 36.02 0.01 0.01 0 26.86 0.01 0.02 0
[1,0]<stdout>: 512 128 float sum -1 28.87 0.02 0.03 0 27.80 0.02 0.03 0
[1,0]<stdout>: 1024 256 float sum -1 30.12 0.03 0.06 0 29.10 0.04 0.07 0
[1,0]<stdout>: 2048 512 float sum -1 30.99 0.07 0.12 0 30.45 0.07 0.13 0
[1,0]<stdout>: 4096 1024 float sum -1 31.80 0.13 0.24 0 30.97 0.13 0.25 0
[1,0]<stdout>: 8192 2048 float sum -1 32.62 0.25 0.47 0 31.22 0.26 0.49 0
[1,0]<stdout>: 16384 4096 float sum -1 34.38 0.48 0.89 0 32.88 0.50 0.93 0
[1,0]<stdout>: 32768 8192 float sum -1 34.61 0.95 1.78 0 33.77 0.97 1.82 0
[1,0]<stdout>: 65536 16384 float sum -1 40.47 1.62 3.04 0 39.91 1.64 3.08 0
[1,0]<stdout>: 131072 32768 float sum -1 61.54 2.13 3.99 0 43.56 3.01 5.64 0
[1,0]<stdout>: 262144 65536 float sum -1 51.92 5.05 9.47 0 53.20 4.93 9.24 0
[1,0]<stdout>: 524288 131072 float sum -1 144.6 3.63 6.80 0 85.67 6.12 11.47 0
[1,0]<stdout>: 1048576 262144 float sum -1 101.0 10.38 19.47 0 115.9 9.05 16.96 0
[1,0]<stdout>: 2097152 524288 float sum -1 153.4 13.67 25.63 0 143.4 14.62 27.41 0
[1,0]<stdout>: 4194304 1048576 float sum -1 123.5 33.97 63.70 0 134.3 31.23 58.56 0
[1,0]<stdout>: 8388608 2097152 float sum -1 153.0 54.81 102.77 0 168.7 49.72 93.22 0
[1,0]<stdout>: 16777216 4194304 float sum -1 223.7 75.01 140.65 0 221.1 75.89 142.30 0
[1,0]<stdout>: 33554432 8388608 float sum -1 285.0 117.74 220.76 0 263.5 127.34 238.76 0
[1,0]<stdout>: 67108864 16777216 float sum -1 499.8 134.27 251.76 0 496.5 135.15 253.41 0
[1,0]<stdout>: 134217728 33554432 float sum -1 746.4 179.83 337.18 0 769.7 174.38 326.97 0
[1,0]<stdout>: 268435456 67108864 float sum -1 1375.1 195.22 366.03 0 1325.7 202.49 379.67 0
[1,0]<stdout>: 536870912 134217728 float sum -1 2516.9 213.31 399.95 0 2502.0 214.58 402.33 0
[1,0]<stdout>: 1073741824 268435456 float sum -1 4613.9 232.72 436.35 0 4612.4 232.80 436.49 0
[1,0]<stdout>: 2147483648 536870912 float sum -1 9283.6 231.32 433.73 0 9237.9 232.46 435.87 0
[1,0]<stdout>: 4294967296 1073741824 float sum -1 17581 244.30 458.06 0 17627 243.66 456.86 0
[1,0]<stdout>: 8589934592 2147483648 float sum -1 34165 251.42 471.42 0 34190 251.24 471.07 0
[1,0]<stdout>: 17179869184 4294967296 float sum -1 67205 255.63 479.31 0 67185 255.71 479.45 0
[1,5]<stdout>:nccl-tests-gdr-16-worker-0:53:53 [5] NCCL INFO comm 0x562d74f516d0 rank 5 nranks 16 cudaDev 5 busId 300020 - Destroy COMPLETE
[1,11]<stdout>:nccl-tests-gdr-16-worker-1:45:45 [3] NCCL INFO comm 0x558d6cc9b890 rank 11 nranks 16 cudaDev 3 busId 200040 - Destroy COMPLETE
[1,0]<stdout>:nccl-tests-gdr-16-worker-0:48:48 [0] NCCL INFO comm 0x5576bbc25120 rank 0 nranks 16 cudaDev 0 busId 200010 - Destroy COMPLETE
[1,2]<stdout>:nccl-tests-gdr-16-worker-0:50:50 [2] NCCL INFO comm 0x558c2133d1b0 rank 2 nranks 16 cudaDev 2 busId 200030 - Destroy COMPLETE
[1,13]<stdout>:nccl-tests-gdr-16-worker-1:47:47 [5] NCCL INFO comm 0x55f98734eed0 rank 13 nranks 16 cudaDev 5 busId 300020 - Destroy COMPLETE
[1,10]<stdout>:nccl-tests-gdr-16-worker-1:44:44 [2] NCCL INFO comm 0x55da0cfaeb00 rank 10 nranks 16 cudaDev 2 busId 200030 - Destroy COMPLETE
[1,0]<stdout>:# Out of bounds values : 0 OK
[1,0]<stdout>:# Avg bus bandwidth : 132.597
[1,0]<stdout>:#
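If you want to script this validation, note that for an all-reduce the nccl-tests `busbw` column is derived from `algbw` by the factor 2(n-1)/n, where n is the number of ranks (this formula comes from the nccl-tests performance documentation). A small sketch that reproduces the largest-message row from the sample output above:

```python
# For all_reduce_perf, bus bandwidth is derived from algorithmic bandwidth:
#   busbw = algbw * 2 * (n - 1) / n, where n is the number of ranks.
def expected_busbw(algbw_gbps: float, nranks: int) -> float:
    return algbw_gbps * 2 * (nranks - 1) / nranks

# Largest-message row from the sample output above: algbw 255.63 GB/s, 16 ranks.
busbw = expected_busbw(255.63, 16)
print(f"{busbw:.2f} GB/s")  # 479.31 GB/s, matching the busbw column in the log
```

A simple automated check can then compare the `Avg bus bandwidth` line from the launcher log against the threshold you expect for your node type and fabric.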