Last Updated: June 3, 2026
Overview
After provisioning InfiniBand (IB)-enabled GPU node pools for SKUs such as H100, H200, B200, or GB200 on a Crusoe Managed Kubernetes (CMK) cluster, validate that the InfiniBand network delivers the expected performance. A common approach is to run the NVIDIA Collective Communications Library (NCCL) All-Reduce benchmark. This test exercises the high-speed interconnects between nodes and helps verify that the InfiniBand fabric provides the bandwidth and latency that distributed GPU workloads require.
Prerequisites
- Access to a Crusoe Cloud project with appropriate permissions.
- A Crusoe Managed Kubernetes (CMK) cluster with an InfiniBand-enabled GPU node pool for the target SKU, containing at least two nodes.
- MPI Operator installed on the cluster. The manifests use kubeflow.org/v2beta1 MPIJob.
  kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
- For GB200 SKU: Ensure the NVIDIA GPU Feature Discovery and Dynamic Resource Allocation (DRA) components are enabled, as the GB200 manifest uses resource.nvidia.com/v1beta1 ComputeDomain.
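The MPI Operator prerequisite can be checked before submitting a job. The following is a minimal sketch, assuming `kubectl` is on the PATH and already pointed at the CMK cluster. It relies on the MPI Operator registering a CRD named `mpijobs.kubeflow.org`. The helper names `crd_check_command` and `mpi_operator_installed` are hypothetical and not part of any Crusoe tooling.

```python
import subprocess

# CRD registered by the Kubeflow MPI Operator for the MPIJob kind.
MPIJOB_CRD = "mpijobs.kubeflow.org"


def crd_check_command(crd=MPIJOB_CRD):
    """Build the kubectl command that looks up a CRD by name."""
    return ["kubectl", "get", "crd", crd, "-o", "name"]


def mpi_operator_installed(run=subprocess.run):
    """Return True if kubectl exits successfully when asked for the MPIJob CRD.

    `run` is injectable so the check can be exercised without a cluster.
    """
    result = run(crd_check_command(), capture_output=True, text=True)
    return result.returncode == 0
```

Calling `mpi_operator_installed()` on a workstation with cluster access returns `False` when the CRD is missing, which suggests the `kubectl apply` command above still needs to be run.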
Step-by-Step Instructions
- Run All-Reduce NCCL test
  Navigate to the CMK-NCCL repo, which contains Kubernetes manifests for running NCCL tests on Crusoe Managed Kubernetes (CMK) clusters. Each manifest runs an all_reduce_perf benchmark as an MPIJob and is tuned for a specific Crusoe GPU SKU.
- Validate Bus BW
  Check the logs of the <launcher-pod-name> pod to confirm that the measured bus bandwidth (GB/s) matches the expected performance. Contact Crusoe Cloud Support about any performance variance.
  Sample Results:
 2147483648   536870912  float  sum  -1  7190.2  298.67  595.95  0  7078.8  303.37  605.33  0
 4294967296  1073741824  float  sum  -1   12734  337.27  672.98  0   12711  337.90  674.24  0
 8589934592  2147483648  float  sum  -1   24096  356.48  711.31  0   23492  365.66  729.62  0
17179869184  4294967296  float  sum  -1   45876  374.49  747.24  0   45166  380.37  758.98  0
34359738368  8589934592  float  sum  -1   88928  386.38  770.96  0   88580  387.90  774.00  0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 704.062
# Collective test concluded: all_reduce_perf
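To compare a run against an expected figure without reading the log by eye, the output can be parsed. The sketch below assumes the usual nccl-tests row layout: size, count, type, redop, root, then time, algbw, busbw, and #wrong for the out-of-place pass and again for the in-place pass. The names `parse_nccl_log`, `mean_busbw`, and `meets_threshold` are hypothetical, and the threshold in the usage line is a placeholder, not a Crusoe-published number.

```python
import re

# One data row: size, count, type, redop, root, then
# (time, algbw, busbw, #wrong) for out-of-place and again for in-place.
ROW = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\w+)\s+(\w+)\s+(-?\d+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\S+)"
    r"\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)\s+(\S+)\s*$"
)
AVG = re.compile(r"#\s*Avg bus bandwidth\s*:\s*([\d.]+)")

# Rows copied from the sample results above.
SAMPLE = """\
2147483648 536870912 float sum -1 7190.2 298.67 595.95 0 7078.8 303.37 605.33 0
4294967296 1073741824 float sum -1 12734 337.27 672.98 0 12711 337.90 674.24 0
8589934592 2147483648 float sum -1 24096 356.48 711.31 0 23492 365.66 729.62 0
17179869184 4294967296 float sum -1 45876 374.49 747.24 0 45166 380.37 758.98 0
34359738368 8589934592 float sum -1 88928 386.38 770.96 0 88580 387.90 774.00 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 704.062
# Collective test concluded: all_reduce_perf
"""


def parse_nccl_log(text):
    """Return (rows, reported_avg) extracted from all_reduce_perf output."""
    rows, reported_avg = [], None
    for line in text.splitlines():
        row = ROW.match(line)
        if row:
            rows.append({
                "size_bytes": int(row.group(1)),
                "busbw_out": float(row.group(8)),   # out-of-place busbw, GB/s
                "busbw_in": float(row.group(12)),   # in-place busbw, GB/s
            })
            continue
        avg = AVG.search(line)
        if avg:
            reported_avg = float(avg.group(1))
    return rows, reported_avg


def mean_busbw(rows):
    """Average busbw over both the out-of-place and in-place columns."""
    values = [r["busbw_out"] for r in rows] + [r["busbw_in"] for r in rows]
    return sum(values) / len(values)


def meets_threshold(text, min_gbps):
    """True if the reported average bus bandwidth is at least min_gbps."""
    _, reported_avg = parse_nccl_log(text)
    return reported_avg is not None and reported_avg >= min_gbps


rows, reported = parse_nccl_log(SAMPLE)
# 600.0 is a placeholder threshold; substitute the expected value for your SKU.
print(len(rows), reported, meets_threshold(SAMPLE, 600.0))
```

In practice the text would come from the launcher pod logs rather than the embedded `SAMPLE`. For the sample above, the mean of the ten busbw values agrees with the reported average to within rounding, which is a quick sanity check that the parser picked the right columns.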