Validate InfiniBand Performance with NCCL on a Crusoe Managed Kubernetes (CMK) Cluster

Apeksha Khilari

Last Updated: June 3, 2026

Overview

After provisioning InfiniBand (IB)-enabled GPU node pools for SKUs such as H100, H200, B200, or GB200 on a Crusoe Managed Kubernetes (CMK) cluster, validate that the InfiniBand network delivers the expected performance. One common approach is to run the NVIDIA Collective Communications Library (NCCL) All-Reduce benchmark. This test exercises the high-speed interconnects between nodes and helps verify that the InfiniBand fabric is functioning correctly and provides the bandwidth and latency required for efficient distributed GPU workloads.

Prerequisites

  • Access to a Crusoe Cloud project with appropriate permissions.
  • A Crusoe Managed Kubernetes (CMK) cluster with an InfiniBand-enabled GPU node pool for the target SKU, containing at least two nodes.
  • The MPI Operator installed on the cluster. The manifests use the kubeflow.org/v2beta1 MPIJob resource. To install the operator:

    kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
  • For the GB200 SKU: ensure that the NVIDIA GPU Feature Discovery and Dynamic Resource Allocation (DRA) components are enabled, because the GB200 manifest uses the resource.nvidia.com/v1beta1 ComputeDomain resource.
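The first three prerequisites can be checked from the command line before running the benchmark. The following is a minimal sketch, not an official Crusoe tool: the function name check_prereqs and the label selector argument are hypothetical, and it assumes the MPI Operator registers its custom resource definition under the name mpijobs.kubeflow.org.

```shell
#!/usr/bin/env bash
# Sketch: pre-flight checks before running the NCCL MPIJob.
# Assumption: the caller passes a label selector that matches the
# GPU node pool; the selector itself is specific to your cluster.

check_prereqs() {
  local selector="$1"   # hypothetical node pool label selector
  local count

  # The MPI Operator registers the MPIJob custom resource definition.
  if ! kubectl get crd mpijobs.kubeflow.org >/dev/null 2>&1; then
    echo "MPIJob CRD not found: install the MPI Operator first" >&2
    return 1
  fi

  # Count the nodes matching the selector; the test needs at least two.
  count=$(kubectl get nodes -l "$selector" --no-headers 2>/dev/null | wc -l | tr -d ' ')
  if [ "$count" -lt 2 ]; then
    echo "found $count matching node(s); at least 2 are required" >&2
    return 1
  fi
  echo "prerequisites look OK ($count nodes)"
}
```

Call it as check_prereqs "<label-selector>", replacing the placeholder with a selector that matches your GPU node pool. The sketch does not cover the GB200-specific components.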

Step-by-Step Instructions

  1. Run the NCCL All-Reduce test

    Navigate to the CMK-NCCL repo, which contains Kubernetes manifests for running NCCL tests on Crusoe Managed Kubernetes (CMK) clusters. Each manifest runs an all_reduce_perf benchmark as an MPIJob and is tuned for a specific Crusoe GPU SKU. Apply the manifest that matches your SKU.

  2. Validate bus bandwidth

    Check the logs of the <launcher-pod-name> pod and confirm that the measured bus bandwidth (in GB/s) matches the performance expected for your SKU. Contact Crusoe Cloud Support if the measured performance deviates from what you expect.

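    Sample Results (columns: size in bytes, element count, data type, reduction op, root, followed by time in µs, algorithm bandwidth in GB/s, bus bandwidth in GB/s, and error count, first for the out-of-place run and then for the in-place run):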

     2147483648     536870912     float     sum      -1   7190.2  298.67  595.95      0   7078.8  303.37  605.33      0
     4294967296    1073741824     float     sum      -1    12734  337.27  672.98      0    12711  337.90  674.24      0
     8589934592    2147483648     float     sum      -1    24096  356.48  711.31      0    23492  365.66  729.62      0
    17179869184    4294967296     float     sum      -1    45876  374.49  747.24      0    45166  380.37  758.98      0
    34359738368    8589934592     float     sum      -1    88928  386.38  770.96      0    88580  387.90  774.00      0
    
    # Out of bounds values : 0 OK
    # Avg bus bandwidth    : 704.062 
    # Collective test concluded: all_reduce_perf
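The bandwidth check in step 2 can be scripted by reading the "Avg bus bandwidth" summary line from the launcher logs. The following is a minimal sketch: the function names avg_busbw and check_busbw are hypothetical, and the minimum threshold is a value you supply yourself, since this article does not state expected figures per SKU.

```shell
#!/usr/bin/env bash
# Sketch: extract the average bus bandwidth from all_reduce_perf output
# on stdin and compare it with a caller-supplied minimum (GB/s).

# Print the value from the "# Avg bus bandwidth    : <value>" summary line.
avg_busbw() {
  awk -F':' '/^[[:space:]]*# Avg bus bandwidth/ {
    gsub(/[[:space:]]/, "", $2); print $2; exit
  }'
}

# Usage: <nccl output> | check_busbw <min_gbps>
# min_gbps is a hypothetical threshold chosen by the caller.
check_busbw() {
  local min="$1" avg
  avg=$(avg_busbw)
  if [ -z "$avg" ]; then
    echo "no 'Avg bus bandwidth' line found" >&2
    return 2
  fi
  # awk performs the floating-point comparison that [ ] cannot.
  if awk -v a="$avg" -v m="$min" 'BEGIN { exit !(a + 0 >= m + 0) }'; then
    echo "PASS: avg bus bandwidth $avg GB/s >= $min GB/s"
  else
    echo "FAIL: avg bus bandwidth $avg GB/s < $min GB/s"
    return 1
  fi
}
```

For example, kubectl logs <launcher-pod-name> | check_busbw <min_gbps> reports PASS or FAIL for a run. Note that the summary value averages every message size in the run, so it can be lower than the bus bandwidth reported for the largest sizes.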
