How-To: Diagnose InfiniBand NIC Issues for NCCL Initialization Hangs

Tanaya Atmaram Kambli

Last Updated: April 30, 2026

Introduction

InfiniBand (IB) is a high-bandwidth, low-latency interconnect used for GPU-to-GPU communication in multi-node training. In Crusoe VMs, IB NICs are exposed via SR-IOV: the physical HCA (Host Channel Adapter) is partitioned into virtual functions (VFs), and each VM receives one or more VFs rather than direct access to the physical function (PF).

This virtualization creates a diagnostic blind spot. Because the VF presents as a complete logical NIC to the guest, fabric-side tooling (switch telemetry, physical port state) may appear healthy even when the VF is experiencing errors. Conversely, hardware-level inspection tools like mlxlink and mst require PF access and will not work inside a VM.

When a multi-node NCCL job hangs at initialization, the most likely causes are IB link state issues, elevated error counters on the VF, or driver/firmware faults surfaced only in dmesg. This guide walks through how to gather that diagnostic data from inside the affected VMs so that either you or Crusoe support can isolate the root cause.

Prerequisites

  • SSH Access to the Affected VMs
  • Affected Node Hostnames (e.g. from scontrol show job <job_id>)
  • Basic Linux CLI Familiarity
  • ibstat and ibstatus Installed
  • perfquery Installed
  • ethtool Installed
  • dmesg Available on the Node
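A quick way to confirm the tooling prerequisites is to test each command name against PATH. The following is a minimal sketch; the function name check_ib_tools is hypothetical and not part of any Crusoe tooling.

```shell
# check_ib_tools TOOL [TOOL...]
# Prints "ok: <tool>" or "MISSING: <tool>" for each name given.
# Returns 1 if at least one tool is missing, 0 otherwise.
check_ib_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "ok: $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_ib_tools ibstat perfquery ethtool dmesg
```

Any tool reported as MISSING needs to be installed before the later steps that use it will work.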

Instructions

Step 1: Identify Device and Interface Names

Before running diagnostics, get the names of the IB devices and network interfaces on the node:

ls /sys/class/infiniband/        # IB device names e.g. mlx5_1, mlx5_2
ls /sys/class/infiniband/*/ports # port numbers per device
ip link show                     # network interface names e.g. ib0, eth0

Use the device and interface names returned here in all subsequent commands.

Step 2: Check IB Port State and Link Health

cat /sys/class/infiniband/*/ports/*/state
cat /sys/class/infiniband/*/ports/*/phys_state
cat /sys/class/infiniband/*/ports/*/rate

What to look for:

  • state should be 4: ACTIVE — anything else (e.g. 1: DOWN, 2: INIT, 3: ARMED) indicates a problem.
  • phys_state should be 5: LinkUp. A value of 2: Polling suggests the port is trying and failing to establish a link.
  • rate should match the expected link speed for your instance type.
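The state and phys_state checks above can be applied to every port in one pass. The sketch below assumes the standard sysfs layout used in the commands in this step; the function name check_ib_ports and its optional root-path argument are hypothetical and exist only to make the logic easy to try out. The rate is printed but not evaluated, because the expected speed depends on the instance type.

```shell
# check_ib_ports [SYSFS_ROOT]
# Prints one line per IB port. A port is marked BAD when state is not
# "4: ACTIVE" or phys_state is not "5: LinkUp".
# Returns 1 if any port is BAD, 0 otherwise (including when no ports exist).
check_ib_ports() {
    local root="${1:-/sys/class/infiniband}"
    local rc=0 port_dir dev port state phys rate verdict
    for port_dir in "$root"/*/ports/*; do
        # Skips unreadable entries and the literal pattern when nothing matches.
        [ -r "$port_dir/state" ] || continue
        dev=$(basename "$(dirname "$(dirname "$port_dir")")")
        port=$(basename "$port_dir")
        state=$(cat "$port_dir/state")
        phys=$(cat "$port_dir/phys_state" 2>/dev/null)
        rate=$(cat "$port_dir/rate" 2>/dev/null)
        verdict=OK
        case "$state" in "4: ACTIVE"*) ;; *) verdict=BAD ;; esac
        case "$phys" in "5: LinkUp"*) ;; *) verdict=BAD ;; esac
        if [ "$verdict" = BAD ]; then
            rc=1
        fi
        echo "$verdict $dev port $port state=[$state] phys_state=[$phys] rate=[$rate]"
    done
    return "$rc"
}

# Example:
#   check_ib_ports || echo "at least one port needs attention"
```

Running this on every node in the job makes it easier to spot the one node whose port differs from the rest.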

Step 3: Check IB Error Counters

cat /sys/class/infiniband/*/ports/*/counters/*
cat /sys/class/infiniband/*/ports/*/hw_counters/*

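# Also via perfquery. Select the device with -C and the port with -P;
# a bare positional argument is parsed as a LID or GUID, not a device name.
# Replace mlx5_1 and port 1 as needed.
perfquery -E -C mlx5_1 -P 1
perfquery -E -x -C mlx5_1 -P 1
perfquery -E -X -C mlx5_1 -P 1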

Any non-zero values in the following counters during a job run are significant:

  • symbol_error / SymbolErrorCounter
  • port_rcv_errors / PortRcvErrors
  • local_link_integrity_errors / LocalLinkIntegrityErrors
  • link_downed / LinkDownedCounter
  • out_of_sequence, implied_nak_seq_err, local_ack_timeout_err (hw_counters)

These counters are cumulative, so compare readings taken before and after the job run. Even a small increase during the run is significant.
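To tell which counters actually moved during a job run, a before/after snapshot can help. The sketch below reads the same sysfs counter files shown in this step; the function names snapshot_ib_counters and diff_ib_counters, the optional root-path argument, and the snapshot file names are all hypothetical.

```shell
# snapshot_ib_counters [SYSFS_ROOT]
# Prints "<path relative to root> <value>" for every readable counter file.
snapshot_ib_counters() {
    local root="${1:-/sys/class/infiniband}"
    local f val
    for f in "$root"/*/ports/*/counters/* "$root"/*/ports/*/hw_counters/*; do
        [ -f "$f" ] || continue
        val=$(cat "$f" 2>/dev/null) || continue   # some counters may be unreadable
        [ -n "$val" ] || continue
        echo "${f#"$root"/} $val"
    done
}

# diff_ib_counters BEFORE_FILE AFTER_FILE
# Prints each counter whose value differs between the two snapshots.
diff_ib_counters() {
    awk 'FILENAME == ARGV[1] { before[$1] = $2; next }
         ($1 in before) && $2 != before[$1] {
             printf "%s %s -> %s (delta %d)\n", $1, before[$1], $2, $2 - before[$1]
         }' "$1" "$2"
}

# Example:
#   snapshot_ib_counters > before.txt
#   ... run the job ...
#   snapshot_ib_counters > after.txt
#   diff_ib_counters before.txt after.txt
```

Traffic counters will always change during a job, so the lines worth reading first are those for the error counters listed above.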

Step 4: Check Global Interface Stats

ip -s link show
ibstatus
ibstat
ethtool <interface>       # e.g. ethtool ib0
ethtool -S <interface>
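
When scanning the ip -s link show output across many interfaces, it can be quicker to list only the error and drop statistics that are non-zero. The sketch below reads the per-interface files under /sys/class/net/<interface>/statistics, which should mirror the figures that ip -s link show reports. The function name report_net_errors and its optional root-path argument are hypothetical.

```shell
# report_net_errors [NET_SYSFS_ROOT]
# Prints "<interface> <statistic> <value>" for every error or drop
# statistic that is currently non-zero.
report_net_errors() {
    local root="${1:-/sys/class/net}"
    local f val iface
    for f in "$root"/*/statistics/*; do
        # Keep only statistics whose name mentions errors or drops.
        case "${f##*/}" in
            *error*|*dropped*) ;;
            *) continue ;;
        esac
        val=$(cat "$f" 2>/dev/null) || continue
        [ -n "$val" ] && [ "$val" != 0 ] || continue
        iface=${f#"$root"/}
        iface=${iface%%/*}
        echo "$iface ${f##*/} $val"
    done
}

# Example:
#   report_net_errors
```

No output means none of the error or drop statistics is above zero on any interface.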

Step 5: Check dmesg for NIC/IB Errors

dmesg | grep -iE "error|fail|link|mlx|infiniband|rdma|ib[0-9]|eth[0-9]" | tail -100

This often surfaces link flap events, driver errors, or firmware faults that do not appear in port state or counter files.
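
If the results are going to Crusoe support, it may help to capture the output of the commands from the steps so far into one directory per node. The following is a minimal sketch, not an official collection tool: the function names save_cmd and collect_ib_diag, the output file names, and the default directory name are all hypothetical. Each command is allowed to fail (for example, if a tool is missing) without stopping the rest of the collection.

```shell
# save_cmd OUTDIR NAME COMMAND [ARGS...]
# Runs COMMAND, writes its combined stdout and stderr to OUTDIR/NAME.txt,
# and records the exit status on the last line. Always returns 0.
save_cmd() {
    local outdir="$1" name="$2" status=0
    shift 2
    mkdir -p "$outdir"
    {
        printf '%s\n' "\$ $*"
        "$@" 2>&1 || status=$?
        echo "[exit status: $status]"
    } > "$outdir/$name.txt"
    return 0
}

# collect_ib_diag [OUTDIR]
# Captures port state, interface stats, and filtered dmesg output.
collect_ib_diag() {
    local outdir="${1:-ib-diag-$(hostname)}"
    save_cmd "$outdir" port-state sh -c 'grep -H . /sys/class/infiniband/*/ports/*/state /sys/class/infiniband/*/ports/*/phys_state /sys/class/infiniband/*/ports/*/rate'
    save_cmd "$outdir" ibstat ibstat
    save_cmd "$outdir" ibstatus ibstatus
    save_cmd "$outdir" ip-link ip -s link show
    save_cmd "$outdir" dmesg sh -c 'dmesg | grep -iE "error|fail|link|mlx|infiniband|rdma|ib[0-9]|eth[0-9]" | tail -100'
    echo "Diagnostics written to $outdir"
}

# Example:
#   collect_ib_diag
```

Recording the exit status alongside each output makes it clear whether an empty file means "no findings" or "the command could not run".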

Step 6: Run a Baseline NCCL Test

Before or after gathering the above, run an all-reduce NCCL test between the affected nodes to confirm whether the IB transport is functional. See How-To: Validate InfiniBand Performance with NCCL All-Reduce Test for setup and expected output.

Known Limitations

  • mlxlink and mst status require direct PCIe access to the physical HCA and will fail inside a VM. If needed, Crusoe support can run these from the physical host side.
  • hw_counters reflect the virtual function (VF) view, not the physical function (PF). They are still useful for catching errors but may not show the full picture.
  • If all in-VM diagnostics look clean, escalate to Crusoe support to run a BMC audit and physical host-level NIC inspection.
