Last Updated: April 30, 2026
Introduction
InfiniBand (IB) is a high-bandwidth, low-latency interconnect used for GPU-to-GPU communication in multi-node training. In Crusoe VMs, IB NICs are exposed via SR-IOV: the physical HCA (Host Channel Adapter) is partitioned into virtual functions (VFs), and each VM receives one or more VFs rather than direct access to the physical function (PF).
This virtualization creates a diagnostic blind spot. Because the VF presents as a complete logical NIC to the guest, fabric-side tooling (switch telemetry, physical port state) may appear healthy even when the VF is experiencing errors. Conversely, hardware-level inspection tools like mlxlink and mst require PF access and will not work inside a VM.
When a multi-node NCCL job hangs at initialization, the most likely causes are IB link state issues, elevated error counters on the VF, or driver/firmware faults surfaced only in dmesg. This guide walks through how to gather that diagnostic data from inside the affected VMs so that either you or Crusoe support can isolate the root cause.
Prerequisites
- SSH Access to the Affected VMs
- Affected Node Hostnames (e.g. from scontrol show job <job_id>)
- Basic Linux CLI Familiarity
- ibstat Installed
- perfquery Installed
- ethtool Installed
- dmesg Available on the Node
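The tool prerequisites above can be verified in one pass before starting. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of any Crusoe tooling.

```shell
# missing_tools TOOL...
# Prints the name of every tool that is not found on PATH, one per line.
# Prints nothing when all tools are present.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

Running `missing_tools ibstat perfquery ethtool dmesg` on each affected node should print nothing if the node is ready for the steps below.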
Instructions
Step 1: Identify Device and Interface Names
Before running diagnostics, get the names of the IB devices and network interfaces on the node:
ls /sys/class/infiniband/          # IB device names, e.g. mlx5_1, mlx5_2
ls /sys/class/infiniband/*/ports   # port numbers per device
ip link show                       # network interface names, e.g. ib0, eth0
Use the device and interface names returned here in all subsequent commands.
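To see which network interface belongs to which IB device, the listings above can be combined into a single table. This is a sketch under assumptions: the function name `list_ib_ports` is hypothetical, the optional sysfs root argument exists only so the function can be pointed at a test directory, and it relies on each device's `device/net/` directory listing the interfaces backed by the same PCI function.

```shell
# list_ib_ports [SYSFS_ROOT]
# Prints "<device> <port> <netdevs>" for every IB port found.
# <netdevs> is a comma-separated list, or "-" when none is exposed.
list_ib_ports() {
    local root="${1:-/sys/class/infiniband}" dev_dir port_dir dev netdevs
    for dev_dir in "$root"/*; do
        [ -d "$dev_dir/ports" ] || continue
        dev="${dev_dir##*/}"
        netdevs=""
        if [ -d "$dev_dir/device/net" ]; then
            netdevs="$(ls "$dev_dir/device/net" | tr '\n' ',')"
            netdevs="${netdevs%,}"
        fi
        for port_dir in "$dev_dir"/ports/*; do
            [ -d "$port_dir" ] || continue
            echo "$dev ${port_dir##*/} ${netdevs:--}"
        done
    done
}
```

Calling `list_ib_ports` with no argument reads the live sysfs tree on the node.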
Step 2: Check IB Port State and Link Health
cat /sys/class/infiniband/*/ports/*/state
cat /sys/class/infiniband/*/ports/*/phys_state
cat /sys/class/infiniband/*/ports/*/rate
What to look for:
- state should be 4: ACTIVE. Anything else (e.g. 1: DOWN, 2: INIT, 3: ARMED) indicates a problem.
- phys_state should be 5: LinkUp. 3: Polling suggests the port is trying and failing to establish a link.
- rate should match the expected link speed for your instance type.
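The checks above can be scripted so that each port is labeled and unhealthy ports stand out. The following is a minimal sketch: `check_ib_ports` is a hypothetical helper, and the optional sysfs root argument is only there so it can be exercised against a test directory. It does not validate the rate, since the expected value depends on the instance type.

```shell
# check_ib_ports [SYSFS_ROOT]
# Prints one OK/FAIL line per port and returns non-zero if any port
# is not in state "4: ACTIVE" with phys_state "5: LinkUp".
check_ib_ports() {
    local root="${1:-/sys/class/infiniband}" port_dir dev state phys rate rc=0
    for port_dir in "$root"/*/ports/*; do
        [ -r "$port_dir/state" ] || continue
        state="$(cat "$port_dir/state")"
        phys="$(cat "$port_dir/phys_state" 2>/dev/null)"
        rate="$(cat "$port_dir/rate" 2>/dev/null)"
        dev="${port_dir%/ports/*}"
        dev="${dev##*/}"
        if [ "$state" = "4: ACTIVE" ] && [ "$phys" = "5: LinkUp" ]; then
            echo "OK   $dev/${port_dir##*/} rate=$rate"
        else
            echo "FAIL $dev/${port_dir##*/} state=$state phys_state=$phys"
            rc=1
        fi
    done
    return "$rc"
}
```

The printed rate still needs to be compared by eye against the expected link speed.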
Step 3: Check IB Error Counters
cat /sys/class/infiniband/*/ports/*/counters/*
cat /sys/class/infiniband/*/ports/*/hw_counters/*
# Also via perfquery. Its positional arguments are a LID and port, so select
# the local device and port with -C and -P (replace mlx5_1 and 1 as needed).
perfquery -E -C mlx5_1 -P 1
perfquery -E -x -C mlx5_1 -P 1
perfquery -E -X -C mlx5_1 -P 1
Any non-zero values in the following counters during a job run are significant:
- symbol_error / SymbolErrors
- port_rcv_errors / RcvErrors
- local_link_integrity_errors
- link_downed / LinkDownedCounter
- out_of_sequence, implied_nak_seq_err, local_ack_timeout_err (hw_counters)
These counters are cumulative, so compare readings taken before and after the job: even a small increase during the run is significant.
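Running cat over a glob prints values without the counter names, which makes them hard to attribute. One way to get labeled values and see what changed during a job is to snapshot the counter files before and after the run and compare the two. This is a sketch only: `snapshot_ib_counters` and `diff_ib_counters` are hypothetical helper names, and the optional sysfs root argument exists so the functions can be run against a test directory.

```shell
# snapshot_ib_counters [SYSFS_ROOT]
# Prints "<device>/ports/<port>/<group>/<counter> <value>" for each counter.
snapshot_ib_counters() {
    local root="${1:-/sys/class/infiniband}" f rel
    for f in "$root"/*/ports/*/counters/* "$root"/*/ports/*/hw_counters/*; do
        [ -f "$f" ] && [ -r "$f" ] || continue
        rel="${f#"$root"/}"
        # Some counters may not be readable on a VF; their value is left empty.
        echo "$rel $(cat "$f" 2>/dev/null)"
    done
}

# diff_ib_counters BEFORE_FILE AFTER_FILE
# Prints every counter whose value differs between the two snapshots.
diff_ib_counters() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && $2 != before[$1] {
             printf "%s %s -> %s\n", $1, before[$1], $2
         }' "$1" "$2"
}
```

A typical sequence would be `snapshot_ib_counters > before.txt`, run the job, `snapshot_ib_counters > after.txt`, then `diff_ib_counters before.txt after.txt`. Traffic counters will always change; the entries to look for are the error counters listed above.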
Step 4: Check Global Interface Stats
ip -s link show
ibstatus
ibstat
ethtool <interface>      # e.g. ethtool ib0
ethtool -S <interface>
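The output of ethtool -S can run to hundreds of lines. A small filter such as the sketch below narrows it to non-zero statistics whose names suggest a problem. The function name `nonzero_error_stats` is hypothetical, and the name pattern (err, drop, discard) is an assumption, since statistic names vary by driver and may miss relevant entries.

```shell
# nonzero_error_stats
# Reads "name: value" lines on stdin (the format printed by ethtool -S) and
# prints those with a non-zero value whose name contains err, drop or discard.
nonzero_error_stats() {
    awk -F': *' 'tolower($1) ~ /err|drop|discard/ && $2 + 0 != 0 {
        sub(/^[ \t]+/, "", $1)   # strip the leading indentation
        print $1 ": " $2
    }'
}
```

For example: `ethtool -S ib0 | nonzero_error_stats`.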
Step 5: Check dmesg for NIC/IB Errors
dmesg | grep -iE "error|fail|link|mlx|infiniband|rdma|ib[0-9]|eth[0-9]" | tail -100
This often surfaces link flap events, driver errors, or firmware faults that do not appear in port state or counter files.
Step 6: Run a Baseline NCCL Test
Before or after gathering the above, run an all-reduce NCCL test between the affected nodes to confirm whether the IB transport is functional. See How-To: Validate InfiniBand Performance with NCCL All-Reduce Test for setup and expected output.
Known Limitations
- mlxlink and mst status require direct PCIe access to the physical HCA and will fail inside a VM. If needed, Crusoe support can run these from the physical host side.
- hw_counters reflect the virtual function (VF) view, not the physical function (PF). They are still useful for catching errors but may not show the full picture.
- If all in-VM diagnostics look clean, escalate to Crusoe support to run a BMC audit and physical host-level NIC inspection.
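When escalating, it helps to attach the output of the steps above as a single file per node. The following is a minimal sketch of such a collector, not an official Crusoe script: `run_section` and `collect_ib_diag` are hypothetical names, the default output file name is an assumption, and only a subset of the commands from this guide is included.

```shell
# run_section OUTPUT_FILE COMMAND...
# Appends a header naming the command, then its output (stdout and stderr).
# A failing or missing command is recorded instead of aborting the run.
run_section() {
    local out="$1"
    shift
    {
        echo "=== $* ==="
        "$@" 2>&1 || echo "(exit status $?)"
        echo
    } >> "$out"
}

# collect_ib_diag [OUTPUT_FILE]
# Runs the in-VM checks and prints the name of the file that was written.
collect_ib_diag() {
    local out="${1:-ib-diag-$(hostname)-$(date +%Y%m%dT%H%M%S).txt}"
    : > "$out"
    # grep -H . prints each sysfs value prefixed with its file name.
    run_section "$out" sh -c 'grep -H . /sys/class/infiniband/*/ports/*/state /sys/class/infiniband/*/ports/*/phys_state /sys/class/infiniband/*/ports/*/rate'
    run_section "$out" sh -c 'grep -H . /sys/class/infiniband/*/ports/*/counters/* /sys/class/infiniband/*/ports/*/hw_counters/*'
    run_section "$out" ibstat
    run_section "$out" ip -s link show
    run_section "$out" sh -c 'dmesg | grep -iE "error|fail|link|mlx|infiniband|rdma|ib[0-9]|eth[0-9]" | tail -100'
    echo "$out"
}
```

Run `collect_ib_diag` on each affected node and include the resulting files in the support request.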