NCCL Hangs and Multi-Node Training Stalls Caused by Failed nvidia-fabricmanager

Sagar Lulla

Last Updated: March 26, 2026

Problem

Multi-node training jobs or NCCL (NVIDIA Collective Communications Library) tests hang indefinitely without throwing a clear error code. The hang typically occurs during NCCL initialization.

NCCL debug logs show each process completing channel and tree setup and then producing no further output. A typical example of where the hang occurs:

<hostname>:2336862:2336938 [4] NCCL INFO P2P Chunksize set to 131072
<hostname>:2336858:2336932 [0] NCCL INFO comm 0x5606c3cf37b0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  0:  0  8
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  1:  1  9
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  2:  2 10
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  3:  3 11
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  4:  4 12
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  5:  5 13
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  6:  6 14
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  7:  7 15
<hostname>:2336858:2336932 [0] NCCL INFO Channel 00/16 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
<hostname>:2336858:2336932 [0] NCCL INFO Channel 01/16 :    0   9  10  11  12  13  14  15   8   1   2   3   4   5   6   7
...
<hostname>:2336858:2336932 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 7/-1/-1->0->1 ...
<hostname>:2336858:2336932 [0] NCCL INFO P2P Chunksize set to 131072
<--- hangs here, no further output --->

If logs are sparse, set the following environment variables before running your job to confirm the hang location:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
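
When many ranks share one terminal, it can be hard to tell where each process stopped. The sketch below assumes NCCL's NCCL_DEBUG_FILE variable is used to write one log per process; the log directory and the helper name last_nccl_lines are illustrative assumptions, not part of the original procedure.

```shell
#!/usr/bin/env bash
# Write one NCCL debug log per process (%h = hostname, %p = PID).
# The directory is an assumption: it must already exist and be writable
# on every node.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export NCCL_DEBUG_FILE="${NCCL_LOG_DIR:-/tmp/nccl-logs}/nccl.%h.%p.log"

# Hypothetical helper: print the last NCCL line of every log in a directory,
# so that processes stalled at the same point are easy to spot.
last_nccl_lines() {
    local dir="$1" f
    for f in "$dir"/nccl.*.log; do
        [ -e "$f" ] || continue
        printf '%s: %s\n' "$(basename "$f")" "$(grep 'NCCL' "$f" | tail -n 1)"
    done
}
```

After a hung run, calling last_nccl_lines /tmp/nccl-logs on each node should show whether every process stopped at the same message, such as the P2P Chunksize line in the example above.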

Cause

The nvidia-fabricmanager service has entered a failed state on one or more GPU nodes in the cluster. The Fabric Manager service is required on NVSwitch-based systems (such as HGX A100 and H100 nodes) to establish inter-GPU communication over NVLink. If it fails on even a single node participating in a multi-node job, the entire NCCL communication ring will stall, causing the hang.

Running systemctl status nvidia-fabricmanager on the affected node(s) will show the service in a failed state:

$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Mon 2026-03-23 02:14:37 UTC; 3h 12min ago
    Process: 48291 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
   Main PID: 48291 (code=exited, status=1/FAILURE)
        CPU: 1.204s

Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.

Additional detail can be found using journalctl:

$ journalctl -u nvidia-fabricmanager --no-pager -n 20
-- Logs begin at Fri 2026-03-21 00:00:00 UTC, end at Mon 2026-03-23 05:27:01 UTC. --
Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.

Prerequisites

  • SSH access to the cluster nodes.
  • sudo or root privileges to manage system services.
  • The list of nodes currently exhibiting the hang behavior.
  • (Optional) pdsh or a similar tool installed for running commands on multiple nodes simultaneously.

Solution

1. Identify the affected node(s)

Check the status of nvidia-fabricmanager on all nodes participating in the job:

systemctl status nvidia-fabricmanager

Look for a status of failed, inactive, or dead.

For multiple nodes, use pdsh to check them simultaneously:

pdsh -w <node-list> "systemctl status nvidia-fabricmanager | head -5"
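
If pdsh is not installed, or a one-word state per node is easier to scan, the check can be sketched with plain SSH and systemctl is-active. This assumes passwordless SSH to every node; the helper names fm_states and fm_unhealthy are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<node> <state>" for each node given as an argument.
# Assumes passwordless SSH to every node.
fm_states() {
    local node state
    for node in "$@"; do
        # is-active prints one word (active, inactive, failed, ...) and exits
        # non-zero for anything other than active, hence the "|| true".
        state=$(ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" \
            systemctl is-active nvidia-fabricmanager 2>/dev/null) || true
        printf '%s %s\n' "$node" "${state:-unreachable}"
    done
}

# Hypothetical helper: list only the nodes whose state is not "active".
fm_unhealthy() {
    fm_states "$@" | awk '$2 != "active" { print $1 }'
}
```

For example, fm_unhealthy node-1 node-2 (placeholder node names) prints only the nodes that need a restart or further investigation.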

2. Restart the service

Single node:

sudo systemctl restart nvidia-fabricmanager

Multiple nodes (via pdsh):

pdsh -w <node-list> "sudo systemctl restart nvidia-fabricmanager"
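
A restart that returns successfully does not by itself show that the service came up. As a minimal sketch, the restart can be followed by a polling loop on systemctl is-active; the helper name restart_fm_and_wait and the 30-second default are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart Fabric Manager, then poll until it reports
# "active" or the timeout (in seconds, default 30) expires.
restart_fm_and_wait() {
    local timeout="${1:-30}" waited=0
    sudo systemctl restart nvidia-fabricmanager || return 1
    while [ "$waited" -lt "$timeout" ]; do
        if systemctl is-active --quiet nvidia-fabricmanager; then
            echo "nvidia-fabricmanager is active after ${waited}s"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "nvidia-fabricmanager did not become active within ${timeout}s" >&2
    return 1
}
```

In the failure shown earlier, the service exited about two seconds after starting, so it is worth repeating the status check a few seconds after this helper reports success.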

3. Verify the fix

Confirm the service is now active (running):

$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
     Active: active (running) since Mon 2026-03-23 05:30:12 UTC; 5s ago
   Main PID: 52104 (nv-fabricmanager)
      Tasks: 18 (limit: 1648576)
     Memory: 42.3M
        CPU: 2.107s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─52104 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Mar 23 05:30:12 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: fabric Manager starting up. Version: 550.127.05
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Successfully queried 6 NVSwitch devices.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Fabric Manager is running and monitoring GPU fabric.

Then re-run a small NCCL test (e.g., on 2 nodes) to confirm connectivity is restored before resubmitting the full training job.
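
Because the original symptom is a silent hang, it helps to run the verification test under a time limit so that a remaining problem shows up as a clear failure. The sketch below uses the coreutils timeout command; the helper name run_or_report_hang is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report a hang
# instead of blocking forever.
# Usage: run_or_report_hang <seconds> <command> [args...]
run_or_report_hang() {
    local limit="$1" rc=0
    shift
    # timeout sends SIGTERM at the limit and exits with 124 when it fires;
    # --kill-after follows up with SIGKILL if the command ignores SIGTERM.
    timeout --kill-after=10 "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "HANG: command did not finish within ${limit}s" >&2
    elif [ "$rc" -ne 0 ]; then
        echo "FAILED: command exited with status ${rc}" >&2
    fi
    return "$rc"
}
```

As an illustration only, a two-node run of the nccl-tests all_reduce_perf benchmark could be wrapped as run_or_report_hang 300 mpirun -np 16 -H node-1:8,node-2:8 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1. The launcher, host names, and binary path are assumptions and will differ between clusters.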

If the Service Fails Again Immediately After Restart

  • Driver version mismatch: Ensure the NVIDIA driver version matches the installed Fabric Manager version. Check with nvidia-smi and nv-fabricmanager --version.
  • Kernel module not loaded: Verify the NVSwitch kernel module is loaded: lsmod | grep nvidia.
  • Persistent NVSwitch errors: Check journalctl -u nvidia-fabricmanager and /var/log/syslog (/var/log/messages on RHEL-based systems) for detailed errors. NVSwitch hardware failures may require node maintenance.
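
The driver version check in the first item above can be sketched as a small comparison. The helper names first_version and fm_driver_match are hypothetical, and the sketch assumes only that each tool's output contains a dotted version number such as 550.127.05.

```shell
#!/usr/bin/env bash
# Extract the first dotted version number (e.g. 550.127.05) from stdin.
first_version() {
    grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Hypothetical helper: compare the version found in the driver output ($1)
# with the version found in the Fabric Manager output ($2).
fm_driver_match() {
    local driver fm
    driver=$(printf '%s\n' "$1" | first_version)
    fm=$(printf '%s\n' "$2" | first_version)
    if [ -n "$driver" ] && [ "$driver" = "$fm" ]; then
        echo "OK: driver and Fabric Manager are both $driver"
    else
        echo "MISMATCH: driver=${driver:-unknown} fabricmanager=${fm:-unknown}"
        return 1
    fi
}
```

On a GPU node this might be called as fm_driver_match "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)" "$(nv-fabricmanager --version 2>&1)". A MISMATCH result suggests reinstalling the Fabric Manager package that corresponds to the installed driver.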

If the Hang Persists After Fabric Manager is Running

  • Verify that hostnames resolve correctly across all nodes.
  • Check that no OS firewall rules are dropping traffic on the NCCL ports.
  • Confirm NVLink topology is healthy with nvidia-smi nvlink -s.
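
The hostname check in the first item above can be sketched with getent, which queries the same resolver sources the system uses. The helper name unresolved_hosts is hypothetical; run it on every node, because resolution can differ between hosts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every name from the argument list that does not
# resolve on the node where this runs.
unresolved_hosts() {
    local host
    for host in "$@"; do
        getent hosts "$host" >/dev/null 2>&1 || echo "$host"
    done
}
```

An empty result means every listed name resolved on that node; any name printed should be fixed in DNS or /etc/hosts before rerunning the job.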
