NCCL Hangs and Multi-Node Training Stalls Caused by Failed nvidia-fabricmanager

Last Updated: March 26, 2026

Problem

Multi-node training jobs or NCCL (NVIDIA Collective Communications Library) tests hang indefinitely without throwing a clear error code. The hang typically occurs during NCCL initialization.

NCCL debug logs will show the process completing channel and tree setup but then stalling indefinitely. A typical example of where the hang occurs:

<hostname>:2336862:2336938 [4] NCCL INFO P2P Chunksize set to 131072
<hostname>:2336858:2336932 [0] NCCL INFO comm 0x5606c3cf37b0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  0:  0  8
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  1:  1  9
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  2:  2 10
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  3:  3 11
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  4:  4 12
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  5:  5 13
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  6:  6 14
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head  7:  7 15
<hostname>:2336858:2336932 [0] NCCL INFO Channel 00/16 :    0   7   6   5   4   3   2   1   8  15  14  13  12  11  10   9
<hostname>:2336858:2336932 [0] NCCL INFO Channel 01/16 :    0   9  10  11  12  13  14  15   8   1   2   3   4   5   6   7
...
<hostname>:2336858:2336932 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 7/-1/-1->0->1 ...
<hostname>:2336858:2336932 [0] NCCL INFO P2P Chunksize set to 131072
<--- hangs here, no further output --->

If logs are sparse, set the following environment variables before running your job to confirm the hang location:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
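
Because the hang produces no error, a reproduction run never exits on its own. As a minimal sketch, the wrapper below sets the two variables above for a single command and uses the coreutils timeout utility to turn a stall into a non-zero exit code (124). The function name run_with_hang_guard and the time limit are illustrative assumptions, not part of NCCL.

```shell
# Hypothetical helper: run a command with NCCL debug logging enabled and
# give up after a time limit, so a hang becomes a visible failure.
# Usage: run_with_hang_guard <seconds> <command> [args...]
run_with_hang_guard() {
  local limit="$1" rc=0
  shift
  # The variable assignments apply only to this one command.
  NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL timeout "$limit" "$@" || rc=$?
  if [ "$rc" -eq 124 ]; then
    # GNU timeout exits with 124 when the time limit was reached.
    echo "command did not finish within ${limit}s (possible NCCL hang)" >&2
  fi
  return "$rc"
}
```

For example, run_with_hang_guard 300 followed by your usual launch command gives the job five minutes before it is stopped; the last NCCL INFO lines printed show where it stalled.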

Cause

The nvidia-fabricmanager service has entered a failed state on one or more GPU nodes in the cluster. Fabric Manager is required on NVSwitch-based systems (such as 8-GPU A100 and H100 servers) to configure the NVSwitch fabric that carries NVLink traffic between GPUs. If it fails on even a single node participating in a multi-node job, that node cannot finish setting up its connections, every other rank waits for it, and the entire NCCL communication ring stalls, causing the hang.

Running systemctl status nvidia-fabricmanager on the affected node(s) will show the service in a failed state:

$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
     Active: failed (Result: exit-code) since Mon 2026-03-23 02:14:37 UTC; 3h 12min ago
    Process: 48291 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
   Main PID: 48291 (code=exited, status=1/FAILURE)
        CPU: 1.204s

Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.

Additional detail can be found using journalctl:

$ journalctl -u nvidia-fabricmanager --no-pager -n 20
-- Logs begin at Fri 2026-03-21 00:00:00 UTC, end at Mon 2026-03-23 05:27:01 UTC. --
Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.

Prerequisites

SSH access to the cluster nodes.
sudo or root privileges to manage system services.
The list of nodes currently exhibiting the hang behavior.
(Optional) pdsh or a similar tool installed for running commands on multiple nodes simultaneously.

Solution

1. Identify the affected node(s)

Check the status of nvidia-fabricmanager on all nodes participating in the job:

systemctl status nvidia-fabricmanager

On the Active: line, look for failed or inactive (dead) instead of active (running).

For multiple nodes, use pdsh to check them simultaneously:

pdsh -w <node-list> "systemctl status nvidia-fabricmanager | head -5"
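
On larger node lists, reading several lines of status per node gets tedious. As a sketch of an alternative, systemctl is-active prints a single word per node (active, inactive, failed, and so on), and pdsh prefixes each line with the node name. The helper below, whose name unhealthy_nodes is hypothetical, reads that output on stdin and prints only the nodes that are not active.

```shell
# Hypothetical helper: read "node: state" lines (the shape of pdsh output)
# and print each node whose state is anything other than "active".
unhealthy_nodes() {
  awk -F': ' '$2 != "active" { print $1 }' | sort -u
}

# Intended use (not run here; <node-list> is a placeholder):
#   pdsh -w <node-list> "systemctl is-active nvidia-fabricmanager" | unhealthy_nodes
```

An empty result means every node reported active. Any node names printed are the ones to restart in the next step.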

2. Restart the service

Single node:

sudo systemctl restart nvidia-fabricmanager

Multiple nodes (via pdsh):

pdsh -w <node-list> "sudo systemctl restart nvidia-fabricmanager"

3. Verify the fix

Confirm the service is now active (running):

$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
     Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
     Active: active (running) since Mon 2026-03-23 05:30:12 UTC; 5s ago
   Main PID: 52104 (nv-fabricmanager)
      Tasks: 18 (limit: 1648576)
     Memory: 42.3M
        CPU: 2.107s
     CGroup: /system.slice/nvidia-fabricmanager.service
             └─52104 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

Mar 23 05:30:12 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: fabric Manager starting up. Version: 550.127.05
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Successfully queried 6 NVSwitch devices.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Fabric Manager is running and monitoring GPU fabric.

Then re-run a small NCCL test (e.g., on 2 nodes) to confirm connectivity is restored before resubmitting the full training job.
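
A single status check right after a restart can be misleading: in the failure log shown under Cause, systemd reported the service as started at 02:14:35 and it aborted two seconds later. The sketch below polls a status command several times and only succeeds if every poll returns active. The function name stays_active and the poll counts are illustrative assumptions.

```shell
# Hypothetical helper: succeed only if a status command prints "active"
# on every one of <checks> polls, spaced <interval> seconds apart.
# Usage: stays_active <checks> <interval> <status-command> [args...]
stays_active() {
  local checks="$1" interval="$2" i=0 state
  shift 2
  while [ "$i" -lt "$checks" ]; do
    # systemctl is-active exits non-zero when not active; keep its output only.
    state="$("$@" 2>/dev/null || true)"
    if [ "$state" != "active" ]; then
      echo "check $((i + 1)): state is '${state:-unknown}'" >&2
      return 1
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 0
}

# Intended use on a node (not run here):
#   stays_active 5 2 systemctl is-active nvidia-fabricmanager
```

Polling for about ten seconds, as in the usage line, covers the startup window seen in the example log; adjust the values for your environment.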

If the Service Fails Again Immediately After Restart

Driver version mismatch: The Fabric Manager version must match the installed NVIDIA driver version exactly (for example, both 550.127.05). Check with nvidia-smi and nv-fabricmanager --version.
Kernel module not loaded: Verify the NVIDIA kernel modules are loaded: lsmod | grep nvidia. In current drivers the NVSwitch driver is part of the main nvidia module rather than a separate one.
Persistent NVSwitch errors: Check journalctl -u nvidia-fabricmanager and the system log (/var/log/syslog, or /var/log/messages on RHEL-based systems) for detailed errors. NVSwitch hardware failures may require node maintenance.
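
The version comparison in the first item can be scripted. This is a sketch under two assumptions: that nvidia-smi --query-gpu=driver_version --format=csv,noheader prints the driver version one line per GPU, and that nv-fabricmanager --version prints a dotted version number somewhere in its output. The function names extract_version and check_fm_driver_match are hypothetical.

```shell
# Pull the first dotted version number (e.g. 550.127.05) out of stdin.
extract_version() {
  grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Hypothetical helper: compare driver and Fabric Manager versions locally.
check_fm_driver_match() {
  local driver fm
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | extract_version)"
  fm="$(nv-fabricmanager --version 2>&1 | extract_version)"
  if [ -n "$driver" ] && [ "$driver" = "$fm" ]; then
    echo "OK: driver and Fabric Manager are both $driver"
  else
    echo "MISMATCH: driver='${driver:-unknown}' fabricmanager='${fm:-unknown}'" >&2
    return 1
  fi
}
```

Run check_fm_driver_match on each affected node. A mismatch usually means one of the two packages was upgraded without the other.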

If the Hang Persists After Fabric Manager is Running

Verify that hostnames resolve correctly across all nodes.
Check that no OS firewall rules are dropping traffic on the NCCL ports.
Confirm NVLink topology is healthy with nvidia-smi nvlink -s.
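
The NVLink status output lists every link of every GPU, so a single inactive link is easy to miss. The sketch below assumes the output has a "GPU N: ..." header line per GPU followed by "Link N: ..." lines, with a down link reported as <inactive>; verify this against the output of your driver version before relying on it. The function name inactive_links is hypothetical.

```shell
# Hypothetical helper: read NVLink status text on stdin and print one line
# per link that is reported as inactive, tagged with the GPU it belongs to.
inactive_links() {
  awk '
    /^GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu) }
    /Link [0-9]+:/ && /inactive/ {
      link = $2; sub(/:$/, "", link)
      print "GPU " gpu " link " link
    }
  '
}

# Intended use on a node (not run here):
#   nvidia-smi nvlink -s | inactive_links
```

No output means no link was reported as inactive under the assumed format.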

Additional Resources

NVIDIA Fabric Manager User Guide
