Last Updated: March 26, 2026
Problem
Multi-node training jobs or NCCL (NVIDIA Collective Communications Library) tests hang indefinitely without throwing a clear error code. The hang typically occurs during NCCL initialization.
NCCL debug logs will show the process completing channel and tree setup but then stalling indefinitely. A typical example of where the hang occurs:
<hostname>:2336862:2336938 [4] NCCL INFO P2P Chunksize set to 131072
<hostname>:2336858:2336932 [0] NCCL INFO comm 0x5606c3cf37b0 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 0: 0 8
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 1: 1 9
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 2: 2 10
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 3: 3 11
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 4: 4 12
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 5: 5 13
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 6: 6 14
<hostname>:2336858:2336932 [0] NCCL INFO NVLS Head 7: 7 15
<hostname>:2336858:2336932 [0] NCCL INFO Channel 00/16 : 0 7 6 5 4 3 2 1 8 15 14 13 12 11 10 9
<hostname>:2336858:2336932 [0] NCCL INFO Channel 01/16 : 0 9 10 11 12 13 14 15 8 1 2 3 4 5 6 7
...
<hostname>:2336858:2336932 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] 7/-1/-1->0->1 ...
<hostname>:2336858:2336932 [0] NCCL INFO P2P Chunksize set to 131072
<--- hangs here, no further output --->
If logs are sparse, set the following environment variables before running your job to confirm the hang location:
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
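Because the job hangs rather than failing, it can help to put a time limit on a short test run so the stall surfaces as an exit code. The following is a minimal sketch, assuming GNU coreutils `timeout` is available; the function name `run_with_hang_guard` and the choice of time limit are illustrative, not part of NCCL. Depending on the launcher, the environment variables may also need to be forwarded to remote ranks explicitly.

```shell
#!/usr/bin/env bash
# Sketch: run a command with NCCL debug logging and a time limit.
# run_with_hang_guard is a hypothetical helper name used for illustration.

# run_with_hang_guard <seconds> <command...>
# Returns the command's exit status, or 124 if `timeout` had to stop it.
run_with_hang_guard() {
  local limit="$1"
  shift
  NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=ALL timeout "$limit" "$@"
}

# Run only when a limit and a command are supplied, e.g.:
#   ./hang_guard.sh 600 <your launch command>
if [ "$#" -gt 1 ]; then
  status=0
  run_with_hang_guard "$@" || status=$?
  if [ "$status" -eq 124 ]; then
    echo "Command exceeded the time limit; this is consistent with a hang" >&2
  fi
  exit "$status"
fi
```

An exit status of 124 only says the time limit was reached; the NCCL log still has to be read to confirm where the job stopped.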
Cause
The nvidia-fabricmanager service has entered a failed state on one or more GPU nodes in the cluster. The Fabric Manager service is required on NVSwitch-based systems (such as HGX and DGX platforms with A100 or H100 GPUs) to configure the NVLink fabric that carries inter-GPU communication. If it fails on even a single node participating in a multi-node job, NCCL initialization cannot complete on that node, and every other rank waits on it, causing the hang.
Running systemctl status nvidia-fabricmanager on the affected node(s) will show the service in a failed state:
$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
Active: failed (Result: exit-code) since Mon 2026-03-23 02:14:37 UTC; 3h 12min ago
Process: 48291 ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg (code=exited, status=1/FAILURE)
Main PID: 48291 (code=exited, status=1/FAILURE)
CPU: 1.204s
Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Additional detail can be found using journalctl:
$ journalctl -u nvidia-fabricmanager --no-pager -n 20
-- Logs begin at Fri 2026-03-21 00:00:00 UTC, end at Mon 2026-03-23 05:27:01 UTC. --
Mar 23 02:14:35 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: fabric Manager starting up. Version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: Fabric Manager CUDA driver interface version: 550.127.05
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Failed to query NVSwitch devices from driver. Check that the NVIDIA kernel module and NVSwitch driver are loaded.
Mar 23 02:14:37 <hostname> nv-fabricmanager[48291]: ERROR: Aborting Fabric Manager.
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Main process exited, code=exited, status=1/FAILURE
Mar 23 02:14:37 <hostname> systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Prerequisites
- SSH access to the cluster nodes.
- sudo or root privileges to manage system services.
- The list of nodes currently exhibiting the hang behavior.
- (Optional) pdsh or a similar tool installed for running commands on multiple nodes simultaneously.
Solution
1. Identify the affected node(s)
Check the status of nvidia-fabricmanager on all nodes participating in the job:
systemctl status nvidia-fabricmanager
In the Active: line of the output, look for a status of failed or inactive (dead).
For multiple nodes, use pdsh to check them simultaneously:
pdsh -w <node-list> "systemctl status nvidia-fabricmanager | head -5"
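Where pdsh is not installed, the same check can be approximated with a plain SSH loop. The sketch below assumes passwordless SSH to each node and uses `systemctl is-active`, which prints a single state word per unit. The names `fm_state`, `report_unhealthy`, and the `CHECK_CMD` override are hypothetical helpers introduced here for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list nodes whose nvidia-fabricmanager unit is not "active".
# Assumes passwordless SSH; node names are passed as arguments.

# Query one node. CHECK_CMD is a hypothetical override hook so the logic
# can be exercised without SSH; by default systemctl runs on the remote node.
fm_state() {
  local node="$1"
  if [ -n "${CHECK_CMD:-}" ]; then
    "$CHECK_CMD" "$node"
  else
    ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$node" \
      systemctl is-active nvidia-fabricmanager 2>/dev/null
  fi
}

# Print "<node> <state>" for every node that is not active.
report_unhealthy() {
  local node state
  for node in "$@"; do
    state="$(fm_state "$node")" || true
    # An empty state means no answer came back from the node at all.
    [ -n "$state" ] || state="unreachable"
    if [ "$state" != "active" ]; then
      printf '%s %s\n' "$node" "$state"
    fi
  done
}

# Run only when node names are supplied, e.g.: ./check_fm.sh node01 node02
if [ "$#" -gt 0 ]; then
  report_unhealthy "$@"
fi
```

Only unhealthy nodes are printed, so empty output means every listed node reported the service as active.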
2. Restart the service
Single node:
sudo systemctl restart nvidia-fabricmanager
Multiple nodes (via pdsh):
pdsh -w <node-list> "sudo systemctl restart nvidia-fabricmanager"
3. Verify the fix
Confirm the service is now active (running):
$ systemctl status nvidia-fabricmanager
● nvidia-fabricmanager.service - NVIDIA Fabric Manager Service
Loaded: loaded (/usr/lib/systemd/system/nvidia-fabricmanager.service; enabled; preset: disabled)
Active: active (running) since Mon 2026-03-23 05:30:12 UTC; 5s ago
Main PID: 52104 (nv-fabricmanager)
Tasks: 18 (limit: 1648576)
Memory: 42.3M
CPU: 2.107s
CGroup: /system.slice/nvidia-fabricmanager.service
└─52104 /usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg
Mar 23 05:30:12 <hostname> systemd[1]: Started NVIDIA Fabric Manager Service.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: fabric Manager starting up. Version: 550.127.05
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Successfully queried 6 NVSwitch devices.
Mar 23 05:30:14 <hostname> nv-fabricmanager[52104]: Fabric Manager is running and monitoring GPU fabric.
Then re-run a small NCCL test (e.g., on 2 nodes) to confirm connectivity is restored before resubmitting the full training job.
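In the failure logs shown under Cause, the service exited about two seconds after systemd reported it as started, so a single status check immediately after a restart may be misleading. The sketch below polls until the unit is active and then checks again after a short settle period. The function names, the `SYSTEMCTL` override, and the 30-second and 10-second values are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: confirm nvidia-fabricmanager reaches "active" and stays there.
# SYSTEMCTL is a hypothetical override so the logic can be tested with a stub.

unit_state() {
  "${SYSTEMCTL:-systemctl}" is-active "$1" 2>/dev/null
}

# wait_active <unit> <timeout_seconds>: poll once per second until active.
wait_active() {
  local unit="$1" timeout="${2:-30}" elapsed=0 state
  while [ "$elapsed" -lt "$timeout" ]; do
    state="$(unit_state "$unit")" || true
    if [ "$state" = "active" ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# confirm_stable <unit> <timeout_seconds> <settle_seconds>:
# wait for active, pause, then require the unit to still be active.
confirm_stable() {
  local unit="$1" timeout="${2:-30}" settle="${3:-10}" state
  wait_active "$unit" "$timeout" || return 1
  sleep "$settle"
  state="$(unit_state "$unit")" || true
  [ "$state" = "active" ]
}

# Run only when invoked with "check", e.g.: ./verify_fm.sh check
if [ "${1:-}" = "check" ]; then
  if confirm_stable nvidia-fabricmanager 30 10; then
    echo "nvidia-fabricmanager is active and stable"
  else
    echo "nvidia-fabricmanager is not healthy" >&2
    exit 1
  fi
fi
```

A passing check here only covers the service itself; the small NCCL test is still the confirmation that inter-node communication works.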
If the Service Fails Again Immediately After Restart
- Driver version mismatch: Ensure the NVIDIA driver version matches the installed Fabric Manager version. Check with nvidia-smi and nv-fabricmanager --version.
- Kernel module not loaded: Verify the NVSwitch kernel module is loaded: lsmod | grep nvidia.
- Persistent NVSwitch errors: Check journalctl -u nvidia-fabricmanager and /var/log/syslog for detailed errors. NVSwitch hardware failures may require node maintenance.
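The driver version comparison in the first item above can be scripted. This is a rough sketch: it assumes `nvidia-smi --query-gpu=driver_version --format=csv,noheader` reports the driver version and that the output of `nv-fabricmanager --version` contains a dotted version number somewhere in its text (the exact wording of that output is not assumed). The helper names `extract_version` and `versions_match` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare the NVIDIA driver version with the Fabric Manager version.

# Print the first dotted version number (e.g. 550.127.05) found in a string.
extract_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+(\.[0-9]+)+' | head -n 1
}

# Succeed only if both strings contain the same, non-empty version number.
versions_match() {
  local a b
  a="$(extract_version "$1")" || true
  b="$(extract_version "$2")" || true
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Run the comparison only on a node where both tools are installed.
if command -v nvidia-smi >/dev/null 2>&1 &&
   command -v nv-fabricmanager >/dev/null 2>&1; then
  driver="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  fm="$(nv-fabricmanager --version 2>&1)"
  if versions_match "$driver" "$fm"; then
    echo "Driver and Fabric Manager versions match: $(extract_version "$driver")"
  else
    echo "Version mismatch: driver='$driver' fabricmanager='$fm'" >&2
  fi
fi
```

The comparison is an exact string match on the extracted version, which mirrors the guidance above that the two versions should be the same.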
If the Hang Persists After Fabric Manager is Running
- Verify that hostnames resolve correctly across all nodes.
- Check that no OS firewall rules are dropping traffic on the NCCL ports.
- Confirm NVLink topology is healthy with nvidia-smi nvlink -s.
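The hostname check in the list above can be sketched with `getent hosts`, which resolves a name through the same lookup order the system uses (for example /etc/hosts, then DNS). The helper names are hypothetical, and the script only tests resolution from the node it runs on, so it would need to be run on each participating node, for example via pdsh.

```shell
#!/usr/bin/env bash
# Sketch: report hostnames that do not resolve on the local node.

# Succeed if the name resolves through the system's configured lookup order.
resolves() {
  getent hosts "$1" >/dev/null 2>&1
}

# Print every hostname from the arguments that fails to resolve.
unresolved_hosts() {
  local host
  for host in "$@"; do
    resolves "$host" || printf '%s\n' "$host"
  done
}

# Run only when hostnames are supplied, e.g.: ./check_dns.sh node01 node02
if [ "$#" -gt 0 ]; then
  missing="$(unresolved_hosts "$@")"
  if [ -n "$missing" ]; then
    echo "Unresolved hostnames:" >&2
    printf '%s\n' "$missing" >&2
    exit 1
  fi
  echo "All hostnames resolve"
fi
```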