PyTorch NCCL DNS Resolution Failures During Distributed Training Initialization

Overview

During distributed training initialization, PyTorch's c10d communication library may fail to resolve the IPv6 addresses of peer nodes, eventually timing out and preventing training from starting. This appears in the training logs as repeated "gai error: -3 - Temporary failure in name resolution" warnings, followed by a TCP connection timeout after 30 minutes.

The suspected cause is systemd-resolved entering a failed or degraded state on the affected VM. This trigger has not yet been fully validated and is still under investigation. In the cases observed so far, restarting the VM, which also restarts the service, has resolved the issue.

Importantly, the DNS resolution failure occurs on the affected VM (the one logging the errors), even though the error message references a different peer node as the target.
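
To confirm that name resolution is failing locally, the lookup can be approximated from the affected VM with Python's standard library. The sketch below is a diagnostic aid under stated assumptions, not PyTorch's own code: check_resolution is a hypothetical helper, and the hostname and port you pass in should come from the failing log line. It calls socket.getaddrinfo, which reports the same class of gai errors; on Linux, socket.EAI_AGAIN corresponds to "Temporary failure in name resolution".

```python
import socket


def check_resolution(host, port, family=socket.AF_UNSPEC):
    """Try to resolve host:port on this machine; return (ok, detail)."""
    try:
        infos = socket.getaddrinfo(host, port, family=family, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # EAI_AGAIN is the "temporary failure" code seen in the training logs.
        if exc.errno == socket.EAI_AGAIN:
            kind = "temporary failure (EAI_AGAIN)"
        else:
            kind = "resolution error"
        return False, f"{kind}: gai error {exc.errno} - {exc.strerror}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)
```

Running check_resolution(peer_hostname, port, socket.AF_INET6) and then again with socket.AF_INET on the VM that logs the warnings shows whether the failure is specific to IPv6 lookups or affects all resolution on that VM. If the same call succeeds from a healthy VM, that supports the conclusion that the problem is local to the affected VM.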

Prerequisites

SSH access to the affected VM
Access to the Crusoe console or API

Steps

Check the Affected VM's State
- In the Crusoe console or via the API, check whether the VM is in an ERROR state.
- An ERROR state indicates a possible infrastructure-level issue on the underlying host and is a strong signal to contact Crusoe Support before taking further action.
- To identify which VM is affected in a large distributed job, correlate the failing rank number in the training logs with the corresponding VM. For example, if rank 64 of 256 is logging errors, that maps to a specific VM in your fleet.
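
The rank-to-VM correlation can be sketched as follows, assuming the common convention that global_rank = node_rank * gpus_per_node + local_rank. The helper name locate_rank, the figure of 8 GPUs per VM, and the hostnames are all hypothetical; check your launcher's hostfile or rendezvous configuration for the actual node ordering before relying on the result.

```python
def locate_rank(global_rank, gpus_per_node, hostnames):
    """Map a global rank to (hostname, local_rank) under contiguous rank assignment."""
    world_size = gpus_per_node * len(hostnames)
    if not 0 <= global_rank < world_size:
        raise ValueError(f"rank {global_rank} is outside world size {world_size}")
    node_rank, local_rank = divmod(global_rank, gpus_per_node)
    return hostnames[node_rank], local_rank


# Hypothetical fleet: 32 VMs with 8 GPUs each (256 ranks), listed in node-rank order.
hostnames = [f"train-node-{i:02d}" for i in range(32)]
print(locate_rank(64, 8, hostnames))  # ('train-node-08', 0)
```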
Back Up Any Data on the Affected VM
- If the VM is still accessible via SSH, back up any in-progress work, checkpoints, or datasets to a persistent disk before proceeding.
- Recovery actions (VM restart or host-level intervention by Crusoe Support) may result in loss of data stored on ephemeral/local disks.
Contact Crusoe Support if the VM Is in ERROR State
- If the VM is in ERROR state, open a support ticket and share the VM ID and the training logs showing the gai error: -3 warnings.
- Crusoe Support will investigate the underlying host and perform any necessary recovery actions (e.g., host reboot or VM migration).
Restart the VM
- Once the host has been confirmed healthy (or if the VM is not in ERROR state), stop and restart the VM.
- A restart also restarts systemd-resolved and should clear any failed state in the service, which is the suspected cause of the DNS resolution failures.
- After restart, verify the service is healthy:

     systemctl status systemd-resolved
     journalctl -u systemd-resolved --since "1 hour ago"
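
For scripted health checks across many VMs, the same verification can be wrapped in Python. This is a minimal sketch that assumes systemctl is available on the VM's PATH; unit_state and is_healthy are hypothetical helper names. It relies on systemctl is-active, which prints the unit's current state (for example active, inactive, or failed).

```python
import shutil
import subprocess


def unit_state(unit="systemd-resolved"):
    """Return the state printed by `systemctl is-active`, or None if unavailable."""
    if shutil.which("systemctl") is None:
        return None
    # is-active exits non-zero for non-active units, so do not raise on failure.
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True, text=True, check=False,
    )
    return result.stdout.strip() or None


def is_healthy(state):
    """Treat only 'active' as healthy; 'failed', 'inactive', or None are not."""
    return state == "active"
```

Calling is_healthy(unit_state()) on each VM before launching the job gives a quick pass/fail signal; anything other than a healthy result is a cue to inspect the journalctl output above.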

Re-Run Distributed Training
- Retry NCCL initialization. If DNS resolution is healthy, the gai error: -3 warnings should not reappear and distributed training should initialize successfully.

Resolution

The following describes how this issue was resolved in a confirmed case:

The customer reported repeated gai error: -3 warnings on one node in a 256-GPU distributed training job, followed by a 30-minute TCP timeout.
The affected VM was found to be in an ERROR state. Crusoe Support was engaged.
The VM was stopped and the customer confirmed no active workload or data requiring preservation was on the node.
Crusoe Support performed a host-level power cycle.
The VM was restarted and came up healthy on a new host.
The customer confirmed NCCL initialization completed successfully on retry.
