Overview
During distributed training initialization, PyTorch's c10d communication library may fail to resolve IPv6 addresses for peer nodes, ultimately timing out and preventing training from starting. This manifests as repeated gai error: -3 - Temporary failure in name resolution warnings in training logs, followed by a TCP connection timeout after 30 minutes.
The suspected cause is systemd-resolved entering a failed or degraded state on the affected VM, although this trigger has not yet been validated and is still under investigation. Restarting the VM, which also restarts the service, has resolved the issue in observed cases.
Importantly, the DNS resolution failure occurs on the affected VM (the one logging the errors), even though the error message references a different peer node as the target.
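To confirm that lookups are failing locally, the same kind of name lookup can be attempted from the affected VM. The following is a minimal sketch, not part of c10d: the function name check_resolution and the default port 29500 are assumptions made for illustration, and the hostname passed in should be the peer named in the warning. It relies on the fact that on Linux the getaddrinfo error code EAI_AGAIN has the value -3, which matches the "Temporary failure in name resolution" text in the logs.

```python
import socket


def check_resolution(hostname, port=29500, resolver=socket.getaddrinfo):
    """Run one name lookup on the local VM and return (ok, detail).

    `port` is an assumed default used only for illustration, and
    `resolver` is injectable so the function can be exercised without DNS.
    """
    try:
        results = resolver(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        # On Linux, EAI_AGAIN is -3: "Temporary failure in name resolution".
        if exc.errno == socket.EAI_AGAIN:
            return False, f"temporary resolver failure (gai error {exc.errno}): {exc.strerror}"
        return False, f"lookup failed (gai error {exc.errno}): {exc.strerror}"
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the resolved address for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in results})
    return True, ", ".join(addresses)
```

Calling check_resolution with the peer hostname from the warning, on the VM that logged it, shows whether the local resolver is at fault: a "temporary resolver failure" result there, alongside a successful lookup of the same name from a healthy VM, points at the local VM rather than the peer.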
Prerequisites
- SSH Access to the Affected VM
- Access to the Crusoe Console or API
Steps
-
Check the Affected VM's State
- In the Crusoe console or via the API, check whether the VM is in an ERROR state.
- An ERROR state indicates a possible infrastructure-level issue on the underlying host and is a strong signal to contact Crusoe Support before taking further action.
- To identify which VM is affected in a large distributed job, correlate the failing rank number in the training logs with the corresponding VM. For example, if rank 64 of 256 is logging errors, that maps to a specific VM in your fleet.
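The rank-to-VM correlation can be sketched as follows, under the assumption that the launcher assigns global ranks contiguously per node in the same order as the host list. That layout is common but not guaranteed, so check it against your launcher configuration. The helper name rank_to_node, the hostnames, and the figure of 8 GPUs per VM are all hypothetical.

```python
def rank_to_node(rank, gpus_per_node, hostnames):
    """Map a global rank to (node_index, local_rank, hostname).

    Assumes ranks are assigned contiguously per node, in the order the
    hosts appear in `hostnames`. Verify this against your launcher.
    """
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")
    if rank < 0:
        raise ValueError("rank must be non-negative")
    node_index, local_rank = divmod(rank, gpus_per_node)
    if node_index >= len(hostnames):
        raise ValueError(f"rank {rank} is outside a fleet of {len(hostnames)} nodes")
    return node_index, local_rank, hostnames[node_index]


# Hypothetical fleet: 32 VMs with 8 GPUs each, 256 ranks in total.
hosts = [f"vm-{i:02d}" for i in range(32)]
print(rank_to_node(64, 8, hosts))  # -> (8, 0, 'vm-08')
```

If the training logs already include the hostname next to the rank, prefer that over a computed mapping.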
-
Back Up Any Data on the Affected VM
- If the VM is still accessible via SSH, back up any in-progress work, checkpoints, or datasets to a persistent disk before proceeding.
- Recovery actions (VM restart or host-level intervention by Crusoe Support) may result in loss of data stored on ephemeral/local disks.
-
Contact Crusoe Support if the VM Is in ERROR State
- If the VM is in ERROR state, open a support ticket and share the VM ID and the training logs showing the gai error: -3 warnings.
- Crusoe Support will investigate the underlying host and perform any necessary recovery actions (e.g., host reboot or VM migration).
-
Restart the VM
- Once the host has been confirmed healthy (or if the VM is not in ERROR state), stop and restart the VM.
- A restart will clear any failed systemd-resolved state, which is the suspected cause of the DNS resolution failures.
- After restart, verify the service is healthy:
systemctl status systemd-resolved
journalctl -u systemd-resolved --since "1 hour ago"
-
Re-Run Distributed Training
- Retry NCCL initialization. If DNS resolution is healthy, the gai error: -3 warnings should not reappear and distributed training should initialize successfully.
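Because a failed retry can take up to 30 minutes to time out, it may be worth gating the relaunch on the resolver service state. The sketch below is one possible way to do that and is not part of any Crusoe or PyTorch tooling: the function names are hypothetical, and it assumes systemctl is available on the VM. It uses systemctl is-active, which prints a single state word such as active, inactive, or failed.

```python
import subprocess


def resolved_state(runner=subprocess.run):
    """Return the state word printed by `systemctl is-active systemd-resolved`.

    `runner` is injectable for testing. With the default runner, a missing
    systemctl binary raises FileNotFoundError.
    """
    proc = runner(
        ["systemctl", "is-active", "systemd-resolved"],
        capture_output=True,
        text=True,
        check=False,  # is-active exits non-zero when the unit is not active
    )
    return proc.stdout.strip()


def safe_to_retry(state):
    """Treat only an 'active' resolver service as ready for a relaunch."""
    return state == "active"
```

A launch wrapper could call safe_to_retry(resolved_state()) on each node and skip the relaunch if any node reports something other than active. An active service does not by itself prove that lookups succeed, so treat this as a quick first check only.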
Resolution
The following describes how this issue was resolved in a confirmed case:
- Customer reported repeated gai error: -3 warnings on one node in a 256-GPU distributed training job, followed by a 30-minute TCP timeout.
- The affected VM was found to be in an ERROR state. Crusoe Support was engaged.
- The VM was stopped and the customer confirmed no active workload or data requiring preservation was on the node.
- Crusoe Support performed a host-level power cycle.
- The VM was restarted and came up healthy on a new host.
- The customer confirmed NCCL initialization completed successfully on retry.