Introduction
When a distributed training job fails with an NCCL collective timeout, the default configuration often does not capture enough information to diagnose what went wrong. You may see a message like:
Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
The flight recorder that PyTorch provides for its NCCL backend is an in-memory ring buffer that logs every collective operation (all-reduce, all-gather, broadcast, etc.) as it executes. When it is disabled, you can see that the job crashed but not which specific operation was in flight, which rank caused the hang, or what the last successful collective was before the failure. Without this trace, NCCL timeout failures are a black box.
Setting the two environment variables below enables the flight recorder and configures PyTorch to automatically dump the trace to disk the moment a timeout is detected. That way, if the failure reproduces, you have the full operation sequence captured and preserved — even if the process exits abruptly.
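To make the ring-buffer behavior concrete, here is a toy model in Python. It is only an illustration under simplifying assumptions: ToyFlightRecorder and its fields (seq, op, state) are hypothetical names, not PyTorch's implementation or dump schema. It shows why the buffer size matters: once the buffer is full, the oldest records are evicted, so the dump contains only the most recent operations.

```python
from collections import deque


class ToyFlightRecorder:
    """Toy ring buffer of collective records (illustrative, not PyTorch's code)."""

    def __init__(self, max_entries):
        # A deque with maxlen drops the oldest entry once it is full.
        self._entries = deque(maxlen=max_entries)
        self._next_seq = 0

    def record(self, op_name):
        """Log that a collective has started and return its record."""
        entry = {"seq": self._next_seq, "op": op_name, "state": "started"}
        self._next_seq += 1
        self._entries.append(entry)
        return entry

    def complete(self, entry):
        """Mark a previously recorded collective as finished."""
        entry["state"] = "completed"

    def dump(self):
        """Return the records currently held, oldest first."""
        return list(self._entries)


recorder = ToyFlightRecorder(max_entries=3)
for op in ["broadcast", "all_reduce", "all_gather", "all_reduce"]:
    recorder.complete(recorder.record(op))
recorder.record("all_reduce")  # never completed: simulates the hang

for entry in recorder.dump():
    print(entry)
```

With a capacity of three, the dump holds only the last three records, and the final one is still in the "started" state, which is the kind of evidence a real trace provides after a timeout.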
Prerequisites
- PyTorch With NCCL Backend (torch.distributed)
- Ability to Set Environment Variables in Your Job Launch Script or Container
Instructions
- Set the NCCL Debug Environment Variables
  - Add the following to your job launch script, torchrun command, or container environment before starting training:
      export TORCH_NCCL_TRACE_BUFFER_SIZE=2097152
      export TORCH_NCCL_DUMP_ON_TIMEOUT=1
  - These variables take effect on the next run. No training code changes are required.
    - TORCH_NCCL_TRACE_BUFFER_SIZE: Sets the capacity of the in-memory ring buffer used to record NCCL operations. Any non-zero value enables the flight recorder. PyTorch's Flight Recorder documentation describes this value as a number of entries rather than a size in bytes, so 2097152 keeps roughly two million records, which is more than sufficient for most workloads. If you have very long runs with many collectives, you can increase this further.
    - TORCH_NCCL_DUMP_ON_TIMEOUT: Tells PyTorch to automatically dump the flight recorder trace to disk when a collective timeout is detected, so the data is preserved even after the process crashes. Set to 1 to enable.
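If the job is started from a Python wrapper rather than a shell script, the same two variables can be placed in the child process environment. The following is a minimal sketch: flight_recorder_env is a hypothetical helper (not a PyTorch API), and the torchrun arguments and train.py shown in the comment are placeholders for your own launch command.

```python
import os


def flight_recorder_env(base_env=None, buffer_size=2097152, dump_on_timeout=True):
    """Return a copy of base_env with the two flight recorder variables set.

    Hypothetical helper for illustration; defaults mirror the values above.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive to enable the flight recorder")
    # Copy so the caller's environment mapping is not modified in place.
    env = dict(os.environ if base_env is None else base_env)
    env["TORCH_NCCL_TRACE_BUFFER_SIZE"] = str(buffer_size)
    env["TORCH_NCCL_DUMP_ON_TIMEOUT"] = "1" if dump_on_timeout else "0"
    return env


if __name__ == "__main__":
    env = flight_recorder_env()
    # Pass `env` to the launcher, for example (placeholder command):
    #   subprocess.run(["torchrun", "--nproc_per_node=8", "train.py"], env=env, check=True)
    print(env["TORCH_NCCL_TRACE_BUFFER_SIZE"], env["TORCH_NCCL_DUMP_ON_TIMEOUT"])
```

Building a separate environment dictionary keeps the settings scoped to the training process instead of changing the wrapper's own environment.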
- Collect the Debug Output on Failure
  - If a timeout occurs, PyTorch writes a debug dump for each rank. The location and file name depend on your PyTorch version and on the TORCH_NCCL_DEBUG_INFO_TEMP_FILE setting: check the working directory of the process for files named like nccl_trace_*.json, and also the /tmp/nccl_trace_rank_ prefix that recent PyTorch releases use by default. Share the following with Crusoe Support when opening a ticket:
    - The dump file produced by TORCH_NCCL_DUMP_ON_TIMEOUT
    - The full NCCL log output (set NCCL_DEBUG=INFO if not already enabled)
    - The rank and node where the timeout was first reported
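Gathering the files for a support ticket can be scripted. The sketch below bundles any dump files and extra logs into a single archive. The function name bundle_debug_output, the default nccl_trace_* pattern, and the directories and log path in the usage note are assumptions; adjust them to wherever your PyTorch version actually writes its dumps.

```python
import glob
import os
import tarfile


def bundle_debug_output(search_dirs, archive_path, patterns=("nccl_trace_*",), extra_files=()):
    """Pack matching dump files and extra logs into a .tar.gz; return the paths added.

    Hypothetical helper: the default pattern is an assumption about dump file names.
    """
    found = []
    for directory in search_dirs:
        for pattern in patterns:
            matches = glob.glob(os.path.join(directory, pattern))
            found.extend(sorted(p for p in matches if os.path.isfile(p)))
    found.extend(p for p in extra_files if os.path.isfile(p))
    # Drop duplicates while keeping the discovery order.
    unique = list(dict.fromkeys(found))
    with tarfile.open(archive_path, "w:gz") as archive:
        for path in unique:
            # Store by base name; same-named files from different directories would collide.
            archive.add(path, arcname=os.path.basename(path))
    return unique
```

A call such as bundle_debug_output([os.getcwd(), "/tmp"], "nccl_debug_bundle.tar.gz", extra_files=["train.log"]) would then produce one file to attach, where train.log stands in for wherever your NCCL_DEBUG=INFO output was captured.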
Example
A customer running a distributed training job across 8 H100 nodes sees the job fail with an NCCL timeout after several hours. The error message includes the flight recorder notice, but no stack trace. They set TORCH_NCCL_TRACE_BUFFER_SIZE=2097152 and TORCH_NCCL_DUMP_ON_TIMEOUT=1 in their torchrun launch script and rerun. When the timeout reproduces, a dump file is written to the working directory showing that rank 3 on node gpu-worker-7 was stuck in an all-reduce operation while the other ranks were waiting at a subsequent barrier. The customer attaches this dump, the NCCL debug logs, and the affected node name to a Crusoe support ticket, and the networking team is able to trace the stall to a specific InfiniBand link.
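The reasoning in this example (one rank is still in an earlier collective while the others have moved on) can be sketched in a few lines of Python. This is an illustration on synthetic data only: find_lagging_ranks is a hypothetical function, and the seq and op keys are assumed field names, not the schema of the real dump, which varies by PyTorch version.

```python
def find_lagging_ranks(entries_by_rank):
    """Return (ranks, seq, op) for the ranks whose newest record is the oldest.

    entries_by_rank maps rank -> list of dicts with assumed keys
    "seq" (int, increasing per rank) and "op" (str).
    """
    # Newest record seen on each rank that has any records at all.
    last = {
        rank: max(entries, key=lambda e: e["seq"])
        for rank, entries in entries_by_rank.items()
        if entries
    }
    min_seq = min(entry["seq"] for entry in last.values())
    ranks = sorted(rank for rank, entry in last.items() if entry["seq"] == min_seq)
    return ranks, min_seq, last[ranks[0]]["op"]


# Synthetic data mirroring the scenario: rank 3 never got past all_reduce 41,
# while every other rank has already issued barrier 42.
entries = {
    rank: [{"seq": 41, "op": "all_reduce"}, {"seq": 42, "op": "barrier"}]
    for rank in range(8)
}
entries[3] = [{"seq": 41, "op": "all_reduce"}]

print(find_lagging_ranks(entries))  # ([3], 41, 'all_reduce')
```

Comparing the newest record across ranks is only a first-pass heuristic; the node name and NCCL logs are still needed to tell a slow rank from a faulty link.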