How-To Enable NCCL Debug Logging for Collective Timeout Failures

Tanaya Atmaram Kambli

Introduction

When a distributed training job fails with an NCCL collective timeout, the default NCCL configuration often doesn't capture enough information to diagnose what went wrong. You may see a message like:

Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.

The flight recorder, part of PyTorch's NCCL backend, is an in-memory ring buffer that records each collective operation (all-reduce, all-gather, broadcast, etc.) as it is issued. When it is disabled, you can see that the job crashed but not which operation was in flight, which rank caused the hang, or which collective last completed before the failure. Without this trace, NCCL timeout failures are a black box.

Setting the two environment variables below enables the flight recorder and configures PyTorch to dump the trace to disk automatically when a timeout is detected. If the failure reproduces, the recent operation sequence is then captured and preserved, even if the process exits abruptly.

Prerequisites

  • PyTorch With NCCL Backend (torch.distributed)
  • Ability to Set Environment Variables in Your Job Launch Script or Container

Instructions

  1. Set the NCCL Debug Environment Variables
    • Add the following to your job launch script, torchrun command, or container environment before starting training:

      export TORCH_NCCL_TRACE_BUFFER_SIZE=2097152
      export TORCH_NCCL_DUMP_ON_TIMEOUT=1
    • These variables take effect on the next run. No training code changes are required.
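      • TORCH_NCCL_TRACE_BUFFER_SIZE: Sets the capacity of the in-memory ring buffer used to record NCCL operations. According to the PyTorch flight recorder documentation, the value is a number of entries rather than a size in bytes, so 2097152 retains roughly the last two million operations, which is sufficient for most workloads. Any non-zero value enables the recorder, and the PyTorch documentation suggests 2000 as a starting point. Because the buffer keeps only the most recent entries, increase the value if you need a longer history before the failure.
      • TORCH_NCCL_DUMP_ON_TIMEOUT: Tells PyTorch to dump the flight recorder trace to disk automatically when a collective timeout is detected, so the data is preserved even after the process crashes. Set to 1 to enable.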
  2. Collect the Debug Output on Failure
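    • If a timeout occurs, PyTorch writes a debug dump for each rank. The location and file name depend on your PyTorch version: recent versions default to the prefix /tmp/nccl_trace_rank_ followed by the rank number, and the prefix can be changed with TORCH_NCCL_DEBUG_INFO_TEMP_FILE. If you do not find the files there, check the working directory of the process. Share the following with Crusoe Support when opening a ticket: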
      • The dump file produced by TORCH_NCCL_DUMP_ON_TIMEOUT
      • The full NCCL log output (set NCCL_DEBUG=INFO if not already enabled)
      • The rank and node where the timeout was first reported
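
The settings from step 1 can be wrapped in a small helper that a launch script calls before starting training. This is a minimal sketch, not a Crusoe-provided tool: the function name enable_nccl_flight_recorder and the commented-out torchrun line are hypothetical, and the helper assumes you want values that are already set in the environment to take precedence.

```shell
# Enable the PyTorch NCCL flight recorder for the next run.
# Values already set in the environment are left untouched.
enable_nccl_flight_recorder() {
    # Size of the flight recorder ring buffer; any non-zero value enables it.
    export TORCH_NCCL_TRACE_BUFFER_SIZE="${TORCH_NCCL_TRACE_BUFFER_SIZE:-2097152}"
    # Dump the recorded trace when a collective timeout is detected.
    export TORCH_NCCL_DUMP_ON_TIMEOUT="${TORCH_NCCL_DUMP_ON_TIMEOUT:-1}"
    # Verbose NCCL logging, which support asks for alongside the dump.
    export NCCL_DEBUG="${NCCL_DEBUG:-INFO}"
}

enable_nccl_flight_recorder

# Hypothetical launch line; replace it with your own torchrun command:
# torchrun --nproc_per_node=8 train.py
```

Because the variables are exported, every process started afterwards from the same script inherits them. On multi-node jobs, the helper has to run on each node.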

Example

A customer running a distributed training job across 8 H100 nodes sees the job fail with an NCCL timeout after several hours. The error message includes the flight recorder notice, but no stack trace. They set TORCH_NCCL_TRACE_BUFFER_SIZE=2097152 and TORCH_NCCL_DUMP_ON_TIMEOUT=1 in their torchrun launch script and rerun. When the timeout reproduces, the resulting dump shows that rank 3 on node gpu-worker-7 was stuck in an all-reduce operation while the other ranks were waiting at a subsequent barrier. The customer attaches this dump, the NCCL debug logs, and the affected node name to a Crusoe support ticket, and the networking team is able to trace the stall to a specific InfiniBand link.
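
The attachment step in this example could be scripted roughly as follows. This is a sketch under stated assumptions: the function name collect_nccl_artifacts is hypothetical, the dump prefix and log directory are passed in by the caller because their locations depend on your PyTorch version and job setup, and NCCL logs are assumed to be saved as files ending in .log.

```shell
# Bundle flight recorder dumps and NCCL logs into one archive for a ticket.
# Usage: collect_nccl_artifacts DUMP_PREFIX LOG_DIR OUT_TAR
collect_nccl_artifacts() {
    dump_prefix=$1
    log_dir=$2
    out_tar=$3
    staging=$(mktemp -d) || return 1

    # Copy every dump file whose path starts with the given prefix.
    for f in "$dump_prefix"*; do
        if [ -f "$f" ]; then
            cp "$f" "$staging/"
        fi
    done

    # Copy NCCL debug logs (assumed to end in .log).
    for f in "$log_dir"/*.log; do
        if [ -f "$f" ]; then
            cp "$f" "$staging/"
        fi
    done

    # Record the name of the node the bundle was collected on.
    uname -n > "$staging/node.txt"

    tar -czf "$out_tar" -C "$staging" .
    status=$?
    rm -rf "$staging"
    return "$status"
}

# Example call with assumed paths; adjust them to your environment:
# collect_nccl_artifacts /tmp/nccl_trace_rank_ ./logs ./nccl_debug_bundle.tar.gz
```

Run the helper on each node that produced a dump, since the dump files are local to the node where each rank ran.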
