How-To Detect Silent Data Corruption in Distributed Training Jobs

Tanaya Atmaram Kambli

Introduction

Silent data corruption (SDC) occurs when a hardware fault causes subtly incorrect data to flow through your training pipeline without triggering a hard failure. The job keeps running, loss curves look plausible for a while, and then training diverges or collapses with no error to point to.

SDC is particularly hard to diagnose because standard GPU diagnostics do not reproduce the conditions that cause it. A clean dcgmi diag -r 3 pass does not rule it out. This guide walks through how to isolate the affected node, collect relevant hardware signals, and confirm whether SDC is occurring.

This guide focuses on NVSwitch/NVLink ECC errors as a potential hardware signal. If your node shows no NVSwitch events in Step 2, SDC may still be present. In such cases, open a support ticket with your reproducibility test results and Crusoe Support will investigate further.

Prerequisites

  • Root or sudo Access to the Affected VMs
  • Access to VM Kernel Logs (/var/log/messages or dmesg)
  • NVIDIA Bug Report (If Already Collected via nvidia-bug-report.sh)

Instructions

  1. Isolate the Affected Node
    • If you are running a multi-node job and suspect only one node is causing divergence, run the same job across different node combinations and observe which grouping consistently produces divergent results.
    • A node-by-node ablation — swapping one node at a time — is the most reliable way to pinpoint the offending node before pulling logs.
  2. Check for Hardware Error Signals
    • Once you have a suspect node, check for NVSwitch ECC rate-limit events, which may indicate marginal NVLink performance under collective traffic load.
    • These errors surface in two places:
      • Kernel logs — run on the affected VM:

        grep -E "SXid.*1202[13]" /var/log/messages

        Or using dmesg:

        dmesg | grep -E "SXid.*1202[13]"
      • NVIDIA bug report — if you have already collected a bug report via nvidia-bug-report.sh, search the output for SXids like 12021 or 12023. The bug report aggregates kernel messages, NVSwitch event logs, and driver state in one place and may surface events not visible in a plain dmesg snapshot.
    • Note the affected NVSwitch device and link number. Bidirectional errors (both SXid 12021 and 12023) on the same link are a stronger signal than either alone.
    • Cross-reference error timestamps against your training job schedule to see whether the errors coincide with active collective workloads.
    • ℹ️ Note: These events are labeled "Non-fatal" by the NVIDIA driver and are correctable under normal conditions. Their presence alone does not confirm SDC, but rate-limit events under sustained load warrant further investigation.
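The same-link pairing check in Step 2 can be sketched in Python. This is a minimal sketch, not a Crusoe or NVIDIA tool. The regular expression assumes each SXid line names the switch as nvidia-nvswitch<N> and the link as Link <M>; verify that against the actual lines on your node and adjust it if they differ. The function names are hypothetical.

```python
import re
from collections import defaultdict

# ASSUMPTION: log lines look roughly like
#   ... nvidia-nvswitch<N>: SXid (...): <code>, Non-fatal, Link <M> ...
# Adjust the pattern if the lines on your node are shaped differently.
SXID_RE = re.compile(
    r"nvidia-nvswitch(?P<switch>\d+).*?SXid.*?\b(?P<code>1202[13])\b"
    r".*?Link\s+(?P<link>\d+)"
)


def group_sxid_events(lines):
    """Map (switch, link) -> set of SXid codes (12021/12023) seen on that link."""
    seen = defaultdict(set)
    for line in lines:
        match = SXID_RE.search(line)
        if match:
            key = (int(match.group("switch")), int(match.group("link")))
            seen[key].add(match.group("code"))
    return dict(seen)


def bidirectional_links(lines):
    """Return only the links that logged both SXid 12021 and SXid 12023."""
    both = {"12021", "12023"}
    return {k: v for k, v in group_sxid_events(lines).items() if v == both}
```

Either function accepts any iterable of text lines, such as an open file handle on a saved dmesg capture. Links returned by bidirectional_links are the ones worth noting first for the support ticket.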

  3. Run a Reproducibility Test
    • Run a single-node 8-GPU training job (for example, a basic 8B or 70B model on a synthetic dataset) 2–3 times with the same seed, data order, and configuration, then compare the loss curves across runs.
    • Indicators that suggest SDC:
      • Loss curves diverge between runs that should be deterministic
      • NaN values appear in loss or gradient metrics
    • This test is the most reliable way to distinguish SDC from application-level issues when hardware diagnostics return clean.
  4. Open a Support Ticket
    • If hardware error signals and training divergence point to SDC, open a Crusoe support ticket and include:
      • Kernel log excerpt with any SXid events and timestamps
      • NVSwitch device and link number
      • Training job configuration (node count, framework, collective library)
      • Loss curve comparisons from Step 3
      • Output of dcgmi diag -r 3
    • Crusoe Support will investigate the node and determine next steps, which may include draining the host and reserving a spare.
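
The loss-curve comparison from Step 3 can be sketched in Python. This is a minimal sketch under stated assumptions: each run logged one loss value per step, and the runs used the same seed, data order, and deterministic settings. The function name and the rel_tol threshold are hypothetical. Pick a tolerance that matches how reproducible your framework is expected to be.

```python
import math


def compare_loss_curves(run_a, run_b, rel_tol=1e-6):
    """Compare two per-step loss lists from runs that should be deterministic.

    Returns (status, step) where status is "non_finite", "diverged", or
    "match", and step is the first offending step index (None for "match").
    rel_tol is an illustrative threshold, not a recommended value.
    """
    for step, (a, b) in enumerate(zip(run_a, run_b)):
        # NaN or infinite loss in either run is reported immediately.
        if not (math.isfinite(a) and math.isfinite(b)):
            return ("non_finite", step)
        # Otherwise flag the first step where the runs disagree beyond tolerance.
        if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=0.0):
            return ("diverged", step)
    return ("match", None)
```

The step index of the first mismatch is useful in the support ticket, because it shows how early in training the runs stopped agreeing.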

Example

A 23-node distributed training job consistently produces divergent loss curves whenever a specific node is included in the run, but the job never errors or crashes. A node-by-node ablation isolates the offending node. Kernel logs on that node show SXid 12021 and 12023 events on the same NVLink port, with timestamps matching earlier training runs. A single-node reproducibility test on the isolated node produces inconsistent loss curves across runs, confirming SDC. The node is escalated to Crusoe Support for hardware investigation.
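
The node-by-node ablation in this example can be sketched in Python. This is a minimal sketch that assumes a single faulty node, one spare node available for swapping, and divergence that reproduces reliably whenever the faulty node takes part. The function names and node names are hypothetical, and launching the jobs is left to your scheduler.

```python
def swap_plan(job_nodes, spare):
    """Yield (removed_node, trial_nodes): each trial swaps one node for the spare."""
    for node in job_nodes:
        yield node, [spare if n == node else n for n in job_nodes]


def suspect_nodes(runs):
    """Narrow down suspects from (nodes, diverged) results.

    A node stays suspect only if it took part in every diverged run
    and in no clean run. ASSUMES exactly one faulty node.
    """
    suspects = None
    cleared = set()
    for nodes, diverged in runs:
        nodes = set(nodes)
        if diverged:
            suspects = nodes if suspects is None else suspects & nodes
        else:
            cleared |= nodes
    return (suspects or set()) - cleared
```

Run each trial from swap_plan, record whether its loss curve diverged, and pass the results to suspect_nodes. If more than one name comes back, the trials so far do not separate those nodes and further combinations are needed.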
