Introduction
The NVIDIA bug report captures a wide snapshot of driver, hardware, and kernel state — but for GPU hardware triage, two sections are the most critical: ECC error counters and row remapping status. These tell you whether a GPU has experienced uncorrectable memory failures, whether it has successfully recovered via row remapping, or whether it needs to be flagged for replacement.
This article covers how to read those sections, what the failure thresholds are, and what action to take in each case. It applies regardless of whether the bug report was captured via Command Center or manually.
Prerequisites
- NVIDIA Bug Report Captured — See How-To Capture NVIDIA Logs via Command Center (preferred) or How-To Capture NVIDIA Logs (manual)
Known Failure Modes
The following conditions automatically qualify a GPU for degradation and replacement. Both are determined from the ECC and row remapping sections of the bug report output.
Uncorrectable ECC SRAM Errors
Any uncorrectable ECC error in SRAM — in either the Volatile or Aggregate counter — is a known failure mode. SRAM uncorrectable errors cannot be remapped and indicate permanent hardware damage:
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable : 1 <-- known failure mode
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable : 2 <-- known failure mode
DRAM Correctable : 0
DRAM Uncorrectable : 0
If either counter is greater than 0, the GPU should be replaced. Submit the logs via a support ticket.
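When many reports need triage, the SRAM check can be scripted. The following is a minimal Python sketch, not NVIDIA tooling. It assumes the ECC section uses the same labels as the excerpt above and that the text passed in covers a single GPU. The names `sram_uncorrectable_counts` and `has_sram_failure` are hypothetical helpers introduced for illustration.

```python
import re

# Matches lines such as "SRAM Uncorrectable : 1"; trailing annotations are ignored.
# Assumption: labels match the excerpt above; other driver versions may differ.
_SRAM_UNCORRECTABLE = re.compile(r"SRAM Uncorrectable\s*:\s*(\d+)")


def sram_uncorrectable_counts(report_text):
    """Return SRAM Uncorrectable counts keyed by 'Volatile' and 'Aggregate'."""
    counts = {}
    scope = None  # counter group the current line belongs to
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped in ("Volatile", "Aggregate"):
            scope = stripped
            continue
        match = _SRAM_UNCORRECTABLE.match(stripped)
        if match and scope is not None:
            counts[scope] = int(match.group(1))
    return counts


def has_sram_failure(report_text):
    """True if either SRAM Uncorrectable counter is greater than 0."""
    counts = sram_uncorrectable_counts(report_text)
    return any(value > 0 for value in counts.values())
```

Run against the excerpt above, `has_sram_failure` would return `True`, because both the Volatile and Aggregate counters are non-zero.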
Row Remapping Failure with No Pending Remap
Remapping Failure Occurred: Yes together with Pending: No means the GPU tried to remap a failing DRAM row, the remap failed, and no further remap is queued that could recover it. This state is not recoverable without replacement:
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : Yes <-- known failure mode
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 1 bank(s)
Submit the logs via a support ticket.
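The same condition can be detected programmatically. The following is a minimal Python sketch, assuming the Remapped Rows section uses the labels shown in the excerpt above. It reads only the first Remapped Rows section it finds, so a multi-GPU report would need to be split per GPU first. The function names are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Assumption: field labels match the excerpt above.
_FIELD = re.compile(r"(Pending|Remapping Failure Occurred)\s*:\s*(Yes|No)\b")


def remapped_rows_status(report_text):
    """Return Pending and Remapping Failure Occurred from the first Remapped Rows section."""
    status = {}
    in_section = False
    for line in report_text.splitlines():
        stripped = line.strip()
        if stripped == "Remapped Rows":
            in_section = True
            continue
        if in_section:
            match = _FIELD.match(stripped)
            if match:
                status[match.group(1)] = match.group(2)
                if len(status) == 2:
                    break  # both fields found; later sections are ignored
    return status


def is_unrecoverable_remap_failure(status):
    """True when a remap failed and no remap is pending."""
    return (
        status.get("Remapping Failure Occurred") == "Yes"
        and status.get("Pending") == "No"
    )
```

Scoping the search to the Remapped Rows section matters because `Pending` also appears as a label in other parts of the `nvidia-smi` output, such as the ECC mode settings.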
Row Remapping Pending — Recoverable Path
This is a distinct, recoverable case. The GPU has encountered an error and queued a row remapping event, but the remap has not executed yet. It requires a GPU reset followed by a full VM reboot to complete:
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 1
Pending : Yes <-- Notable
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 639 bank(s)
High : 0 bank(s)
Partial : 1 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Reset the GPU:
nvidia-smi -r
Then reboot the VM. The reset triggers the pending remap, but a full reboot is required for it to be committed and reflected in the output.
⚠️ Warning: A GPU reset alone is not sufficient. The VM must be fully rebooted after the reset for the row remapping to complete. Checking the output before rebooting will still show Pending: Yes.
After reboot, verify that both Pending and Remapping Failure Occurred read No. If either still reads Yes, attach the bug report to a support ticket.
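The decision logic in this article can be summarized as a small function. The following is a Python sketch under stated assumptions: the function name `triage`, its parameters, and the returned action strings are hypothetical labels chosen for illustration. The article does not cover the combination of Pending: Yes with Remapping Failure Occurred: Yes, so the sketch routes that case to a support ticket as an assumption.

```python
def triage(sram_volatile, sram_aggregate, pending, failure_occurred, after_reboot=False):
    """Map the values read from a bug report to a suggested action.

    sram_volatile / sram_aggregate: SRAM Uncorrectable counters (ints).
    pending / failure_occurred: Remapped Rows fields as booleans (Yes -> True).
    after_reboot: True if the report was captured after the reset and VM reboot.
    """
    # Any uncorrectable SRAM error is a known failure mode.
    if sram_volatile > 0 or sram_aggregate > 0:
        return "replace"
    # A failed remap with nothing pending is not recoverable.
    if failure_occurred and not pending:
        return "replace"
    # A pending remap with no failure is recoverable by reset plus VM reboot.
    if pending and not failure_occurred and not after_reboot:
        return "reset_and_reboot"
    # Still pending after the reboot, or a combination the article does not
    # cover (assumption): escalate with the bug report attached.
    if pending or failure_occurred:
        return "support_ticket"
    return "no_action"
```

For example, `triage(0, 0, True, False)` returns `"reset_and_reboot"`, while the same values with `after_reboot=True` return `"support_ticket"`.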