
How to Interpret NVIDIA Bug Report Output

Matt Roark

Introduction

The NVIDIA bug report captures a wide snapshot of driver, hardware, and kernel state. For GPU hardware triage, two sections matter most: the ECC error counters and the row remapping status. Together they tell you whether a GPU has experienced uncorrectable memory failures, whether it has successfully recovered via row remapping, or whether it needs to be flagged for replacement.

This article covers how to read those sections, what the failure thresholds are, and what action to take in each case. It applies regardless of whether the bug report was captured via Command Center or manually.

Prerequisites

Known Failure Modes

The following conditions automatically qualify a GPU for degradation and replacement. Both are determined from the ECC and row remapping sections of the bug report output.

Uncorrectable ECC SRAM Errors

Any uncorrectable ECC error in SRAM, in either the Volatile or the Aggregate counter, is a known failure mode. SRAM uncorrectable errors cannot be remapped and indicate permanent hardware damage:

ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 1 <-- known failure mode
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 2 <-- known failure mode
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0

If either counter is greater than 0, the GPU should be replaced. Submit the logs via a support ticket.
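The check above can be sketched in a few lines of Python. This is a minimal illustration, not an official tool: the function names `sram_uncorrectable_counts` and `needs_replacement` are hypothetical, and the parser assumes the text passed in is a single GPU's ECC Errors section laid out exactly as in the listing above (other driver versions may label the counters differently).

```python
import re


def sram_uncorrectable_counts(ecc_section):
    """Return the SRAM Uncorrectable count per scope (Volatile / Aggregate).

    Assumes one GPU's "ECC Errors" section in the layout shown in this article.
    """
    counts = {}
    scope = None
    for line in ecc_section.splitlines():
        stripped = line.strip()
        if stripped in ("Volatile", "Aggregate"):
            scope = stripped  # remember which counter group we are inside
            continue
        match = re.match(r"SRAM Uncorrectable\s*:\s*(\d+)", stripped)
        if match and scope is not None:
            counts[scope] = int(match.group(1))
    return counts


def needs_replacement(ecc_section):
    """True if either SRAM Uncorrectable counter is greater than 0."""
    return any(n > 0 for n in sram_uncorrectable_counts(ecc_section).values())
```

Passing the ECC Errors block from a bug report to `needs_replacement` returns `True` when either the Volatile or the Aggregate SRAM Uncorrectable counter is non-zero, which is the condition for opening a support ticket.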

Row Remapping Failure with No Pending Remap

Remapping Failure Occurred: Yes together with Pending: No means the GPU tried to remap a failing DRAM row, the attempt failed, and no further remap is queued that could recover it. This is not recoverable without replacement:

Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : Yes <-- known failure mode
        Bank Remap Availability Histogram
            Max                           : 639 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 1 bank(s)

Submit the logs via a support ticket.
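As a rough sketch, the same condition can be tested programmatically. The helper names `read_field` and `is_unrecoverable_remap_failure` are hypothetical, and the code assumes it receives one GPU's Remapped Rows section in the layout shown above.

```python
import re


def read_field(section_text, name):
    """Return the first token after 'name :' in the section, or None if absent."""
    pattern = rf"^\s*{re.escape(name)}\s*:\s*(\S+)"
    match = re.search(pattern, section_text, re.MULTILINE)
    return match.group(1) if match else None


def is_unrecoverable_remap_failure(section_text):
    """True when a remapping failure occurred and no remap is pending."""
    failure = read_field(section_text, "Remapping Failure Occurred")
    pending = read_field(section_text, "Pending")
    return failure == "Yes" and pending == "No"
```

A `True` result corresponds to the known failure mode described above. Any other combination of the two fields returns `False` and should be read manually.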

Row Remapping Pending — Recoverable Path

This is a distinct, recoverable case. The GPU has encountered an error and queued a row remapping event, but the remap has not executed yet. It requires a GPU reset followed by a full VM reboot to complete:

Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 1
        Pending                           : Yes <-- remap pending (recoverable)
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 639 bank(s)
            High                          : 0 bank(s)
            Partial                       : 1 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)

Reset the GPU:

nvidia-smi -r

Then reboot the VM. The reset triggers the pending remap, but a full reboot is required for it to be committed and reflected in the output.

⚠️ Warning: A GPU reset alone is not sufficient. The VM must be fully rebooted after the reset for the row remapping to complete. Checking the output before rebooting will still show Pending: Yes.

After the reboot, verify that both Pending and Remapping Failure Occurred read No. If either still reads Yes, attach the bug report to a support ticket.
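The post-reboot verification could be scripted along these lines. This is only a sketch: `fields_still_yes` is a hypothetical helper, and it assumes you pass it the Remapped Rows section for a single GPU, captured after the reboot, in the layout shown in this article.

```python
import re

# The two fields that must both read "No" after the reset and reboot.
FIELDS_TO_CHECK = ("Pending", "Remapping Failure Occurred")


def fields_still_yes(section_text):
    """Return the names of checked fields that still read 'Yes'.

    Raises ValueError if a field is missing, so a truncated capture is not
    mistaken for a healthy GPU.
    """
    still_yes = []
    for name in FIELDS_TO_CHECK:
        pattern = rf"^\s*{re.escape(name)}\s*:\s*(Yes|No)\b"
        match = re.search(pattern, section_text, re.MULTILINE)
        if match is None:
            raise ValueError(f"field not found in section: {name}")
        if match.group(1) == "Yes":
            still_yes.append(name)
    return still_yes
```

An empty list means the remap completed. A non-empty list names the fields that did not clear, in which case the bug report should be attached to a support ticket.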

Additional Resources
