How-To Capture NVIDIA Bug Report

Matt Roark

Introduction

ℹ️ Note: If the Crusoe Watch Agent is installed on your VM (version vm-v1.0.3 or later), the preferred method for capturing NVIDIA logs is via the Crusoe Cloud Console. See How-To Capture NVIDIA Logs via Command Center for the recommended path.

This article covers manual log capture for standalone VMs where the Crusoe Watch Agent is not installed, or as a fallback when in-console generation is unavailable.

When a GPU error occurs, three data sources are needed for triage: the NVIDIA bug report (a comprehensive snapshot of driver, hardware, and kernel state), the ECC error counters from nvidia-smi, and Xid error codes from the kernel log. Each targets a different layer of the NVIDIA stack and can fail independently — capturing all three gives Crusoe teams what they need to diagnose and resolve hardware issues quickly.

Prerequisites

  • SSH access to the Crusoe VM, with sudo privileges

Instructions

Step 1: Capture the NVIDIA Bug Report, ECC State, and Xid Errors

Whenever a GPU error occurs, capture the NVIDIA bug report, query the ECC state, and retrieve Xid errors from the kernel log. By default, nvidia-bug-report.sh writes its output to nvidia-bug-report.log.gz in the current directory:

sudo nvidia-bug-report.sh
nvidia-smi -q -d ECC
sudo dmesg | grep Xid
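
The ECC and Xid commands above print to the terminal, so it can help to save their output to files before opening a ticket. The following is a minimal sketch, not an official Crusoe tool: OUT_DIR, run_capture, capture_all, and the output file names are hypothetical names chosen for illustration. It assumes sudo does not prompt for a password and that nvidia-bug-report.sh writes nvidia-bug-report.log.gz to the current directory.

```shell
#!/usr/bin/env bash
# Sketch: collect the three Step 1 artifacts into one directory.
# OUT_DIR and all output file names are illustrative assumptions.
set -u

OUT_DIR="${OUT_DIR:-$PWD/gpu-triage-$(date +%Y%m%d-%H%M%S)}"

# run_capture FILE CMD...: run CMD, saving stdout and stderr to FILE.
# A failing command is noted in capture-errors.txt and does not stop the rest.
run_capture() {
  file=$1
  shift
  if ! "$@" >"$file" 2>&1; then
    echo "capture failed: $*" >>"$OUT_DIR/capture-errors.txt"
  fi
}

capture_all() {
  mkdir -p "$OUT_DIR" && cd "$OUT_DIR" || return 1
  # The bug report itself lands in ./nvidia-bug-report.log.gz by default.
  run_capture bug-report-stdout.txt sudo nvidia-bug-report.sh
  run_capture ecc.txt nvidia-smi -q -d ECC
  run_capture dmesg.txt sudo dmesg
  # grep exits non-zero when nothing matches, which is not an error here.
  grep Xid dmesg.txt >xid.txt || true
  echo "Artifacts written to $OUT_DIR"
}

# Only run on a machine that has the NVIDIA tools installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  capture_all
else
  echo "nvidia-smi not found; nothing captured" >&2
fi
```

Because each capture is wrapped separately, a failure in one data source (for example, a hung nvidia-smi that exits with an error) still leaves the other outputs in place.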

ℹ️ Note: If nvidia-bug-report.sh hangs, the NVIDIA client tools may be unable to communicate with the nvidia.ko kernel driver. In that case, rerun the script with --safe-mode to bypass the hung driver interface:

sudo nvidia-bug-report.sh --safe-mode
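
If you are scripting the capture, the hang case can be handled with a time limit. This is a rough sketch under stated assumptions: REPORT_TIMEOUT, primary_cmd, fallback_cmd, and run_with_fallback are hypothetical names, and the 600-second limit is an arbitrary value rather than a recommendation from NVIDIA or Crusoe. It relies on the coreutils timeout command, which exits with status 124 when the limit is reached. A process stuck in uninterruptible kernel sleep may not respond to the signal, so treat this as a convenience, not a guarantee.

```shell
#!/usr/bin/env bash
# Sketch: retry with --safe-mode if the normal run exceeds a time limit.
# REPORT_TIMEOUT is an assumed value; tune it for your system.
set -u

REPORT_TIMEOUT="${REPORT_TIMEOUT:-600}"

# timeout runs under sudo so it is allowed to signal the root-owned script.
primary_cmd() { sudo timeout "$REPORT_TIMEOUT" nvidia-bug-report.sh; }
fallback_cmd() { sudo nvidia-bug-report.sh --safe-mode; }

run_with_fallback() {
  if primary_cmd; then
    status=0
  else
    status=$?
  fi
  # timeout(1) reports an expired time limit with exit status 124.
  if [ "$status" -eq 124 ]; then
    echo "bug report timed out after ${REPORT_TIMEOUT}s; retrying with --safe-mode" >&2
    if fallback_cmd; then
      status=0
    else
      status=$?
    fi
  fi
  return "$status"
}

# Only run on a machine that has the NVIDIA driver tools installed.
if command -v nvidia-bug-report.sh >/dev/null 2>&1; then
  run_with_fallback
fi
```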

Step 2: Check for NVSwitch / Fabric Layer Errors

The NVLink and NVSwitch fabric layer reports its own SXid error codes, separate from the GPU-side Xid codes. SXid errors are reported through NVIDIA Fabric Manager and indicate fabric-level failures rather than per-GPU hardware faults. For the full SXid code reference, see NVIDIA's Fabric Manager documentation.
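
As a starting point, fabric-level entries can be separated from GPU-side entries in the kernel log. The sketch below assumes that SXid events appear in the kernel log with the literal string "SXid", that the NVSwitch devices are visible from inside your VM, and that Fabric Manager runs as the nvidia-fabricmanager systemd service. None of these is confirmed by this article for every Crusoe VM type. The function names filter_sxid and filter_xid are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: split kernel log lines into fabric (SXid) and GPU (Xid) events.
set -u

# Print only lines mentioning SXid (NVSwitch / fabric layer), reading stdin.
filter_sxid() { grep 'SXid' || true; }

# Print only GPU-side Xid lines. -w requires "Xid" to be a whole word,
# so "SXid" lines are excluded (a plain "grep Xid" would match both).
filter_xid() { grep -w 'Xid' || true; }

# Only run on a machine that has the NVIDIA tools installed.
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "== SXid (fabric) events =="
  sudo dmesg | filter_sxid
  echo "== Xid (GPU) events =="
  sudo dmesg | filter_xid
  # Assumed service name; the command fails harmlessly if it is absent.
  systemctl status nvidia-fabricmanager --no-pager || true
fi
```

An empty SXid section means no fabric events are present in the current kernel ring buffer. It does not rule out events that have already rotated out of the buffer.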

Step 3: Submit Logs to Crusoe Support

Attach nvidia-bug-report.log.gz, together with the output of the ECC and Xid commands, to a support ticket. For guidance on interpreting the bug report output, including ECC error counts and row remapping state, see How-To Interpret NVIDIA Bug Report Output.
