Introduction
ℹ️ Note: If the Crusoe Watch Agent is installed on your VM (version vm-v1.0.3 or later), the preferred method for capturing NVIDIA logs is via the Crusoe Cloud Console. See How-To Capture NVIDIA Logs via Command Center for the recommended path.
This article covers manual log capture for standalone VMs where the Crusoe Watch Agent is not installed, or as a fallback when in-console generation is unavailable.
When a GPU error occurs, three data sources are needed for triage: the NVIDIA bug report (a comprehensive snapshot of driver, hardware, and kernel state), the ECC error counters from nvidia-smi, and Xid error codes from the kernel log. Each targets a different layer of the NVIDIA stack and can fail independently — capturing all three gives Crusoe teams what they need to diagnose and resolve hardware issues quickly.
Prerequisites
- SSH access to the Crusoe VM, with sudo privileges (the bug report script is run with sudo)
Instructions
Step 1: Capture the NVIDIA Bug Report, ECC State, and Xid Errors
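Whenever a GPU error occurs, capture the NVIDIA bug report, query ECC state, and retrieve Xid errors from the kernel log. By default, nvidia-bug-report.sh writes its output to nvidia-bug-report.log.gz in the current directory; the other two commands print to the terminal.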
sudo nvidia-bug-report.sh
nvidia-smi -q -d ECC
dmesg | grep Xid
ℹ️ Note: If nvidia-bug-report.sh hangs, there may be a communication failure between the NVIDIA client tools and the nvidia.ko kernel driver. In this case, run with --safe-mode to bypass the hung driver interface:
sudo nvidia-bug-report.sh --safe-mode
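The three Step 1 commands can also be wrapped so that everything lands in one directory, ready to attach to a ticket. The following is a minimal sketch, not an official Crusoe or NVIDIA tool: the function name capture_gpu_logs, the directory naming scheme, and the file names ecc.txt and xid.txt are all assumptions made for illustration.

```shell
# Hypothetical helper: capture_gpu_logs is not part of any Crusoe or NVIDIA tooling.
# It runs the three Step 1 commands and stores their output in one directory.
capture_gpu_logs() {
    # Assumed naming scheme; pass a directory as $1 to override it.
    out_dir="${1:-gpu-logs-$(date +%Y%m%d-%H%M%S)}"
    mkdir -p "$out_dir" || return 1

    # nvidia-bug-report.sh writes nvidia-bug-report.log.gz into the current
    # directory by default, so run it from inside out_dir.
    (cd "$out_dir" && sudo nvidia-bug-report.sh) >&2 \
        || echo "nvidia-bug-report.sh exited with an error" >&2

    # ECC error counters as reported by nvidia-smi.
    nvidia-smi -q -d ECC > "$out_dir/ecc.txt" 2>&1

    # Xid lines from the kernel log. grep exits non-zero when nothing matches.
    # On systems that restrict dmesg to root, this may need sudo.
    dmesg | grep Xid > "$out_dir/xid.txt" || echo "no Xid lines found" >&2

    echo "Logs saved in $out_dir"
}
```

After sourcing the function into a shell session, run capture_gpu_logs with no arguments to write into a timestamped directory under the current working directory. If the bug report step hangs, interrupt it and fall back to the --safe-mode command shown above.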
Step 2: Check for NVSwitch / Fabric Layer Errors
The NVLink and NVSwitch fabric layer uses its own SXid error code stack, separate from the GPU-side Xid codes. SXid errors are reported through NVIDIA Fabric Manager and indicate fabric-level failures rather than per-GPU hardware faults. NVIDIA's Fabric Manager documentation includes the full SXid code reference.
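One way to look for SXid errors from the command line is sketched below. It assumes that SXid messages appear in the kernel log and that Fabric Manager writes to /var/log/fabricmanager.log; that path is an assumption here and depends on the Fabric Manager configuration on your VM. The function name check_sxid is hypothetical.

```shell
# Hypothetical helper: check_sxid is not part of any NVIDIA or Crusoe tooling.
check_sxid() {
    # Assumed log location; pass a different path as $1 if yours differs.
    fm_log="${1:-/var/log/fabricmanager.log}"

    echo "== SXid lines in kernel log =="
    # On systems that restrict dmesg to root, this may need sudo.
    dmesg | grep SXid || echo "(none found)"

    echo "== SXid lines in $fm_log =="
    if [ -r "$fm_log" ]; then
        grep -i sxid "$fm_log" || echo "(none found)"
    else
        echo "(log not readable: Fabric Manager may not be installed, or try with sudo)"
    fi
}
```

Note that the Step 1 command dmesg | grep Xid also matches SXid lines, because "SXid" contains "Xid"; filtering on SXid explicitly separates fabric-level errors from per-GPU ones.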
Step 3: Submit Logs to Crusoe Support
Attach nvidia-bug-report.log.gz, together with the ECC and Xid output from Step 1 and any SXid errors noted in Step 2, to a support ticket. For guidance on interpreting the bug report output (including ECC error counts and row remapping state), see How-To Interpret NVIDIA Bug Report Output.