Capturing an NVIDIA Bug Report
NVIDIA provides debugging and error-handling tools across its driver and software stack. The logs these tools produce can help Crusoe teams ensure a speedy path to resolution. Whenever a GPU error has occurred, it is important to retrieve an NVIDIA bug report, a query of nvidia-smi, and any Xid errors from the dmesg logs. You can get the logs by running:
sudo nvidia-bug-report.sh
nvidia-smi -q -d ECC
dmesg | grep Xid
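When the kernel log is long, it can help to reduce the grep output to the distinct Xid codes before attaching it to a ticket. The following is a minimal sketch, not an NVIDIA tool: the function name xid_codes is hypothetical, and it assumes the driver's usual message shape of "NVRM: Xid (PCI:<address>): <code>, ...".

```shell
#!/usr/bin/env bash
# Print the distinct Xid codes found in kernel log text read from stdin.
# Assumption: each message looks like "NVRM: Xid (PCI:<address>): <code>, ...".
xid_codes() {
    # Keep only the number that follows "): ", then de-duplicate numerically.
    sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' | sort -n -u
}

# Usage on a GPU node:
#   sudo dmesg | xid_codes
```

The full dmesg output should still be kept, since the surrounding lines often carry the context needed for diagnosis.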
If the bug report hangs, there may be a communication error in the NVIDIA driver itself, in which case the client tools cannot communicate with the nvidia.ko kernel driver. If this is the case, run the command with --safe-mode.
sudo nvidia-bug-report.sh --safe-mode
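The hang-then-retry step can be sketched as a small wrapper. This is an illustration under stated assumptions: the function name collect_bug_report and the 600-second default limit are hypothetical choices, not NVIDIA recommendations, and the wrapper is meant to be run as root. It relies on the coreutils timeout command to stop a hung run.

```shell
#!/usr/bin/env bash
# Run the bug report script with a time limit and retry in safe mode if the
# first attempt does not finish successfully.
# Assumptions: $1 is a time limit in seconds (600 is an arbitrary default),
# $2 is the script to run (defaults to nvidia-bug-report.sh on PATH).
collect_bug_report() {
    local limit="${1:-600}"
    local script="${2:-nvidia-bug-report.sh}"
    if ! timeout "$limit" "$script"; then
        # A timeout or any other non-zero exit falls back to safe mode.
        "$script" --safe-mode
    fi
}

# Usage (as root):
#   collect_bug_report          # default limit
#   collect_bug_report 300      # shorter limit
```

Note that the fallback also triggers when the first run fails for a reason other than a hang, so check the output of both attempts.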
The NVLink and NVSwitch layers have their own SXid error code stack. You can find NVIDIA’s full documentation on their Fabric Manager here.
Known Failure Modes
There are a few known failure modes that automatically qualify the GPU to be degraded and replaced. They can be identified from the nvidia-bug-report.log.
Any SRAM Uncorrectable ECC error count greater than 0, in either Volatile or Aggregate:
ECC Errors
    Volatile
        SRAM Correctable : 0
        SRAM Uncorrectable : 1 <-- known failure mode
        DRAM Correctable : 0
        DRAM Uncorrectable : 0
    Aggregate
        SRAM Correctable : 0
        SRAM Uncorrectable : 2 <-- known failure mode
        DRAM Correctable : 0
        DRAM Uncorrectable : 0
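The SRAM check above can be automated with a short filter. This is a sketch rather than an official tool: the function name has_sram_uncorrectable is hypothetical, and it assumes the counters appear on lines containing "SRAM Uncorrectable" with the count after the last colon, as in the sample above.

```shell
#!/usr/bin/env bash
# Exit 0 if any "SRAM Uncorrectable" counter in the text on stdin is above zero,
# in either the Volatile or the Aggregate block, on any GPU.
has_sram_uncorrectable() {
    awk -F':' '
        /SRAM Uncorrectable/ {
            count = $NF + 0          # non-numeric values such as N/A become 0
            if (count > 0) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Usage on a GPU node:
#   nvidia-smi -q -d ECC | has_sram_uncorrectable && echo "SRAM uncorrectable errors found"
```

The filter does not say which GPU is affected, so the full nvidia-smi output is still needed for the ticket.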
A row remapping failure occurred with no Pending remapped rows:
Remapped Rows
    Correctable Error : 0
    Uncorrectable Error : 0
    Pending : No
    Remapping Failure Occurred : Yes <-- known failure mode
    Bank Remap Availability Histogram
        Max : 639 bank(s)
        High : 0 bank(s)
        Partial : 0 bank(s)
        Low : 0 bank(s)
        None : 1 bank(s)
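The remapping-failure condition shown above can be checked with a one-line filter. This is a minimal sketch: the function name remap_failure_occurred is hypothetical, and the usage lines assume the bug report was written to the current directory as nvidia-bug-report.log.gz.

```shell
#!/usr/bin/env bash
# Exit 0 if the text on stdin reports a row remapping failure on any GPU.
remap_failure_occurred() {
    grep -Eq 'Remapping Failure Occurred[[:space:]]*:[[:space:]]*Yes'
}

# Usage (the file name is an assumption about where the report was saved):
#   gzip -dc nvidia-bug-report.log.gz | remap_failure_occurred \
#       && echo "Row remapping failure found"
```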
If you find either of these errors, please submit the logs with a support ticket so that Crusoe teams can investigate and replace the GPU.
Row Remapping is Pending
This is a special case, where an error has occurred and the GPU is waiting to perform a row remapping event. The output of the nvidia-bug-report may look like:
Remapped Rows
    Correctable Error : 0
    Uncorrectable Error : 1
    Pending : Yes <-- Notable
    Remapping Failure Occurred : No
    Bank Remap Availability Histogram
        Max : 639 bank(s)
        High : 0 bank(s)
        Partial : 1 bank(s)
        Low : 0 bank(s)
        None : 0 bank(s)
To fix this, reset the GPU by running:
nvidia-smi -r
Once the GPU has been reset, reboot the instance to ensure the row remapping was successful. You should see Pending return to “No” and Remapping Failure Occurred also show “No”.
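The post-reboot check described above can be sketched as a filter that looks only inside the Remapped Rows section. The function name row_remap_clean and its exit codes are hypothetical choices for illustration. The sketch assumes each section starts with a "Remapped Rows" line and that its status fields end at the "Bank Remap Availability Histogram" line, as in the sample output.

```shell
#!/usr/bin/env bash
# Exit 0 if every Remapped Rows section on stdin shows Pending and
# Remapping Failure Occurred as "No"; exit 1 if either is "Yes" on any GPU;
# exit 2 if no Remapped Rows section was found at all.
row_remap_clean() {
    awk '
        /Remapped Rows/ { in_section = 1; seen = 1; next }
        in_section && /Bank Remap Availability Histogram/ { in_section = 0 }
        in_section && /(Pending|Remapping Failure Occurred)[ \t]*:[ \t]*Yes/ { dirty = 1 }
        END {
            if (!seen) exit 2
            exit(dirty ? 1 : 0)
        }
    '
}

# Usage after the reset and reboot:
#   nvidia-smi -q | row_remap_clean && echo "Row remapping completed"
```

Restricting the match to the Remapped Rows section avoids false positives from unrelated fields that also use the word Pending.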
If the issue still persists, attach the nvidia-bug-report to the support ticket.