Overview
You are seeing XID errors in the dmesg logs of a VM and want to determine the remediation steps for each XID error.
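To find out which XID codes are present, you can filter the kernel log for NVRM Xid lines. The following is a minimal sketch, not an official tool: list_xids is a hypothetical helper name, and it assumes the log lines follow the "NVRM: Xid (PCI:...): <code>," format shown in the examples in this article.

```shell
# Read kernel log lines on stdin and print "<xid> <pci-address>" for each
# NVRM Xid event. Lines that are not Xid events are ignored.
# list_xids is a hypothetical helper name used for illustration only.
list_xids() {
  sed -n -E 's/.*NVRM: Xid \(PCI:([0-9A-Fa-f:]+)\): ([0-9]+),.*/\2 \1/p'
}

# Typical usage on the VM (reading the kernel log may require root):
#   sudo dmesg | list_xids | sort | uniq -c
```

Each output line pairs an XID code with the PCI address of the GPU that reported it, which you can then match against the table below.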
Prerequisites
XID Errors and Solution
XID Error | Solution |
XID 13 |
- dmesg logs Error Message:
NVRM: Xid (PCI:0003:00:04): 13, pid='<unknown>', name=<unknown>, Graphics SM Warp Exception on (GPC 0, TPC 7, SM 1): Out Of Range Address
- Stop and start the VM to see whether the issue is resolved.
- Debug the application using cuda-gdb or the Compute Sanitizer memcheck tool.
- Run the application with CUDA_DEVICE_WAITS_ON_EXCEPTION=1, then attach to it later with cuda-gdb.
- Run the application again. If you're still noticing XID 13 errors, generate an NVIDIA bug report.
- Reach out to Crusoe Support and provide the bug report.
|
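The XID 13 debugging steps can be sketched as the commands below. This is an illustration under assumptions, not a definitive procedure: the run wrapper and the xid13_* function names are hypothetical, ./my_app stands in for your own CUDA application, and the sketch assumes compute-sanitizer and cuda-gdb from the CUDA Toolkit are on your PATH. By default the commands are only printed; set DRY_RUN=0 to execute them.

```shell
# Print each command; execute it only when DRY_RUN=0 is set.
# (run and the xid13_* names are hypothetical helpers for this sketch.)
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# $1 = path to your CUDA application.
xid13_memcheck() {
  # memcheck looks for out-of-bounds and misaligned device memory accesses.
  run compute-sanitizer --tool memcheck "$1"
}

# $1 = path to your CUDA application.
xid13_wait_for_debugger() {
  # With this variable set, the application waits when a GPU exception
  # occurs so that a debugger can be attached afterwards.
  run env CUDA_DEVICE_WAITS_ON_EXCEPTION=1 "$1"
}

# $1 = PID of the waiting application.
xid13_attach() {
  run cuda-gdb -p "$1"
}
```

For example, xid13_memcheck ./my_app prints the memcheck command line, and DRY_RUN=0 xid13_memcheck ./my_app runs it.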
XID 48 |
- dmesg logs Error Message:
NVRM: Xid (PCI:0003:00:03): 48, pid='<unknown>', name=<unknown>, An uncorrectable double bit error (DBE) has been detected on GPU in the L2 cache at cache 0, slice 2.
According to NVIDIA's documentation: "This event is logged when the GPU detects that an uncorrectable error occurs on the GPU. This is also reported back to the user application. A GPU reset or node reboot is needed to clear this error."
- Perform a GPU reset using the following command:
# nvidia-smi -r
- If the issue persists, perform a VM reset using the following command:
# crusoe compute vms reset <vm-name>
- If the issue persists, stop and start the VM.
- If the issue persists after following the above steps, generate an NVIDIA bug report.
- Reach out to Crusoe Support and provide the bug report.
|
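The GPU reset in the XID 48 steps uses nvidia-smi -r with no target. If you would rather reset only the GPU that logged the error, nvidia-smi also accepts -i with a PCI bus ID. The sketch below derives that command from the dmesg line. It rests on two assumptions that you should verify on your VM: that the GPU is PCI function 0 (so ".0" is appended to the address from the log), and that nvidia-smi accepts the address in the form it appears in dmesg. xid48_reset_cmds is a hypothetical helper name.

```shell
# Read kernel log lines on stdin and, for every XID 48 event, print the
# nvidia-smi command that would reset only the GPU that reported it.
# Assumption: the GPU is PCI function 0, so ".0" is appended to the
# domain:bus:device address taken from the log line.
xid48_reset_cmds() {
  sed -n -E 's/.*NVRM: Xid \(PCI:([0-9A-Fa-f:]+)\): 48,.*/nvidia-smi -r -i \1.0/p' | sort -u
}

# Typical usage on the VM:
#   sudo dmesg | xid48_reset_cmds
```

The helper only prints the commands, so you can review them before running any of them as root. A GPU reset generally requires that no application is still using that GPU.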
XID 79 |
- dmesg logs Error Message:
NVRM: Xid (PCI:0000:14:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
- Stop and start the VM to see whether the issue is resolved.
- Generate an NVIDIA bug report.
- Reach out to Crusoe Support and provide the bug report.
- If you have spare host availability, proceed to stop and start the VM.
- If you do not have any spare capacity, shut down the instance for maintenance and let us know in the support ticket.
|
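After an XID 79 event it can help to confirm how many GPUs the driver still sees, both before and after you stop and start the VM. The following sketch counts the entries in nvidia-smi -L output and compares them with the number you expect. count_listed_gpus and gpus_missing are hypothetical helper names, and the expected count (8 in the usage line) is an assumption that depends on your VM type.

```shell
# Count the GPUs listed on stdin in `nvidia-smi -L` format
# (one line per GPU, each starting with "GPU <index>:").
count_listed_gpus() {
  # grep -c exits non-zero when the count is 0; keep the function successful.
  grep -c -E '^GPU [0-9]+:' || true
}

# Print how many GPUs are missing.
# $1 = expected GPU count; `nvidia-smi -L` output is read from stdin.
gpus_missing() {
  found=$(count_listed_gpus)
  echo $(( $1 - found ))
}

# Typical usage on the VM (8 is an assumed GPU count):
#   nvidia-smi -L | gpus_missing 8
```

A result of 0 means every expected GPU is listed. Any other value, or an error from nvidia-smi itself, is worth including in the support ticket along with the bug report.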
XID 95 |
- dmesg logs Error Message:
NVRM: Xid (PCI:0002:00:04): 95, pid='<unknown>', name=<unknown>, Uncontained: FBHUB. RST: Yes, D-RST: No
- Perform a GPU reset using the following command:
# nvidia-smi -r
- If the issue persists, perform a VM reset using the following command:
# crusoe compute vms reset <vm-name>
- If the issue persists, stop and start the VM.
- If the issue persists after following the above steps, generate an NVIDIA bug report.
- Reach out to Crusoe Support and provide the bug report.
|
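The XID 95 steps form an escalation ladder: try the least disruptive action first and move on only if the error returns. The sketch below maps each step to a command so the order is explicit. Several parts are assumptions: xid95_step is a hypothetical helper name, the crusoe compute vms stop and start subcommands are assumed to exist alongside the reset subcommand shown above, and nvidia-bug-report.sh is the script that ships with the NVIDIA driver. The nvidia-smi and bug report commands run inside the VM as root, while the crusoe commands run wherever the Crusoe CLI is installed.

```shell
# Print the remediation command for a given escalation step (1-4).
# $1 = step number, $2 = VM name.
# The stop/start subcommands in step 3 are an assumption; check the
# Crusoe CLI help output before relying on them.
xid95_step() {
  case "$1" in
    1) echo "nvidia-smi -r" ;;
    2) echo "crusoe compute vms reset $2" ;;
    3) echo "crusoe compute vms stop $2 && crusoe compute vms start $2" ;;
    4) echo "nvidia-bug-report.sh" ;;
    *) echo "unknown step: $1" >&2; return 1 ;;
  esac
}

# Typical usage (my-vm is a placeholder VM name):
#   for step in 1 2 3 4; do xid95_step "$step" my-vm; done
```

After each step, check dmesg again for new XID 95 lines before moving to the next one.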
XID 119 |
- dmesg logs Error Message:
NVRM: Xid (PCI:0003:00:04): 119, pid=2009566, name=nvidia-smi, Timeout after 6s of waiting for RPC response from GPU7 GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
- Generate an NVIDIA bug report.
- Create the file /etc/modprobe.d/nvidia.conf.
- Add options nvidia NVreg_EnableGpuFirmware=0 to the file.
- Update the initramfs images for all installed kernels: update-initramfs -u -k all
|
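The XID 119 configuration change can be scripted as shown in the sketch below. disable_gsp_firmware is a hypothetical helper name. It adds the option line only if it is not already present, so running it twice does not duplicate the entry.

```shell
# Append the option that disables GSP firmware to a modprobe config file,
# unless the exact line is already there. Creates the file if needed.
# $1 = config file path (this article uses /etc/modprobe.d/nvidia.conf).
disable_gsp_firmware() {
  conf="$1"
  line="options nvidia NVreg_EnableGpuFirmware=0"
  # -x matches whole lines only, -F treats the pattern as a fixed string.
  if ! grep -q -x -F "$line" "$conf" 2>/dev/null; then
    printf '%s\n' "$line" >> "$conf"
  fi
}

# Typical usage on the VM, as root:
#   disable_gsp_firmware /etc/modprobe.d/nvidia.conf
#   update-initramfs -u -k all
```

Module options are read when the nvidia kernel module loads, so the change presumably takes effect only after the driver is reloaded or the VM is rebooted.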
Additional Resources
https://docs.nvidia.com/deploy/xid-errors/index.html