Overview
Uncorrectable ECC (error-correcting code) errors occur when the GPU's memory encounters a fault that ECC can't correct on its own. These errors are typically non-fatal but can disrupt normal GPU operations. Possible causes include hardware malfunctions, firmware issues, and transient memory faults. In severe cases, they might indicate a more serious hardware fault.
The symptoms you might observe include:
- Unknown Errors: nvidia-smi may show "Unknown Error" for GPU Link Info and "GPU requires reset" for ECC Mode, indicating that the GPU is in a state it can't properly report or manage.
- Uncorrectable ECC Errors: dmesg logs might display messages about uncorrectable ECC errors, which indicate problems that require attention.
- Service Failures: The NVIDIA Fabric Manager service might fail to start, impacting system operations that rely on NVIDIA's GPU infrastructure.
Symptoms
- NVIDIA-SMI is showing "ERR! ERR! ERR!" for one or more GPUs.
- When querying NVIDIA-SMI, "GPU Link Info" shows "Unknown Error", "ECC Mode" shows "GPU requires reset", and the "ECC Errors" counters show "N/A".
$ nvidia-smi -q
...
GPU 00000003:00:05.0
PCI
Bus : 0x00
Device : 0x05
Domain : 0x0003
Device Id : 0x20B510DE
Bus Id : 00000003:00:05.0
Sub System Id : 0x153310DE
GPU Link Info
PCIe Generation
Max : Unknown Error
Current : Unknown Error
Device Current : Unknown Error
Device Max : Unknown Error
Host Max : N/A
...
ECC Mode
Current : GPU requires reset
Pending : GPU requires reset
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
- Dmesg shows "An uncorrectable ECC error detected".
NVRM: Xid (PCI:0003:00:05): 120, pid='<unknown>', name=<unknown>, GSP task exception: environment call from U-mode (cause:0x8) @ pc:0x568e018, task:1
NVRM: Xid (PCI:0003:00:05): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
- The nvidia-fabricmanager.service systemd unit is failing to start.
$ /usr/bin/systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2024-04-19 17:11:15 UTC; 3 months 11 days ago
Apr 19 17:11:14 vm.us-northcentral1-a.compute.internal systemd[1]: Starting NVIDIA fabric manager service...
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal nv-fabricmanager[2189]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: Failed to start NVIDIA fabric manager service.
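The first two symptoms above can also be checked programmatically. The following is a minimal Python sketch, not an official tool: it scans the text of `nvidia-smi -q` for GPUs reporting "GPU requires reset" or "Unknown Error", and scans dmesg text for NVRM Xid lines. The function names `gpus_needing_reset` and `xid_events` are hypothetical, and the patterns assume the output layout shown in the samples above.

```python
import re

# Assumption: each GPU section in `nvidia-smi -q` starts with a header line
# such as "GPU 00000003:00:05.0", as in the sample output in this article.
GPU_HEADER = re.compile(
    r"^GPU\s+([0-9A-Fa-f]+:[0-9A-Fa-f]{2}:[0-9A-Fa-f]{2}\.[0-9A-Fa-f])$"
)

# Assumption: Xid events appear in dmesg as "NVRM: Xid (PCI:<id>): <code>, ...".
XID_LINE = re.compile(r"NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): (\d+),")


def gpus_needing_reset(smi_query_text):
    """Return bus IDs of GPUs whose section shows a reset or unknown-error state."""
    flagged = []
    current = None  # bus ID of the GPU section currently being read
    for line in smi_query_text.splitlines():
        header = GPU_HEADER.match(line.strip())
        if header:
            current = header.group(1)
            continue
        if current is None:
            continue  # skip the preamble before the first GPU section
        if "GPU requires reset" in line or "Unknown Error" in line:
            if current not in flagged:
                flagged.append(current)
    return flagged


def xid_events(dmesg_text):
    """Return (pci_id, xid_code) tuples for every NVRM Xid line found."""
    return [(m.group(1), int(m.group(2))) for m in XID_LINE.finditer(dmesg_text)]
```

To use it on a live system, the input text could come from `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`, and likewise for `dmesg`. Reading dmesg may require elevated privileges on some systems.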
Resolution
Generally, DRAM correctable and uncorrectable ECC errors are non-fatal to NVIDIA GPUs and can often be resolved by resetting the GPU with NVIDIA-SMI and/or rebooting the VM OS.
- Step 1: NVIDIA-SMI Reset
The nvidia-smi documentation describes the reset option as follows: "Trigger a reset of one or more GPUs. Can be used to clear GPU HW and SW state in situations that would otherwise require a machine reboot. Typically useful if a double bit ECC error has occurred."
$ nvidia-smi -r
- Step 2: Reboot OS
Reboot the VM from within the OS.
$ reboot now
- Step 3: Submit a Support Request
If the issue persists after rebooting the VM, please submit a support request and our team will address the hardware-related issue.
Please also include the nvidia-smi -q output, as well as the contents of dmesg.
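To gather the two outputs requested for a support request, something like the following Python sketch could be used. It is an illustration under stated assumptions rather than a required procedure: the helper names `run_command` and `collect_diagnostics` and the output file names are hypothetical, and it assumes `nvidia-smi` and `dmesg` are on the PATH.

```python
import subprocess
from pathlib import Path


def run_command(argv):
    """Run a command and return its combined stdout and stderr as text."""
    result = subprocess.run(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    return result.stdout


def collect_diagnostics(out_dir, runner=run_command):
    """Save `nvidia-smi -q` and `dmesg` output into out_dir; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # File names are arbitrary choices for this sketch.
    commands = {
        "nvidia-smi-q.txt": ["nvidia-smi", "-q"],
        "dmesg.txt": ["dmesg"],
    }
    written = []
    for filename, argv in commands.items():
        path = out_dir / filename
        path.write_text(runner(argv))
        written.append(path)
    return written
```

Calling `collect_diagnostics("gpu-diagnostics")` would write both files into a `gpu-diagnostics` directory, ready to attach to the request. The `runner` parameter exists only so the command execution can be swapped out, for example when testing without a GPU.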