Last Updated: Oct 20, 2025
Overview
ECC (error-correcting code) errors are reported when the GPU detects bit errors in its memory. Correctable errors are repaired automatically, while uncorrectable errors are ones the GPU can't fix on its own. These errors are typically non-fatal but can disrupt normal GPU operations. They can be caused by hardware malfunctions, firmware issues, or transient faults in the memory. In severe cases, uncorrectable ECC errors might indicate a more serious hardware fault.
The symptoms you might observe include:
- Unknown Errors: nvidia-smi may show "Unknown Error" for GPU Link Info and "GPU requires reset" for ECC Mode, indicating that the GPU is in a state that it can't properly report or manage.
- Uncorrectable ECC Errors: dmesg logs might display messages about uncorrectable ECC errors, which indicate problems that require attention.
- Service Failures: The NVIDIA Fabric Manager service might fail to start, impacting system operations that rely on NVIDIA's GPU infrastructure.
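As a quick first check for the symptoms listed above, the uncorrectable ECC counters can be polled in CSV form. The following is a minimal sketch, not an official procedure: `flag_ecc_rows` is a hypothetical helper name, and it assumes rows of the form `index, count` as produced by the `nvidia-smi --query-gpu` command shown in the comment. Any value that is not a plain zero (a non-zero count, or text such as an N/A marker) is reported.

```shell
#!/usr/bin/env bash
# Sketch: flag GPUs whose uncorrectable ECC counter is non-zero or unreadable.
# Reads "index, count" CSV rows on stdin, for example from:
#   nvidia-smi --query-gpu=index,ecc.errors.uncorrected.volatile.total \
#              --format=csv,noheader
# flag_ecc_rows is a hypothetical helper name, not part of any NVIDIA tool.
flag_ecc_rows() {
  awk -F', *' '
    # Skip blank or malformed rows.
    NF < 2 { next }
    # A healthy row has a plain integer count equal to zero.
    $2 ~ /^[0-9]+$/ && $2 + 0 == 0 { next }
    # Anything else (non-zero count, N/A marker, error text) is reported.
    { printf "GPU %s: uncorrectable ECC count = %s\n", $1, $2 }
  '
}
```

Because the helper only reads stdin, it can also be run against saved output from an affected VM. No output means every listed GPU reported a zero count.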
Symptoms
- NVIDIA-SMI is showing "ERR! ERR! ERR!" for one or more GPUs.
- Querying NVIDIA-SMI, the current "GPU Link Info" shows "Unknown Error", "ECC Mode" shows "GPU requires reset", and the "ECC Errors" counters show "N/A".

    $ nvidia-smi -q
    ...
    GPU 00000003:00:05.0
        PCI
            Bus : 0x00
            Device : 0x05
            Domain : 0x0003
            Device Id : 0x20B510DE
            Bus Id : 00000003:00:05.0
            Sub System Id : 0x153310DE
            GPU Link Info
                PCIe Generation
                    Max : Unknown Error
                    Current : Unknown Error
                    Device Current : Unknown Error
                    Device Max : Unknown Error
                    Host Max : N/A
        ...
        ECC Mode
            Current : GPU requires reset
            Pending : GPU requires reset
        ECC Errors
            Volatile
                SRAM Correctable : N/A
                SRAM Uncorrectable : N/A
                DRAM Correctable : N/A
                DRAM Uncorrectable : N/A
            Aggregate
                SRAM Correctable : N/A
                SRAM Uncorrectable : N/A
                DRAM Correctable : N/A

- Dmesg shows "An uncorrectable ECC error detected".

    NVRM: Xid (PCI:00003:00:05): 120, pid='<unknown>', name=<unknown>, GSP task exception: environment call from U-mode (cause:0x8) @ pc:0x568e018, task:1
    NVMR: Xid (PCI:0003:00:05): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0

- The nvidia-fabricmanager.service systemd unit is failing to start.

    $ /usr/bin/systemctl status nvidia-fabricmanager.service
    ● nvidia-fabricmanager.service - NVIDIA fabric manager service
       Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
       Active: failed (Result: exit-code) since Fri 2024-04-19 17:11:15 UTC; 3 months 11 days ago

    Apr 19 17:11:14 vm.us-northcentral1-a.compute.internal systemd[1]: Starting NVIDIA fabric manager service...
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal nv-fabricmanager[2189]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: Failed to start NVIDIA fabric manager service.
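The first two symptom checks can be scripted. The sketch below is illustrative only: the function names `gpus_needing_attention` and `list_xid_events` are hypothetical, and the patterns are assumptions based solely on the sample output shown in this article, so other driver versions may need adjustments. Both helpers read text on stdin.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers read text on stdin, so they can be tried on saved
# output. The function names are hypothetical, not part of any NVIDIA tool.

# Print the ID of every GPU whose `nvidia-smi -q` section contains
# "GPU requires reset" or "Unknown Error".
gpus_needing_attention() {
  awk '
    # Section headers look like "GPU 00000003:00:05.0" at the start of a line.
    /^GPU [0-9A-Fa-f]+:[0-9A-Fa-f]+:[0-9A-Fa-f]+\.[0-9A-Fa-f]+/ {
      gpu = $2; seen = 0; next
    }
    # Report each affected GPU once, on the first matching line.
    gpu != "" && !seen && /GPU requires reset|Unknown Error/ {
      print gpu; seen = 1
    }
  '
}

# Print "<PCI address> <Xid code>" for every Xid message in kernel log text.
# The pattern keys on "Xid (PCI:...)" rather than on the message prefix.
list_xid_events() {
  sed -n -E 's/.*Xid \(PCI:([0-9A-Fa-f:.]+)\): ([0-9]+),.*/\1 \2/p'
}
```

Typical usage would be `nvidia-smi -q | gpus_needing_attention` and `dmesg | list_xid_events` (reading the kernel log may require root). For the log excerpt above, the second helper would list Xid codes 120 and 140.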
Resolution
Generally, DRAM correctable and uncorrectable ECC errors are non-fatal to NVIDIA GPUs and can often be resolved by an NVIDIA-SMI reset, a reboot of the VM OS, or both.
- Step 1: NVIDIA-SMI Reset
  This command triggers a reset of one or more GPUs. It can be used to clear GPU hardware and software state in situations that would otherwise require a machine reboot. It is typically useful after a double-bit ECC error has occurred.

    $ nvidia-smi -r

- Step 2: Reboot OS
  Reboot the VM from within the OS.

    $ reboot now

- Step 3: Submit a Support Request
  If the issue persists after rebooting the VM, please Submit a Support Request and our team will address the hardware-related issue.
Note: Please be sure to also include the nvidia-smi -q output, as well as the contents of dmesg.
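To make the support request easier to assemble, the outputs mentioned in the note can be saved to files in one pass. This is a minimal sketch under stated assumptions: `collect_gpu_diagnostics` and the file names are hypothetical choices, and including the Fabric Manager status is an addition based on the symptoms described earlier rather than something the note asks for.

```shell
#!/usr/bin/env bash
# Sketch: gather diagnostic output into one directory for a support request.
# collect_gpu_diagnostics and the file names below are hypothetical choices.
collect_gpu_diagnostics() {
  local outdir="${1:-gpu-diagnostics}"
  mkdir -p "$outdir" || return 1

  # Capture stdout and stderr together. If a command fails, the file still
  # records the failure, which is itself useful information.
  nvidia-smi -q > "$outdir/nvidia-smi-q.txt" 2>&1 \
    || echo "nvidia-smi -q exited non-zero" >> "$outdir/nvidia-smi-q.txt"

  dmesg > "$outdir/dmesg.txt" 2>&1 \
    || echo "dmesg exited non-zero (root may be required)" >> "$outdir/dmesg.txt"

  # systemctl status exits non-zero for failed or inactive units, so the
  # marker line below is expected when the service is not running.
  systemctl status nvidia-fabricmanager.service --no-pager \
    > "$outdir/fabricmanager-status.txt" 2>&1 \
    || echo "systemctl status exited non-zero" >> "$outdir/fabricmanager-status.txt"

  echo "Diagnostics written to $outdir"
}
```

Calling `collect_gpu_diagnostics /tmp/gpu-diag` (the path is only an example) writes three text files that can be attached to the request. Running it before the reset or reboot in Steps 1 and 2 preserves the state of the failing GPU.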