Uncorrectable ECC Error Detected: GPU Requires Reset

Last Updated: Oct 20, 2025

Overview

Uncorrectable ECC errors occur when the GPU's memory encounters a fault that the error-correcting code (ECC) logic can't correct on its own. These errors are typically non-fatal, but they disrupt normal GPU operation. Possible causes include hardware malfunctions, firmware issues, and transient faults in the memory. In severe cases, uncorrectable ECC errors might indicate a more serious hardware fault.

The symptoms you might observe include:

Unknown Errors: nvidia-smi may show "Unknown Error" for GPU Link Info and "GPU requires reset" for ECC Mode, indicating that the GPU is in a state it can't properly report or manage.
Uncorrectable ECC Errors: dmesg might display NVRM Xid messages about uncorrectable ECC errors.
Service Failures: The NVIDIA Fabric Manager service might fail to start, impacting system operations that rely on NVIDIA's GPU infrastructure.

Symptoms

nvidia-smi shows "ERR! ERR! ERR!" for one or more GPUs.

In the output of nvidia-smi -q, "GPU Link Info" shows "Unknown Error", "ECC Mode" shows "GPU requires reset", and the "ECC Errors" counters show "N/A".

$ nvidia-smi -q
...
GPU 00000003:00:05.0
    PCI
        Bus : 0x00
        Device : 0x05
        Domain : 0x0003
        Device Id : 0x20B510DE
        Bus Id : 00000003:00:05.0
        Sub System Id : 0x153310DE
        GPU Link Info
            PCIe Generation
                Max : Unknown Error
                Current : Unknown Error
                Device Current : Unknown Error
                Device Max : Unknown Error
                Host Max : N/A
...
    ECC Mode
        Current : GPU requires reset
        Pending : GPU requires reset
    ECC Errors
        Volatile
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
            DRAM Uncorrectable : N/A
        Aggregate
            SRAM Correctable : N/A
            SRAM Uncorrectable : N/A
            DRAM Correctable : N/A
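
On a VM with several GPUs, this check can be scripted. The following is a minimal sketch, not an official NVIDIA tool: it assumes the nvidia-smi -q layout shown above, where each section starts with an unindented "GPU <bus id>" line, and the function name gpus_needing_reset is hypothetical.

```shell
# gpus_needing_reset: read `nvidia-smi -q` text on stdin and print the bus ID
# of every GPU whose section contains "GPU requires reset".
# Hypothetical helper name; assumes the report layout shown in this article.
gpus_needing_reset() {
    awk '
        # An unindented "GPU <hex>:" line starts a new per-GPU section.
        /^GPU [0-9A-Fa-f]+:/ { gpu = $2 }
        # Report each affected GPU once, even if the phrase repeats.
        /GPU requires reset/ && gpu != "" && !(gpu in seen) {
            seen[gpu] = 1
            print gpu
        }
    '
}

# Usage on the affected VM:
#   nvidia-smi -q | gpus_needing_reset
```

An empty result only means the phrase was not found; it does not prove that the GPUs are healthy.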

Dmesg shows "An uncorrectable ECC error detected".

NVRM: Xid (PCI:00003:00:05): 120, pid='<unknown>', name=<unknown>, GSP task exception: environment call from U-mode (cause:0x8) @ pc:0x568e018, task:1
NVRM: Xid (PCI:0003:00:05): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
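
To pull the Xid events out of a long kernel log, a small filter can help. This is a sketch that assumes the "NVRM: Xid (PCI:<address>): <code>," line format seen in the messages above; the function name xid_events is hypothetical.

```shell
# xid_events: read kernel log text on stdin and print "<PCI address> <Xid code>"
# for every NVRM Xid line. Hypothetical helper; assumes the line format
# "NVRM: Xid (PCI:<address>): <code>, ..." shown in this article.
xid_events() {
    sed -n -E 's/.*NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): ([0-9]+),.*/\1 \2/p'
}

# Usage on the affected VM (reading dmesg may require root):
#   sudo dmesg | xid_events
```

In the log excerpt above, the message "An uncorrectable ECC error detected" is reported with Xid code 140.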

The nvidia-fabricmanager.service systemd unit is failing to start.

$ /usr/bin/systemctl status nvidia-fabricmanager.service
● nvidia-fabricmanager.service - NVIDIA fabric manager service
     Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
     Active: failed (Result: exit-code) since Fri 2024-04-19 17:11:15 UTC; 3 months 11 days ago

Apr 19 17:11:14 vm.us-northcentral1-a.compute.internal systemd[1]: Starting NVIDIA fabric manager service...
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal nv-fabricmanager[2189]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: Failed to start NVIDIA fabric manager service.
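
For a quick, script-friendly version of this check, systemctl is-active prints the unit state as a single word. The wrapper below is a minimal sketch: the function name fabricmanager_state is hypothetical, and it reports "unknown" when systemctl is unavailable or prints nothing.

```shell
# fabricmanager_state: print "<unit>: <state>" for the Fabric Manager unit.
# Hypothetical helper name; the unit name is the one shown in this article.
fabricmanager_state() {
    unit="nvidia-fabricmanager.service"
    # `systemctl is-active` prints one word (active, inactive, failed, ...)
    # and exits non-zero unless the unit is active, hence the `|| true`.
    state="$(systemctl is-active "$unit" 2>/dev/null || true)"
    printf '%s: %s\n' "$unit" "${state:-unknown}"
}

# Usage on the affected VM:
#   fabricmanager_state
```

When the state is "failed", journalctl -u nvidia-fabricmanager.service shows the log lines for the unit, like those quoted above.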

Resolution

Generally, DRAM correctable and uncorrectable ECC errors are non-fatal to NVIDIA GPUs and can often be resolved by resetting the GPU with nvidia-smi, rebooting the VM OS, or both. Work through the following steps in order.

Step 1: NVIDIA-SMI Reset

This command triggers a reset of one or more GPUs. It can clear GPU hardware and software state in situations that would otherwise require a machine reboot, and is typically useful after a double-bit ECC error. The reset requires root privileges, and no application may be using the GPU while it runs.
```
$ sudo nvidia-smi -r
```
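
Without further options, nvidia-smi -r resets all GPUs. To limit the reset to specific devices, add the -i option, which accepts a GPU index, UUID, or PCI bus ID. The sketch below only prints the commands so they can be reviewed before anything is run. The function name reset_commands is hypothetical, and the printed commands assume sudo is available.

```shell
# reset_commands: read GPU identifiers (one per line) on stdin and print one
# targeted reset command per GPU. Nothing is executed (dry run).
# Hypothetical helper name; assumes sudo is used to obtain root.
reset_commands() {
    while IFS= read -r id; do
        # Skip blank lines so that no malformed command is printed.
        if [ -n "$id" ]; then
            printf 'sudo nvidia-smi -r -i %s\n' "$id"
        fi
    done
}

# Usage: review the output, then run the printed commands by hand.
#   echo "00000003:00:05.0" | reset_commands
```
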
Step 2: Reboot OS

If the GPU reset fails or the symptoms remain, reboot the VM from within the OS.
```
$ sudo reboot
```
Step 3: Submit a Support Request

If the issue persists after rebooting the VM, submit a support request so that our team can address the underlying hardware issue.

Note: Include the output of nvidia-smi -q and the contents of dmesg in the request.
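
Both outputs can be saved to files in one step and attached to the request. The helper below is a minimal sketch under stated assumptions: the function name collect_gpu_diagnostics and the file names are arbitrary choices, not a format the support team requires.

```shell
# collect_gpu_diagnostics DIR: save `nvidia-smi -q` and `dmesg` output in DIR.
# Hypothetical helper; file names are arbitrary choices for this sketch.
collect_gpu_diagnostics() {
    dir="${1:-.}"
    mkdir -p "$dir" || return 1
    # Capture stderr as well and continue if a command fails, so that a
    # partial bundle is still produced on a degraded system.
    nvidia-smi -q > "$dir/nvidia-smi-q.txt" 2>&1 || true
    dmesg > "$dir/dmesg.txt" 2>&1 || true
    printf 'Saved diagnostics to %s\n' "$dir"
}

# Usage on the affected VM (run as root if dmesg is restricted):
#   collect_gpu_diagnostics ./gpu-diagnostics
```
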

Additional Resources

NVIDIA GPU Debug Guidelines
