
Uncorrectable ECC Error Detected: GPU Requires Reset

Matt Roark

Last Updated: Oct 20, 2025

Overview

Uncorrectable ECC errors occur when GPU memory encounters a fault that its error-correcting code (ECC) logic cannot repair. These errors are typically non-fatal, but they can disrupt normal GPU operation until the GPU is reset. Possible causes include hardware malfunctions, firmware issues, and transient memory faults. In severe cases, uncorrectable ECC errors indicate a more serious hardware fault.

The symptoms you might observe include:

  • Unknown Errors: nvidia-smi -q may show "Unknown Error" under GPU Link Info and "GPU requires reset" under ECC Mode, indicating that the GPU cannot properly report or manage its own state.
  • Uncorrectable ECC Errors: dmesg may show Xid messages that report an uncorrectable ECC error.
  • Service Failures: The NVIDIA Fabric Manager service may fail to start, which affects workloads that depend on it.

Symptoms

  • nvidia-smi shows "ERR! ERR! ERR!" for one or more GPUs.
  • In the output of nvidia-smi -q, "GPU Link Info" shows "Unknown Error", "ECC Mode" shows "GPU requires reset", and the "ECC Errors" counters show "N/A".

    $ nvidia-smi -q
    ...
    GPU 00000003:00:05.0
        PCI
            Bus : 0x00
            Device : 0x05
            Domain : 0x0003
            Device Id : 0x20B510DE
            Bus Id : 00000003:00:05.0
            Sub System Id : 0x153310DE
            GPU Link Info
                PCIe Generation
                    Max : Unknown Error
                    Current : Unknown Error
                    Device Current : Unknown Error
                    Device Max : Unknown Error
                    Host Max : N/A
    ...
        ECC Mode
            Current : GPU requires reset
            Pending : GPU requires reset
        ECC Errors
            Volatile
                SRAM Correctable : N/A
                SRAM Uncorrectable : N/A
                DRAM Correctable : N/A
                DRAM Uncorrectable : N/A
            Aggregate
                SRAM Correctable : N/A
                SRAM Uncorrectable : N/A
                DRAM Correctable : N/A
  • dmesg shows "An uncorrectable ECC error detected".

    NVRM: Xid (PCI:0003:00:05): 120, pid='<unknown>', name=<unknown>, GSP task exception: environment call from U-mode (cause:0x8) @ pc:0x568e018, task:1
    NVRM: Xid (PCI:0003:00:05): 140, pid='<unknown>', name=<unknown>, An uncorrectable ECC error detected (possible firmware handling failure) DRAM:-1840691462, LTC:0, MMU:0, PCIE:0
  • The nvidia-fabricmanager.service systemd unit is failing to start.

    /usr/bin/systemctl status nvidia-fabricmanager.service
    ● nvidia-fabricmanager.service - NVIDIA fabric manager service
    Loaded: loaded (/lib/systemd/system/nvidia-fabricmanager.service; enabled; vendor preset: enabled)
    Active: failed (Result: exit-code) since Fri 2024-04-19 17:11:15 UTC; 3 months 11 days ago
    
    Apr 19 17:11:14 vm.us-northcentral1-a.compute.internal systemd[1]: Starting NVIDIA fabric manager service...
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal nv-fabricmanager[2189]: request to query NVSwitch device information from NVSwitch driver failed with error:WARNING Nothing to do [NV_WARN_NOTHING_TO_DO]
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Control process exited, code=exited, status=1/FAILURE
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: nvidia-fabricmanager.service: Failed with result 'exit-code'.
    Apr 19 17:11:15 vm.us-northcentral1-a.compute.internal systemd[1]: Failed to start NVIDIA fabric manager service.
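The first two symptoms above can be checked from saved command output. The following is a minimal sketch, not an official tool: the function names gpus_requiring_reset and uncorrectable_ecc_xids are hypothetical, and the parsing assumes the nvidia-smi -q and dmesg layouts shown in this article, which may differ between driver versions.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are hypothetical, and both read from stdin.

# Print the bus ID of each GPU whose section of `nvidia-smi -q` output
# contains "GPU requires reset" (assumes the layout shown in this article).
gpus_requiring_reset() {
    awk '
        # A section header looks like "GPU 00000003:00:05.0".
        /^[ \t]*GPU [0-9A-Fa-f]+:[0-9A-Fa-f]+:[0-9A-Fa-f]+\.[0-9A-Fa-f]/ { gpu = $2; seen = 0 }
        # Report each affected GPU once, even if several fields match.
        /GPU requires reset/ && gpu != "" && !seen { print gpu; seen = 1 }
    '
}

# Print kernel log lines in which an Xid message reports an
# uncorrectable ECC error.
uncorrectable_ecc_xids() {
    grep 'Xid.*uncorrectable ECC error'
}

# Example usage on an affected VM:
#   nvidia-smi -q | gpus_requiring_reset
#   sudo dmesg | uncorrectable_ecc_xids
```

For the third symptom, systemctl is-failed nvidia-fabricmanager.service prints "failed" when the unit is in the failed state.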

Resolution

Generally, correctable and uncorrectable DRAM ECC errors are non-fatal to NVIDIA GPUs and can be resolved by resetting the GPU with nvidia-smi, rebooting the VM's OS, or both. Work through the following steps in order.

  1. Step 1: NVIDIA-SMI Reset

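    The nvidia-smi -r command triggers a reset of one or more GPUs. It clears GPU hardware and software state in situations that would otherwise require a machine reboot, and is typically useful after a double-bit ECC error. The reset requires root privileges and fails if any application is still using the GPU, so stop GPU workloads first.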

    $ sudo nvidia-smi -r
  2. Step 2: Reboot OS

    Reboot the VM from within the OS.

    $ sudo reboot
  3. Step 3: Submit a Support Request

    If the issue persists after rebooting the VM, submit a support request and our team will address the underlying hardware issue.

    Note: Include the output of nvidia-smi -q and the contents of dmesg in your request.
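The outputs requested for a support request can be gathered in one pass. This is a hedged sketch rather than a supported tool: the function name collect_gpu_diagnostics and the output file names are hypothetical, and it simply saves the output of the commands shown earlier in this article.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the file names are hypothetical.
# Saves the command outputs requested above into a single directory.
collect_gpu_diagnostics() {
    # Use the directory given as $1, or a UTC-timestamped default.
    local outdir="${1:-gpu-diagnostics-$(date -u +%Y%m%dT%H%M%SZ)}"
    mkdir -p "$outdir" || return 1

    # Capture stdout and stderr of each command. A failing command does
    # not stop the collection, so a partial bundle is still produced and
    # the error message itself ends up in the corresponding file.
    nvidia-smi -q > "$outdir/nvidia-smi-q.txt" 2>&1 || true
    dmesg > "$outdir/dmesg.txt" 2>&1 || true
    systemctl status nvidia-fabricmanager.service --no-pager \
        > "$outdir/fabricmanager-status.txt" 2>&1 || true

    # Print the directory so the caller knows what to attach.
    echo "$outdir"
}

# Example usage (define the function in a root shell, because dmesg may
# require elevated privileges):
#   collect_gpu_diagnostics /tmp/gpu-diagnostics
```

Attach the files from the printed directory to the support request.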
