Introduction
If you've run the nvidia-smi command and your instance appears to be missing a GPU, this typically indicates an underlying hardware issue, such as a dropped PCIe link or a failing GPU, or a driver-related problem. These issues can degrade performance or, in some cases, cause operations to hang. This article outlines the troubleshooting steps to follow when your instance appears to be missing a GPU.
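As a quick way to confirm that a GPU is actually missing, the sketch below compares the number of GPUs listed by nvidia-smi -L with the number you expect for your instance type. The helper names count_gpus and check_gpu_count and the EXPECTED_GPUS variable are illustrative assumptions, not part of any Crusoe or NVIDIA tooling, and the default of 8 is only a placeholder.

```shell
#!/usr/bin/env bash
# Sketch: compare the GPUs visible to the driver with the expected count.
# count_gpus and check_gpu_count are hypothetical helper names.

# Count lines of the form "GPU 0: <name> (UUID: ...)" read from stdin,
# which is the format printed by `nvidia-smi -L`.
count_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Usage: check_gpu_count <expected> <actual>
# Returns non-zero when fewer GPUs are visible than expected.
check_gpu_count() {
  local expected="$1" actual="$2"
  if [ "$actual" -lt "$expected" ]; then
    echo "MISSING: expected ${expected} GPUs, driver reports ${actual}"
    return 1
  fi
  echo "OK: driver reports ${actual} GPUs"
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # EXPECTED_GPUS is an assumption; set it to match your instance type.
  actual="$(nvidia-smi -L 2>/dev/null | count_gpus)"
  check_gpu_count "${EXPECTED_GPUS:-8}" "$actual" || true
fi
```

For example, running the script with EXPECTED_GPUS=4 on a four-GPU instance type reports a mismatch if the driver lists fewer than four devices.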
Prerequisites
- Crusoe Cloud account
- Access to the Crusoe Cloud UI or Crusoe CLI
Instructions
- First, immediately back up any important data you don't want to lose, because the instance can become unresponsive.
- Run the following command to reset the GPUs and clear any ECC errors that could also be affecting the instance, then reboot:
$ sudo nvidia-smi -r
$ sudo reboot
- If the GPU is still missing from the instance, generate the NVIDIA bug report:
$ sudo nvidia-bug-report.sh
- Refer to this article for further assistance on pulling this report: How-To Capture NVIDIA Logs
- Transfer this file to your local machine using a tool such as scp or rsync.
- You will need it when creating a ticket with Crusoe Support.
- Next, perform a VM reset or a STOP / START operation. A VM reset preserves any ephemeral data stored on your instance, but it is best practice to back up any important data first in case the operation hangs. A STOP operation deletes all ephemeral data stored on the instance.
$ crusoe compute vms reset <vm-name>
- Note: If the underlying host is experiencing a hardware-related issue, this operation may hang, leaving the instance inaccessible.
$ crusoe compute vms stop <vm-name>
$ crusoe compute vms start <vm-name>
- If the steps above do not resolve the issue, contact Crusoe Support and provide the bug report.
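Before opening a ticket, it can help to confirm that the bug report was actually written. The sketch below checks for the report file and prints an scp command to run from your local machine. It assumes the default output name nvidia-bug-report.log.gz in the directory where the script was run. The helper names report_ready and print_copy_hint are hypothetical, and <user> and <vm-ip> are placeholders for your own login and VM address.

```shell
#!/usr/bin/env bash
# Sketch: confirm the NVIDIA bug report exists before copying it off the VM.
# report_ready and print_copy_hint are hypothetical helper names.

# nvidia-bug-report.sh writes nvidia-bug-report.log.gz to the current
# directory by default; override REPORT if you saved it elsewhere.
REPORT="${REPORT:-$(pwd)/nvidia-bug-report.log.gz}"

# Succeeds only if the report file exists and is not empty.
report_ready() {
  [ -s "$1" ]
}

# Prints an scp command to run from your local machine.
# <user> and <vm-ip> are placeholders, not real values.
print_copy_hint() {
  echo "scp <user>@<vm-ip>:$1 ."
}

if report_ready "$REPORT"; then
  echo "Report found. From your local machine, run:"
  print_copy_hint "$REPORT"
else
  echo "No report found at $REPORT; run: sudo nvidia-bug-report.sh"
fi
```

Checking for a non-empty file guards against attaching a truncated report if the script was interrupted while the instance was unresponsive.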
Additional Resources:
FAQ
1. Why is my instance missing a GPU?
Answer: A GPU is reported as missing when the GPU driver attempts to access the GPU and finds that it is not accessible.
2. Why was I not informed by Crusoe that the GPU is missing?
Answer: In most cases, if an instance is experiencing a true hardware-related issue, Crusoe’s monitoring and alerting systems will detect it, and a Cloud Support Engineer will proactively reach out to notify you. However, some issues may only be observable from within the virtual machine (VM) itself. Because Crusoe does not have visibility inside the VM, it is not automatically alerted in these scenarios.
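Because some failures are only visible from inside the VM, it can be useful to check the kernel log yourself. The NVIDIA driver reports GPU errors there as NVRM: Xid messages. The sketch below extracts the distinct Xid codes from log text. The helper name extract_xids is hypothetical, and reading the kernel log with dmesg may require sudo on some systems.

```shell
#!/usr/bin/env bash
# Sketch: list the distinct NVIDIA Xid codes found in kernel log text.
# extract_xids is a hypothetical helper name.

# Reads log lines on stdin and prints each distinct Xid code once,
# in ascending numeric order. Example of the line format matched:
#   NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
extract_xids() {
  sed -n 's/.*NVRM: Xid ([^)]*): \([0-9][0-9]*\).*/\1/p' | sort -un
}

if command -v dmesg >/dev/null 2>&1; then
  # dmesg may need sudo depending on the kernel.dmesg_restrict setting.
  xids="$(dmesg 2>/dev/null | extract_xids)"
  if [ -n "$xids" ]; then
    echo "Xid codes in kernel log:"
    echo "$xids"
  else
    echo "No NVRM Xid messages found in kernel log"
  fi
fi
```

Including any codes found here in your support ticket, alongside the bug report, gives Crusoe Support the in-VM view it cannot otherwise see.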