Introduction
NVIDIA provides debugging and error-handling tools across its driver and software stack. The logs these tools produce help Crusoe teams reach a resolution quickly. Whenever a GPU error has occurred, it is important to retrieve an NVIDIA bug report. This article walks you through the steps to capture these logs when you are using the NVIDIA GPU Operator on a Kubernetes cluster.
Prerequisites
- A kubeconfig with access to the affected cluster
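As a quick optional check, you can confirm that your kubeconfig can reach the cluster before starting; the kubeconfig path below is a placeholder.

# export KUBECONFIG=/path/to/kubeconfig
# kubectl get nodes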
Step-by-Step Instructions
Step 1: Find the nvidia-driver-daemonset pod running on the affected node
# kubectl get pods -A -o wide | grep <node_name> | grep nvidia-driver-daemonset
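For example, if the affected node were named gpu-node-01 (a hypothetical name), the command would look like the line below. Note the full daemonset pod name in the output; it is needed in the next two steps.

# kubectl get pods -A -o wide | grep gpu-node-01 | grep nvidia-driver-daemonset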
Step 2: Run the nvidia-bug-report.sh script within the pod
# kubectl -n nvidia-gpu-operator exec -it nvidia-driver-daemonset-xxxxx -- nvidia-bug-report.sh
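The script can take a few minutes to complete. Optionally, you can confirm the archive was written before copying it off the pod; this assumes the report landed in the working directory the script ran from, and uses the same placeholder pod name as above.

# kubectl -n nvidia-gpu-operator exec nvidia-driver-daemonset-xxxxx -- ls -lh nvidia-bug-report.log.gz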
Step 3: Copy the generated log file to your local machine
# kubectl -n nvidia-gpu-operator cp nvidia-driver-daemonset-xxxxx:nvidia-bug-report.log.gz ./nvidia-bug-report.log.gz
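After the copy completes, a quick sanity check on the local file helps confirm the archive is intact before attaching it to a support ticket.

# ls -lh ./nvidia-bug-report.log.gz
# gzip -t ./nvidia-bug-report.log.gz && echo "archive OK"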
Additional Resources
nvidia-bug-report.sh is a script that NVIDIA ships with its drivers to capture the logs relevant to troubleshooting issues with NVIDIA devices.