Introduction
ℹ️ Note: For CMK clusters running version 1.33.4-cmk.31 or later with the Crusoe Watch Agent installed, the preferred method for capturing NVIDIA bug reports is via the Crusoe Cloud Console. See How-To Capture NVIDIA Logs via Command Center for the recommended path.
This article covers manual NVIDIA bug report capture for CMK clusters where the Crusoe Watch Agent is not installed, or as a fallback when in-console generation is unavailable. On a CMK cluster running the NVIDIA GPU Operator, the nvidia-bug-report.sh script runs inside the GPU driver pod on the affected node — there is no direct host access.
In addition to the bug report, nvidia-smi provides complementary diagnostic data — GPU general status, driver and CUDA versions, and ECC error counts — that can be useful for triage. See How-To Run nvidia-smi Commands on CMK for steps on running it inside a CMK cluster.
Prerequisites
- Kubeconfig Access to the CMK Cluster
- CMK Cluster Deployed with the nvidia-gpu-operator Add-On (see Manage Your CMK Clusters)
Instructions
Step 1: Find the GPU Driver Pod on the Affected Node
Identify the nvidia-gpu-driver-ubuntu pod running on the node you want to capture logs from:
kubectl get pods -n nvidia-gpu-operator -o wide | grep <node_name> | grep nvidia-gpu-driver-ubuntu
Note the full pod name — you'll need it in the next steps.
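When scripting this step, the node filter can be applied on the server side instead of with grep. The following is a minimal sketch, not Crusoe tooling: the function name find_driver_pod is hypothetical, and it assumes the namespace and pod-name prefix shown above. It relies on kubectl's --field-selector spec.nodeName=... and -o name options.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the GPU driver pod on a given node.
# Assumes the nvidia-gpu-operator namespace and the nvidia-gpu-driver-ubuntu
# pod-name prefix used in this article.
find_driver_pod() {
  local node="$1"
  kubectl get pods -n nvidia-gpu-operator \
    --field-selector "spec.nodeName=${node}" \
    -o name \
    | sed 's|^pod/||' \
    | grep '^nvidia-gpu-driver-ubuntu' \
    | head -n 1
}

# Usage (replace <node_name> with the affected node):
#   DRIVER_POD="$(find_driver_pod <node_name>)"
```

Storing the result in a variable avoids retyping the full pod name in the later steps.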
Step 2: Run the Bug Report Script Inside the Pod
kubectl -n nvidia-gpu-operator exec -it nvidia-gpu-driver-ubuntu<version>-<id> -- nvidia-bug-report.sh
ℹ️ Note: If the script hangs, there may be a communication failure between the NVIDIA client tools and the nvidia.ko kernel driver. Run with --safe-mode:
kubectl -n nvidia-gpu-operator exec -it nvidia-gpu-driver-ubuntu<version>-<id> -- nvidia-bug-report.sh --safe-mode
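If the capture runs unattended, the hang described above can be handled with a time limit and an automatic retry. This is a rough sketch under stated assumptions: run_bug_report is a hypothetical name, the 300-second default is arbitrary rather than a documented value, and the timeout command from GNU coreutils is assumed to be available on the machine running kubectl.

```shell
#!/usr/bin/env bash
# Hypothetical helper: try a normal capture first, then retry with --safe-mode.
# -it is omitted on the assumption that the script needs no terminal input.
run_bug_report() {
  local pod="$1"
  local limit="${2:-300}"   # seconds; arbitrary default, adjust as needed
  if timeout "${limit}" kubectl -n nvidia-gpu-operator exec "${pod}" -- nvidia-bug-report.sh; then
    return 0
  fi
  echo "Normal capture failed or timed out; retrying with --safe-mode" >&2
  kubectl -n nvidia-gpu-operator exec "${pod}" -- nvidia-bug-report.sh --safe-mode
}

# Usage:
#   run_bug_report <pod_name> 600
```

The time limit stops only the local kubectl process; the script may keep running inside the pod, so check for a leftover process if the retry behaves unexpectedly.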
Step 3: Copy the Log File to Your Local Machine
kubectl -n nvidia-gpu-operator cp nvidia-gpu-driver-ubuntu<version>-<id>:nvidia-bug-report.log.gz ./nvidia-bug-report.log.gz
Step 4: Submit Logs to Crusoe Support
Attach nvidia-bug-report.log.gz to a support ticket. For guidance on interpreting the output — including ECC error counts and row remapping state — see How-To Interpret NVIDIA Bug Report Output.
Example
Your GPUs on a specific CMK node are returning errors during training runs. You've identified the affected node as np-7ad4289e-2 and your cluster is running an older CMK version without the Crusoe Watch Agent installed.
First, find the driver pod running on that node:
kubectl get pods -n nvidia-gpu-operator -o wide | grep np-7ad4289e-2 | grep nvidia-gpu-driver-ubuntu
This returns nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj. Run the bug report inside it:
kubectl exec -it -n nvidia-gpu-operator nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj -- nvidia-bug-report.sh
Expected output:
nvidia-bug-report.sh will now collect information about your system and create the file 'nvidia-bug-report.log.gz' in the current directory. It may take several seconds to run. In some cases, it may hang trying to capture data generated dynamically by the Linux kernel and/or the NVIDIA kernel module. While the bug report log file will be incomplete if this happens, it may still contain enough data to diagnose your problem.
If nvidia-bug-report.sh hangs, consider running with the --safe-mode and --extra-system-data command line arguments.
Please include the 'nvidia-bug-report.log.gz' log file when reporting your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com) or by sending email to 'linux-bugs@nvidia.com'.
By delivering 'nvidia-bug-report.log.gz' to NVIDIA, you acknowledge and agree that personal information may inadvertently be included in the output. Notwithstanding the foregoing, NVIDIA will use the output only for the purpose of investigating your reported issue.
Running nvidia-bug-report.sh... complete.
Copy the file locally and verify it was written successfully:
kubectl -n nvidia-gpu-operator cp nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj:nvidia-bug-report.log.gz ./nvidia-bug-report.log.gz
ls -lrt nvidia-bug-report.log.gz
-rw-r-----@ 1 <user> <group> 6810882 2 Oct 19:42 nvidia-bug-report.log.gz
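Listing the file confirms that it exists and shows its size, but not that the archive is intact. As an optional extra check, the sketch below (verify_report is a hypothetical helper name, not part of any NVIDIA or Crusoe tooling) uses gzip -t to test the archive without extracting it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a copied bug report is non-empty and a valid gzip file.
verify_report() {
  local file="${1:-nvidia-bug-report.log.gz}"
  if [ ! -s "${file}" ]; then
    echo "Missing or empty: ${file}" >&2
    return 1
  fi
  # gzip -t tests archive integrity without writing any extracted data.
  if ! gzip -t "${file}" 2>/dev/null; then
    echo "Not a valid gzip archive: ${file}" >&2
    return 1
  fi
  echo "OK: ${file}"
}

# Usage:
#   verify_report ./nvidia-bug-report.log.gz
```

If the check fails, repeating the copy step is a reasonable first move before re-running the capture.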
Attach the file to a support ticket so Crusoe teams can triage the node.