How-To Capture NVIDIA Bug Report in CMK

Introduction

ℹ️ Note: For CMK clusters running version 1.33.4-cmk.31 or later with the Crusoe Watch Agent installed, the preferred method for capturing NVIDIA bug reports is via the Crusoe Cloud Console. See How-To Capture NVIDIA Logs via Command Center for the recommended path.

This article covers manual NVIDIA bug report capture for CMK clusters where the Crusoe Watch Agent is not installed, or as a fallback when in-console generation is unavailable. Because there is no direct host access on a CMK cluster running the NVIDIA GPU Operator, you run the nvidia-bug-report.sh script inside the GPU driver pod on the affected node.

In addition to the bug report, nvidia-smi provides complementary diagnostic data (general GPU status, driver and CUDA versions, and ECC error counts) that can be useful for triage. See How-To Run nvidia-smi Commands on CMK for steps on running it inside a CMK cluster.

Prerequisites

Kubeconfig Access to the CMK Cluster
CMK Cluster Deployed with nvidia-gpu-operator Add-On (see Manage Your CMK Clusters)

Instructions

Step 1: Find the GPU Driver Pod on the Affected Node

Identify the nvidia-gpu-driver-ubuntu pod running on the node you want to capture logs from:

kubectl get pods -n nvidia-gpu-operator -o wide | grep <node_name> | grep nvidia-gpu-driver-ubuntu

Note the full pod name — you'll need it in the next steps.
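
The grep filter above matches the node name anywhere in the line, so a node called np-1 would also match np-10. A minimal sketch of a stricter lookup follows, assuming your kubectl supports the spec.nodeName field selector for pods. The function name find_driver_pod is hypothetical; the namespace and pod-name prefix are the ones used in this article.

```shell
# Hypothetical helper: print the driver pod name(s) scheduled on one node.
# Assumes the namespace and pod-name prefix used elsewhere in this article.
find_driver_pod() {
  node="$1"
  # -o name prints "pod/<name>"; the field selector matches the node exactly.
  kubectl get pods -n nvidia-gpu-operator \
    --field-selector "spec.nodeName=${node}" -o name |
    sed 's|^pod/||' |
    grep '^nvidia-gpu-driver-ubuntu'
}
```

Calling find_driver_pod <node_name> prints the pod name to use in the next steps. It prints nothing and returns a non-zero status if no driver pod is scheduled on that node.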

Step 2: Run the Bug Report Script Inside the Pod

kubectl -n nvidia-gpu-operator exec -it nvidia-gpu-driver-ubuntu<version>-<id> -- nvidia-bug-report.sh

ℹ️ Note: If the script hangs, there may be a communication failure between the NVIDIA client tools and the nvidia.ko kernel driver. Run with --safe-mode:

kubectl -n nvidia-gpu-operator exec -it nvidia-gpu-driver-ubuntu<version>-<id> -- nvidia-bug-report.sh --safe-mode
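
To avoid waiting indefinitely on a hung run, the two commands above can be combined. This is a sketch rather than a supported tool. It assumes GNU coreutils timeout is installed on your workstation, the 300-second default is an arbitrary choice, and the function name run_bug_report is hypothetical. It omits -it because no interactive terminal is needed.

```shell
# Hypothetical helper: run the bug report, retrying with --safe-mode
# only if the first attempt exceeds the time limit.
run_bug_report() {
  pod="$1"
  limit="${2:-300}"   # seconds; arbitrary default, adjust as needed
  rc=0
  timeout "$limit" kubectl -n nvidia-gpu-operator exec "$pod" -- \
    nvidia-bug-report.sh || rc=$?
  # GNU timeout exits with 124 when it had to stop the command.
  if [ "$rc" -eq 124 ]; then
    echo "Timed out after ${limit}s; retrying with --safe-mode" >&2
    rc=0
    timeout "$limit" kubectl -n nvidia-gpu-operator exec "$pod" -- \
      nvidia-bug-report.sh --safe-mode || rc=$?
  fi
  return "$rc"
}
```

Stopping the local kubectl process may not stop the script inside the pod, so treat the retry as a convenience rather than a cleanup mechanism. Any failure other than a timeout is returned unchanged so you can inspect it.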

Step 3: Copy the Log File to Your Local Machine

kubectl -n nvidia-gpu-operator cp nvidia-gpu-driver-ubuntu<version>-<id>:nvidia-bug-report.log.gz ./nvidia-bug-report.log.gz
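
A truncated copy is easy to miss until someone tries to open it. A minimal sketch of a local integrity check follows, assuming gzip is installed on your workstation. The function name check_report is hypothetical. It confirms only that the file is non-empty and that the gzip stream decompresses cleanly; it does not validate the report's contents.

```shell
# Hypothetical helper: succeed only if the file exists, is non-empty,
# and passes gzip's built-in integrity test.
check_report() {
  file="${1:-nvidia-bug-report.log.gz}"
  if [ ! -s "$file" ]; then
    echo "missing or empty: $file" >&2
    return 1
  fi
  gzip -t "$file"
}
```

Run check_report ./nvidia-bug-report.log.gz after the copy. If it returns a non-zero status, repeat the copy before attaching the file to a ticket.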

Step 4: Submit Logs to Crusoe Support

Attach nvidia-bug-report.log.gz to a support ticket. For guidance on interpreting the output — including ECC error counts and row remapping state — see How-To Interpret NVIDIA Bug Report Output.

Example

Your GPUs on a specific CMK node are returning errors during training runs. You've identified the affected node as np-7ad4289e-2 and your cluster is running an older CMK version without the Crusoe Watch Agent installed.

First, find the driver pod running on that node:

kubectl get pods -n nvidia-gpu-operator -o wide | grep np-7ad4289e-2 | grep nvidia-gpu-driver-ubuntu

This returns nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj. Run the bug report inside it:

kubectl -n nvidia-gpu-operator exec -it nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj -- nvidia-bug-report.sh

Expected output:

nvidia-bug-report.sh will now collect information about your
system and create the file 'nvidia-bug-report.log.gz' in the current
directory. It may take several seconds to run. In some
cases, it may hang trying to capture data generated dynamically
by the Linux kernel and/or the NVIDIA kernel module. While
the bug report log file will be incomplete if this happens, it
may still contain enough data to diagnose your problem.

If nvidia-bug-report.sh hangs, consider running with the --safe-mode
and --extra-system-data command line arguments.

Please include the 'nvidia-bug-report.log.gz' log file when reporting
your bug via the NVIDIA Linux forum (see forums.developer.nvidia.com)
or by sending email to 'linux-bugs@nvidia.com'.

By delivering 'nvidia-bug-report.log.gz' to NVIDIA, you acknowledge
and agree that personal information may inadvertently be included in
the output. Notwithstanding the foregoing, NVIDIA will use the
output only for the purpose of investigating your reported issue.

Running nvidia-bug-report.sh... complete.

Copy the file locally and verify it was written successfully:

kubectl -n nvidia-gpu-operator cp nvidia-gpu-driver-ubuntu22.04-6865f88d94-bf6hj:nvidia-bug-report.log.gz ./nvidia-bug-report.log.gz

ls -lrt nvidia-bug-report.log.gz
-rw-r-----@ 1 <user> <group> 6810882 2 Oct 19:42 nvidia-bug-report.log.gz

Attach the file to a support ticket so Crusoe teams can triage the node.

