How-To Collect Debugging Information on AMD GPU Nodes

Introduction

This article shows you how to collect GPU debugging information on AMD Instinct nodes (MI300X, MI355X) so it can be attached to a support ticket. It covers two environments: standalone AMD VMs and CMK NodePool VMs running the AMD GPU Operator. It is the AMD counterpart to the NVIDIA workflow (nvidia-smi, nvidia-bug-report.sh, dmesg).

On AMD nodes, the tool for this is amd-smi (the AMD System Management Interface), which ships with ROCm.

Prerequisites

An AMD Instinct VM (MI300X or MI355X) or a CMK cluster with an AMD GPU NodePool
For standalone VMs, create the VM from an AMD ROCm image. These images ship ROCm — including amd-smi at /opt/rocm/bin — and the amdgpu driver preinstalled, so the commands in this article work without any further setup.
sudo access on the node (required for kernel logs and the full log bundle)
For CMK, the AMD GPU Operator installed on the cluster. See How-To Install AMD GPU Operator on a Crusoe Managed Kubernetes Cluster.
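
As a quick pre-flight check on a standalone VM, the sketch below looks for amd-smi on PATH and then falls back to the /opt/rocm/bin location mentioned above. The helper name find_amd_smi is hypothetical and not part of ROCm; treat it as an illustration rather than a required step.

```shell
# find_amd_smi [rocm_bin_dir]
# Hypothetical helper: print the path to amd-smi, or fail with a message.
# The default directory is the ROCm location named in Prerequisites.
find_amd_smi() {
  rocm_bin="${1:-/opt/rocm/bin}"
  if command -v amd-smi >/dev/null 2>&1; then
    # Already on PATH: print where the shell resolves it
    command -v amd-smi
  elif [ -x "$rocm_bin/amd-smi" ]; then
    # Installed with ROCm but not on PATH
    printf '%s\n' "$rocm_bin/amd-smi"
  else
    echo "amd-smi not found on PATH or in $rocm_bin" >&2
    return 1
  fi
}

# Example use (not run automatically):
#   AMD_SMI=$(find_amd_smi) && "$AMD_SMI" version
```

If the helper fails, see the Troubleshooting entry for "amd-smi: command not found".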

Instructions

Standalone AMD VMs

Run the following commands directly on the VM. When opening a ticket, capture the output of all of them.

1. Confirm the GPUs are detected and check versions

# List all AMD GPUs and confirm they enumerate
amd-smi list

# Driver, amd-smi, and ROCm versions (include this in every ticket)
amd-smi version

If the GPUs do not appear, the driver is likely not loaded — see Troubleshooting below.
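
One way to confirm whether the driver is loaded before turning to Troubleshooting is to look for the amdgpu module in /proc/modules. The sketch below assumes a standard Linux /proc layout; amdgpu_loaded is a hypothetical helper name.

```shell
# amdgpu_loaded [modules_file]
# Hypothetical helper: succeed if the amdgpu kernel module is listed.
# Each line of /proc/modules starts with the module name followed by a space,
# so anchoring the pattern avoids matching other modules that merely
# depend on amdgpu or share its prefix.
amdgpu_loaded() {
  modules_file="${1:-/proc/modules}"
  grep -q '^amdgpu ' "$modules_file"
}

# Example use (not run automatically):
#   if amdgpu_loaded; then amd-smi list; else echo "amdgpu module not loaded"; fi
```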

2. Capture GPU status and metrics

This is the amd-smi equivalent of nvidia-smi.

# Point-in-time metrics: utilization, temperature, power, clocks, memory
amd-smi metric

# Static information: board, VBIOS/firmware, partition mode, serial number
amd-smi static

3. Capture ECC error counts

This is the equivalent of nvidia-smi -q -d ECC. Capture both the aggregate counts and the per-block breakdown.

# ECC error counts, including per-block breakdown
amd-smi metric --ecc
amd-smi metric --ecc-blocks

⚠️ Warning: Pay attention to uncorrectable (UE) counts. Correctable (CE) errors are handled in hardware, but any non-zero uncorrectable count indicates a hardware fault and should be reported.

4. Capture bad (retired) pages

This is the closest equivalent to NVIDIA row remapping. It reports memory pages that have been retired or are pending retirement.

amd-smi bad-pages

5. Capture XGMI link status

XGMI is the AMD inter-GPU interconnect that implements Infinity Fabric across GPUs (the equivalent of NVLink/NVSwitch). Capture this for any suspected multi-GPU or fabric issue.

amd-smi xgmi

6. Capture kernel logs

AMD does not have a direct equivalent of NVIDIA Xid codes. Instead, filter the kernel ring buffer for amdgpu driver messages. Look for GPU resets, ring timeouts, page faults, and RAS/uncorrectable-error lines.

sudo dmesg -T | grep -i amdgpu
sudo journalctl -k | grep -i amdgpu
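
To narrow a long kernel log down to the event types listed above, a second filter can be chained after the amdgpu match. This is a rough sketch: the keyword list is an illustrative assumption, not an official list of amdgpu messages, and the short "ras" keyword may over-match. Always attach the unfiltered amdgpu output to the ticket as well.

```shell
# filter_amdgpu_events
# Hypothetical helper: read kernel log lines on stdin and keep only amdgpu
# lines that mention a reset, timeout, page fault, or RAS/uncorrectable error.
# The keywords are assumptions chosen to match the categories in this article.
filter_amdgpu_events() {
  grep -i 'amdgpu' | grep -iE 'reset|timeout|page fault|uncorrectable|ras'
}

# Example use (not run automatically):
#   sudo dmesg -T | filter_amdgpu_events
#   sudo journalctl -k | filter_amdgpu_events
```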

7. (Optional) Collect a full log bundle

AMD publishes rocm_techsupport.sh, a script that collects ROCm and system logs in one pass (driver/firmware versions, amd-smi output, RAS info, XGMI errors, dmesg, lspci, and more). This is the closest equivalent to nvidia-bug-report.sh. For the script and usage instructions, see the amddcgpuce/rocmtechsupport repository.
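
If you only need the output of steps 1 through 6 in one attachment, the commands can be wrapped in a small script. The sketch below is not a substitute for rocm_techsupport.sh; the function names, file names, and archive layout are all hypothetical choices made for illustration. It saves the unfiltered kernel logs so nothing is lost to the grep filter.

```shell
# save_output <file> <command...>
# Write stdout and stderr to <file>. On failure, record the exit status and
# keep going so one broken command does not stop the whole collection.
save_output() {
  file="$1"; shift
  "$@" >"$file" 2>&1 || echo "[exit status $?] $*" >>"$file"
}

# collect_amd_debug [outdir]
# Hypothetical helper: run the commands from steps 1-6 and archive the results.
# Prints the path of the resulting .tar.gz on success.
collect_amd_debug() {
  outdir="${1:-amd-debug-$(hostname)-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$outdir" || return 1

  save_output "$outdir/list.txt"       amd-smi list
  save_output "$outdir/version.txt"    amd-smi version
  save_output "$outdir/metric.txt"     amd-smi metric
  save_output "$outdir/static.txt"     amd-smi static
  save_output "$outdir/ecc.txt"        amd-smi metric --ecc
  save_output "$outdir/ecc-blocks.txt" amd-smi metric --ecc-blocks
  save_output "$outdir/bad-pages.txt"  amd-smi bad-pages
  save_output "$outdir/xgmi.txt"       amd-smi xgmi
  save_output "$outdir/dmesg.txt"      sudo dmesg -T
  save_output "$outdir/journal.txt"    sudo journalctl -k

  # Archive relative to the parent directory so the tarball has clean paths
  tar -czf "$outdir.tar.gz" -C "$(dirname "$outdir")" "$(basename "$outdir")" \
    && echo "$outdir.tar.gz"
}

# Example use (not run automatically):
#   collect_amd_debug
```

Each command writes to its own file, so a failure in one (for example, a missing driver) still leaves the other outputs available for the ticket.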

CMK NodePool VMs (AMD GPU Operator)

On a CMK cluster with the AMD GPU Operator installed, the operator deploys a set of operand pods on each AMD GPU worker node, in the operator namespace (default: kube-amd-gpu). The metrics-exporter operand pod bundles amd-smi, so you can run the SMI commands by executing into that pod — the same pattern used for NVIDIA nodes in How-To Run nvidia-smi Commands on CMK.

Do not install ROCm on the host — the tooling runs inside the AMD GPU Operator pods.

If the operator is not yet installed, follow How-To Install AMD GPU Operator on a Crusoe Managed Kubernetes Cluster first.

1. Find the metrics-exporter pod on the affected node

The operand pod names are prefixed with your DeviceConfig name (the default DeviceConfig produces pods named default-metrics-exporter-<id>).

kubectl get pods -n kube-amd-gpu -o wide | grep <node_name> | grep metrics-exporter

ℹ️ Note: Replace kube-amd-gpu and the pod prefix with the namespace and DeviceConfig name used in your installation if they differ from the defaults.
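
As an alternative to grepping the wide output, kubectl can filter pods by node with a field selector on spec.nodeName. The sketch below wraps that in a hypothetical helper, find_exporter_pod, and assumes the pod name contains "metrics-exporter" as described above.

```shell
# find_exporter_pod <node_name> [namespace]
# Hypothetical helper: print the metrics-exporter pod scheduled on a node,
# in the "pod/<name>" form that kubectl exec accepts.
find_exporter_pod() {
  node="$1"
  ns="${2:-kube-amd-gpu}"
  kubectl get pods -n "$ns" \
    --field-selector "spec.nodeName=$node" -o name \
    | grep metrics-exporter | head -n 1
}

# Example use (not run automatically; my-gpu-node is a placeholder):
#   POD=$(find_exporter_pod my-gpu-node)
#   kubectl -n kube-amd-gpu exec "$POD" -- amd-smi version
```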

2. Run amd-smi inside the pod

# GPU status and metrics
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi metric

# ECC
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi metric --ecc

# Bad pages and XGMI
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi bad-pages
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi xgmi

# Versions (include in every ticket)
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi version
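
To save these outputs as files for a ticket, the same commands can be run in a loop. The sketch below drops -it, since no interactive terminal is needed when output is redirected to a file. The function name collect_from_pod and the output file names are hypothetical.

```shell
# collect_from_pod <pod> [namespace] [outdir]
# Hypothetical helper: run each amd-smi subcommand from this section inside
# the metrics-exporter pod and save the output to one file per subcommand.
collect_from_pod() {
  pod="$1"
  ns="${2:-kube-amd-gpu}"
  outdir="${3:-.}"
  mkdir -p "$outdir" || return 1

  for args in "metric" "metric --ecc" "bad-pages" "xgmi" "version"; do
    # Turn "metric --ecc" into "metric_ecc" for a tidy file name
    name=$(printf '%s' "$args" | sed 's/ --*/_/g')
    # $args is intentionally unquoted so "metric --ecc" splits into two words
    kubectl -n "$ns" exec "$pod" -- amd-smi $args \
      >"$outdir/amd-smi_$name.txt" 2>&1
  done
}

# Example use (not run automatically; the pod name is a placeholder):
#   collect_from_pod default-metrics-exporter-abcde kube-amd-gpu ./amd-debug
```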

3. Capture kernel logs from the node

SSH into the affected node and run:

sudo dmesg -T | grep -i amdgpu
sudo journalctl -k | grep -i amdgpu

Verification

You have collected a complete set of debugging information when you have:

amd-smi version output (driver, amd-smi, and ROCm versions).
amd-smi metric and amd-smi static output, with all expected GPUs enumerated (8 on a full MI300X/MI355X node).
amd-smi metric --ecc and amd-smi bad-pages output.
amd-smi xgmi output for fabric/multi-GPU issues.
amdgpu kernel log lines from dmesg.

A healthy node shows all GPUs present, zero uncorrectable ECC errors, no retired or pending bad pages, and no amdgpu error lines in dmesg.
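
The GPU count can be checked mechanically. The sketch below assumes that amd-smi list prints one line beginning with "GPU: <index>" per device; that output format is an assumption and may differ between amd-smi releases, so confirm it against your own output before relying on the result. check_gpu_count is a hypothetical helper name.

```shell
# check_gpu_count [expected]
# Hypothetical helper: read `amd-smi list` output on stdin and compare the
# number of "GPU: <index>" lines with the expected count (default 8).
# ASSUMPTION: each GPU is introduced by a line starting with "GPU: <index>".
check_gpu_count() {
  expected="${1:-8}"
  # grep -c exits non-zero when the count is 0, so ignore its exit status
  found=$(grep -c '^GPU: *[0-9]' || true)
  if [ "$found" -eq "$expected" ]; then
    echo "OK: $found GPUs enumerated"
  else
    echo "MISMATCH: expected $expected GPUs, found $found" >&2
    return 1
  fi
}

# Example use (not run automatically):
#   amd-smi list | check_gpu_count 8
```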

What to Look For

Signal	Where	Meaning
Uncorrectable (UE) ECC errors > 0	`amd-smi metric --ecc`	Hardware fault. Capture logs and open a ticket.
Retired or pending bad pages	`amd-smi bad-pages`	Memory pages have been retired. Capture logs and open a ticket.
`amdgpu` GPU reset, ring timeout, or page fault	`dmesg` / `journalctl -k`	GPU hang or driver-level error. Capture full kernel logs.
XGMI link errors	`amd-smi xgmi`	Inter-GPU fabric issue. Capture for multi-GPU jobs.

If you see any of the above, capture the relevant output and open a support ticket with the logs attached.

Troubleshooting

Problem	Fix
`amd-smi list` shows no GPUs	The `amdgpu` driver may not be loaded. Run `sudo modprobe amdgpu`, then re-check with `sudo dmesg | grep -i amdgpu`.
`amd-smi: command not found` and `/opt/rocm` does not exist	The VM was not created from an AMD ROCm image. Recreate the VM using an AMD ROCm image (see Prerequisites). On a CMK node, do not use a host ROCm install — use the AMD GPU Operator pod instead (see the CMK section).
