Introduction
This article shows you how to collect GPU debugging information on AMD Instinct nodes (MI300X, MI355X) so it can be attached to a support ticket. It covers two environments: standalone AMD VMs and CMK NodePool VMs running the AMD GPU Operator. It is the AMD counterpart to the NVIDIA workflow (nvidia-smi, nvidia-bug-report.sh, dmesg).
On AMD nodes, the tool for this is amd-smi (the AMD System Management Interface), which ships with ROCm.
Prerequisites
- An AMD Instinct VM (MI300X or MI355X) or a CMK cluster with an AMD GPU NodePool
- For standalone VMs, create the VM from an AMD ROCm image. These images ship with ROCm (including amd-smi at /opt/rocm/bin) and the amdgpu driver preinstalled, so the commands in this article work without any further setup.
- sudo access on the node (required for kernel logs and the full log bundle)
- For CMK, the AMD GPU Operator installed on the cluster. See How-To Install AMD GPU Operator on a Crusoe Managed Kubernetes Cluster.
Instructions
Standalone AMD VMs
Run the following commands directly on the VM. When opening a ticket, capture the output of all of them.
1. Confirm the GPUs are detected and check versions
# List all AMD GPUs and confirm they enumerate
amd-smi list
# Driver, amd-smi, and ROCm versions (include this in every ticket)
amd-smi version
If the GPUs do not appear, the driver is likely not loaded — see Troubleshooting below.
2. Capture GPU status and metrics
This is the amd-smi equivalent of nvidia-smi.
# Point-in-time metrics: utilization, temperature, power, clocks, memory
amd-smi metric
# Static information: board, VBIOS/firmware, partition mode, serial number
amd-smi static
3. Capture ECC error counts
This is the equivalent of nvidia-smi -q -d ECC. Capture the ECC error counts, including the per-block breakdown.
# ECC error counts
amd-smi metric --ecc
# Per-block ECC breakdown
amd-smi metric --ecc-blocks
⚠️ Warning: Pay attention to uncorrectable (UE) counts. Correctable (CE) errors are handled in hardware, but any non-zero uncorrectable count indicates a hardware fault that should be reported.
4. Capture bad (retired) pages
This is the closest equivalent to NVIDIA row remapping. It reports memory pages that have been retired or are pending retirement.
amd-smi bad-pages
5. Capture XGMI link status
XGMI is the AMD inter-GPU interconnect that implements Infinity Fabric across GPUs (the equivalent of NVLink/NVSwitch). Capture this for any suspected multi-GPU or fabric issue.
amd-smi xgmi
6. Capture kernel logs
AMD does not have a direct equivalent of NVIDIA Xid codes. Instead, filter the kernel ring buffer for amdgpu driver messages. Look for GPU resets, ring timeouts, page faults, and RAS/uncorrectable-error lines.
sudo dmesg -T | grep -i amdgpu
sudo journalctl -k | grep -i amdgpu
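To narrow the amdgpu lines down to the event types named in this step, a second filter can help. The following is a minimal sketch: the function name `filter_amdgpu_events` is hypothetical, and the keyword list is an assumption derived from the event types mentioned here, not from a documented list of driver messages. Exact wording varies between driver versions, so treat a match as a pointer to read the full log, not as a diagnosis.

```shell
#!/usr/bin/env bash
# Sketch: keep only amdgpu kernel log lines that mention a reset, timeout,
# page fault, or RAS/uncorrectable error. The keywords are illustrative
# assumptions, not an official list of amdgpu messages.

# filter_amdgpu_events: reads kernel log text on stdin, prints matching lines.
# Returns non-zero when nothing matches (the usual grep convention).
filter_amdgpu_events() {
    # "ras" is matched only as a whole word so that unrelated words that
    # merely contain those letters do not produce false hits.
    grep -i 'amdgpu' \
        | grep -iE 'reset|timeout|page fault|uncorrect|(^|[^[:alnum:]_])ras([^[:alnum:]_]|$)'
}
```

For example, `sudo dmesg -T | filter_amdgpu_events` prints only the suspicious lines. Still attach the unfiltered amdgpu output to the ticket, since the surrounding lines often carry the context.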
7. (Optional) Collect a full log bundle
AMD publishes rocm_techsupport.sh, a script that collects ROCm and system logs in one pass (driver/firmware versions, amd-smi output, RAS info, XGMI errors, dmesg, lspci, and more). This is the closest equivalent to nvidia-bug-report.sh. For the script and usage instructions, see the amddcgpuce/rocmtechsupport repository.
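If you prefer not to run the full support script, the individual commands from steps 1 to 6 can be gathered in one pass. The following is a minimal sketch, not an official tool: the function names `run_to_file` and `collect_amd_gpu_debug`, the output directory, and the file names are assumptions made for illustration. Only the commands themselves come from this article.

```shell
#!/usr/bin/env bash
# Sketch: run the standalone-VM commands from this article and save each
# output to its own file. Function and file names are illustrative only.

# run_to_file <outfile> <command...>
# Stores stdout and stderr in <outfile> and never aborts the collection:
# a failing command is itself useful information for the ticket.
run_to_file() {
    local outfile=$1
    shift
    if ! "$@" >"$outfile" 2>&1; then
        echo "command failed: $*" >>"$outfile"
    fi
}

# collect_amd_gpu_debug <output-directory>
collect_amd_gpu_debug() {
    local outdir=$1
    mkdir -p "$outdir" || return 1

    run_to_file "$outdir/amd-smi_list.txt"              amd-smi list
    run_to_file "$outdir/amd-smi_version.txt"           amd-smi version
    run_to_file "$outdir/amd-smi_metric.txt"            amd-smi metric
    run_to_file "$outdir/amd-smi_static.txt"            amd-smi static
    run_to_file "$outdir/amd-smi_metric_ecc.txt"        amd-smi metric --ecc
    run_to_file "$outdir/amd-smi_metric_ecc_blocks.txt" amd-smi metric --ecc-blocks
    run_to_file "$outdir/amd-smi_bad_pages.txt"         amd-smi bad-pages
    run_to_file "$outdir/amd-smi_xgmi.txt"              amd-smi xgmi

    # Kernel logs need root. grep exits non-zero when nothing matches, which
    # is the healthy case, so its status is deliberately ignored.
    sudo dmesg -T 2>&1      | grep -i amdgpu >"$outdir/dmesg_amdgpu.txt"      || true
    sudo journalctl -k 2>&1 | grep -i amdgpu >"$outdir/journalctl_amdgpu.txt" || true

    echo "Saved GPU debug output to $outdir"
}
```

A typical call would be `collect_amd_gpu_debug ./amd-gpu-debug`, followed by `tar czf amd-gpu-debug.tar.gz amd-gpu-debug` to produce a single attachment. Writing one file per command keeps a failure in one step from hiding the output of the others.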
CMK NodePool VMs (AMD GPU Operator)
On a CMK cluster with the AMD GPU Operator installed, the operator deploys a set of operand pods on each AMD GPU worker node, in the operator namespace (default: kube-amd-gpu). The metrics-exporter operand pod bundles amd-smi, so you can run the SMI commands by executing into that pod — the same pattern used for NVIDIA nodes in How-To Run nvidia-smi Commands on CMK.
Do not install ROCm on the host — the tooling runs inside the AMD GPU Operator pods.
If the operator is not yet installed, follow How-To Install AMD GPU Operator on a Crusoe Managed Kubernetes Cluster first.
1. Find the metrics-exporter pod on the affected node
The operand pod names are prefixed with your DeviceConfig name (the default DeviceConfig produces pods named default-metrics-exporter-<id>).
kubectl get pods -n kube-amd-gpu -o wide | grep <node_name> | grep metrics-exporter
ℹ️ Note: Replace kube-amd-gpu and the pod prefix with the namespace and DeviceConfig name used in your installation if they differ from the defaults.
2. Run amd-smi inside the pod
# GPU status and metrics
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi metric
# Static information (board, firmware, serial number)
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi static
# ECC
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi metric --ecc
# Bad pages and XGMI
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi bad-pages
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi xgmi
# Versions (include in every ticket)
kubectl -n kube-amd-gpu exec -it <metrics-exporter-pod> -- amd-smi version
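Steps 1 and 2 can be combined so that the pod lookup and the amd-smi calls run in one pass. This is a sketch under assumptions: the function names and the `NAMESPACE` variable are hypothetical, and it assumes one metrics-exporter pod per GPU node. It uses kubectl's `--field-selector spec.nodeName=...` instead of grepping the wide output, and it drops `-it` because no terminal is needed when the output is written to a file.

```shell
#!/usr/bin/env bash
# Sketch: find the metrics-exporter pod on one node and run the amd-smi
# commands from this article inside it. Names are illustrative.

NAMESPACE="${NAMESPACE:-kube-amd-gpu}"   # operator namespace (article default)

# find_exporter_pod <node-name>: prints the pod name, or fails if none found.
find_exporter_pod() {
    local node=$1 pod
    pod=$(kubectl get pods -n "$NAMESPACE" \
              --field-selector "spec.nodeName=$node" -o name \
          | grep metrics-exporter | head -n 1)
    if [ -z "$pod" ]; then
        echo "no metrics-exporter pod found on $node" >&2
        return 1
    fi
    echo "${pod#pod/}"   # strip the "pod/" prefix that -o name adds
}

# collect_from_pod <node-name> <output-directory>
collect_from_pod() {
    local node=$1 outdir=$2 pod sub outfile
    pod=$(find_exporter_pod "$node") || return 1
    mkdir -p "$outdir" || return 1
    for sub in "version" "metric" "static" "metric --ecc" "bad-pages" "xgmi"; do
        # Build a file name such as amd-smi_metric_ecc.txt from the subcommand.
        outfile="$outdir/amd-smi_$(echo "$sub" | tr -s ' -' '__').txt"
        # $sub is intentionally unquoted so "metric --ecc" splits into two args.
        # shellcheck disable=SC2086
        kubectl -n "$NAMESPACE" exec "$pod" -- amd-smi $sub >"$outfile" 2>&1 || true
    done
    echo "Saved output from $pod to $outdir"
}
```

For example, `collect_from_pod <node_name> ./amd-gpu-debug` writes one file per command. Set `NAMESPACE` first if your operator is not installed in kube-amd-gpu.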
3. Capture kernel logs from the node
SSH into the node and run the following there:
sudo dmesg -T | grep -i amdgpu
sudo journalctl -k | grep -i amdgpu
Verification
You have collected a complete set of debugging information when you have:
- amd-smi version output (driver, amd-smi, and ROCm versions).
- amd-smi metric and amd-smi static output, with all expected GPUs enumerated (8 on a full MI300X/MI355X node).
- amd-smi metric --ecc and amd-smi bad-pages output.
- amd-smi xgmi output for fabric/multi-GPU issues.
- amdgpu kernel log lines from dmesg.
A healthy node shows all GPUs present, zero uncorrectable ECC errors, no retired or pending bad pages, and no amdgpu error lines in dmesg.
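If you saved each command's output to a file, the checklist above can be checked mechanically before you attach the files. This is a minimal sketch that assumes one file per command; the function name `check_bundle` and the file names are hypothetical, so adjust them to match how you saved the output. It only checks that the files are present. It does not judge whether the node is healthy.

```shell
#!/usr/bin/env bash
# Sketch: check that a directory of saved outputs covers the checklist.
# File names are hypothetical; adapt them to your own naming.

# check_bundle <directory>
# Prints one line per problem and returns non-zero if anything is missing.
check_bundle() {
    local dir=$1 name problems=0

    # Outputs that are expected to contain text whenever amd-smi runs.
    for name in amd-smi_version amd-smi_metric amd-smi_static \
                amd-smi_metric_ecc amd-smi_bad_pages; do
        if [ ! -s "$dir/$name.txt" ]; then
            echo "missing or empty: $name.txt"
            problems=$((problems + 1))
        fi
    done

    # The filtered kernel log can legitimately be empty on a healthy node,
    # so for these files only existence is required.
    for name in amd-smi_xgmi dmesg_amdgpu; do
        if [ ! -e "$dir/$name.txt" ]; then
            echo "missing: $name.txt"
            problems=$((problems + 1))
        fi
    done

    [ "$problems" -eq 0 ]
}
```

For example, `check_bundle ./amd-gpu-debug && echo "bundle complete"` prints the confirmation only when every expected file is present.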
What to Look For
| Signal | Where | Meaning |
|---|---|---|
| Uncorrectable (UE) ECC errors > 0 | amd-smi metric --ecc | Hardware fault. Capture logs and open a ticket. |
| Retired or pending bad pages | amd-smi bad-pages | Memory pages have been retired. Capture logs and open a ticket. |
| amdgpu GPU reset, ring timeout, or page fault | dmesg / journalctl -k | GPU hang or driver-level error. Capture full kernel logs. |
| XGMI link errors | amd-smi xgmi | Inter-GPU fabric issue. Capture for multi-GPU jobs. |
If you see any of the above, capture the relevant output and open a support ticket with the logs attached.
Troubleshooting
| Problem | Fix |
|---|---|
| amd-smi list shows no GPUs | The amdgpu driver may not be loaded. Run sudo modprobe amdgpu, then re-check with sudo dmesg \| grep -i amdgpu. |
| amd-smi: command not found and /opt/rocm does not exist | The VM was not created from an AMD ROCm image. Recreate the VM using an AMD ROCm image (see Prerequisites). On a CMK node, do not use a host ROCm install; use the AMD GPU Operator pod instead (see the CMK section). |
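On a standalone VM, the two problems in the table can be told apart with a quick check. The sketch below is illustrative only: the function name `diagnose_amd_node` and the `ROCM_DIR` and `SYSFS_MODULE_DIR` variables are hypothetical, and the variables exist mainly so the paths can be overridden. It relies on the fact that a loaded kernel module appears as a directory under /sys/module. It does not apply to CMK nodes, where ROCm is intentionally absent from the host.

```shell
#!/usr/bin/env bash
# Sketch: distinguish "ROCm not installed" from "amdgpu driver not loaded"
# on a standalone VM. Variable and function names are illustrative.

ROCM_DIR="${ROCM_DIR:-/opt/rocm}"
SYSFS_MODULE_DIR="${SYSFS_MODULE_DIR:-/sys/module}"

# diagnose_amd_node: prints a one-line finding.
# Returns 0 if both checks pass, 1 if ROCm is missing, 2 if amdgpu is not loaded.
diagnose_amd_node() {
    if [ ! -d "$ROCM_DIR" ]; then
        echo "rocm-missing: $ROCM_DIR not found; the VM was likely not created from an AMD ROCm image"
        return 1
    fi
    # A loaded kernel module shows up as a directory under /sys/module.
    if [ ! -d "$SYSFS_MODULE_DIR/amdgpu" ]; then
        echo "driver-not-loaded: run 'sudo modprobe amdgpu', then re-check the kernel log"
        return 2
    fi
    echo "ok: ROCm directory present and amdgpu module loaded"
}
```

Running `diagnose_amd_node` with no arguments checks the default paths. An "ok" result only rules out these two causes; if amd-smi list still shows no GPUs, capture the kernel logs and open a ticket.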