Introduction
GPU pods on a GB200 VM can fail to start with 'open /run/nvidia-persistenced/socket: no such file or directory' when the nvidia-persistenced daemon is not running on the VM. The Kubernetes NVIDIA device plugin relies on the daemon's socket at /run/nvidia-persistenced/socket to expose GPUs into pods, so when the daemon stops, new GPU pods cannot start. The most common cause is the Linux OOM killer terminating the daemon during a memory-intensive workload. This guide shows you how to confirm the cause and restart the daemon to bring GPU workloads back online.
Prerequisites
- SSH access with sudo on the affected GB200 VM.
- kubectl access to the cluster.
Instructions
- Confirm the daemon is not running
  On the affected VM, run:
  sudo systemctl status nvidia-persistenced --no-pager
  If the output shows 'Active: failed (Result: oom-kill)' or any state other than 'active (running)', continue. If it shows 'active (running)', this article does not apply.
- Restart the daemon
  sudo systemctl restart nvidia-persistenced
- Verify persistence mode is enabled on every GPU
  nvidia-smi -q | grep "Persistence Mode"
  Every line must read 'Enabled'.
- Delete the failing pod
  Delete the pod that failed with the socket error. If a controller such as a Deployment manages it, a replacement pod is created and starts cleanly once the daemon is running.
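The status check in the first step can also be scripted. The following is a minimal sketch, not part of the original procedure: it assumes the output of systemctl status is piped in on stdin, and the function name classify_status is hypothetical. It only looks for the two strings quoted in this article and treats everything else as "not running".

```shell
# Hypothetical helper: classify `systemctl status nvidia-persistenced` output.
# Reads the status text on stdin and prints one of:
#   running      the unit reports 'Active: active (running)'
#   oom-kill     the unit reports 'Result: oom-kill'
#   not-running  any other state
classify_status() {
    text=$(cat)
    case "$text" in
        *"Active: active (running)"*) echo "running" ;;
        *"Result: oom-kill"*)         echo "oom-kill" ;;
        *)                            echo "not-running" ;;
    esac
}
```

One possible use on the affected VM is: sudo systemctl status nvidia-persistenced --no-pager | classify_status. A result of "oom-kill" or "not-running" means the restart step applies; "running" means this article does not apply.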
Example
After the restart, sudo systemctl status nvidia-persistenced --no-pager should report the daemon as 'active (running)', and nvidia-smi -q | grep "Persistence Mode" should return 'Enabled' for every GPU:
Persistence Mode : Enabled
Persistence Mode : Enabled
[...]
Re-created GPU pods should then move through 'ContainerCreating' and reach 'Running' without the socket error.
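On a VM with many GPUs, reading the 'Persistence Mode' lines by eye is error-prone. As a rough sketch under stated assumptions, the check can be reduced to an exit code: the function below reads nvidia-smi -q output on stdin and succeeds only if at least one 'Persistence Mode' line is present and every such line reads 'Enabled'. The function name all_persistence_enabled is hypothetical, and the parsing assumes the "Persistence Mode : Enabled" line layout shown above.

```shell
# Hypothetical helper: succeed only if every GPU reports persistence mode Enabled.
# Reads `nvidia-smi -q` output on stdin.
# Returns 0 when all 'Persistence Mode' lines read 'Enabled';
# returns 1 when any line differs or when no such line is found.
all_persistence_enabled() {
    awk -F: '
        /Persistence Mode/ {
            seen = 1
            gsub(/[ \t\r]/, "", $2)        # strip whitespace around the value
            if ($2 != "Enabled") bad = 1
        }
        END {
            if (seen && !bad) exit 0
            exit 1
        }
    '
}
```

For example: nvidia-smi -q | all_persistence_enabled && echo "persistence mode enabled on all GPUs". Treating empty input as a failure is a deliberate choice here, so that a missing or failing nvidia-smi is not mistaken for a healthy VM.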
Additional Information