How to resolve 'open /run/nvidia-persistenced/socket: no such file or directory' errors on GB200 instances

Introduction

GPU pods on a GB200 VM can fail to start with 'open /run/nvidia-persistenced/socket: no such file or directory' when the nvidia-persistenced daemon is not running on the VM. The Kubernetes NVIDIA device plugin relies on the daemon's socket at /run/nvidia-persistenced/socket to expose GPUs to pods, so when the daemon stops, new GPU pods cannot start. The most common cause is the Linux OOM killer terminating the daemon during a memory-intensive workload. This guide shows you how to confirm the cause and restart the daemon to bring GPU workloads back online.

Prerequisites

SSH access with sudo on the affected GB200 VM.
kubectl access to the cluster.

Instructions

Confirm the daemon is not running

On the affected VM, run:
```
sudo systemctl status nvidia-persistenced --no-pager
```
If the output shows 'Active: failed (Result: oom-kill)' or any state other than 'active (running)', continue. If it shows 'active (running)', this article does not apply.
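
If you want to script this check rather than read the status output by eye, the following is a minimal sketch. It assumes the machine-readable properties from systemctl show (ActiveState, SubState, Result) are available for the unit; the function name classify_persistenced_state is hypothetical and not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch: classify the daemon state from `systemctl show` KEY=VALUE lines on stdin.
# Prints one of: running, oom-killed, not-running.
# classify_persistenced_state is a hypothetical helper name.
classify_persistenced_state() {
  local active="" sub="" result="" line
  # The `|| [ -n "$line" ]` guard also handles a final line without a newline.
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      ActiveState=*) active="${line#ActiveState=}" ;;
      SubState=*)    sub="${line#SubState=}" ;;
      Result=*)      result="${line#Result=}" ;;
    esac
  done
  if [ "$active" = "active" ] && [ "$sub" = "running" ]; then
    echo "running"
  elif [ "$result" = "oom-kill" ]; then
    echo "oom-killed"
  else
    echo "not-running"
  fi
}

# Intended use on the affected VM:
#   systemctl show nvidia-persistenced -p ActiveState -p SubState -p Result \
#     | classify_persistenced_state
```

A result of oom-killed matches the 'Active: failed (Result: oom-kill)' case above. The kernel log (sudo journalctl -k) may also record the kill; process names there can appear truncated to 15 characters, so searching for 'nvidia-persiste' is usually more reliable than the full name.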

Restart the daemon

sudo systemctl restart nvidia-persistenced
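
The restart returns before you know whether the socket exists again, so it can help to wait for it explicitly. The sketch below polls for the socket file; the helper name wait_for_socket and the 30-second default are assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: wait until a Unix socket exists at the given path, or give up.
# wait_for_socket is a hypothetical helper; the timeout is in seconds.
wait_for_socket() {
  local path="$1" timeout="${2:-30}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -S is true only if the path exists and is a socket.
    [ -S "$path" ] && return 0
    sleep 1
    waited=$((waited + 1))
  done
  # Final check; the function's exit status is the result of this test.
  [ -S "$path" ]
}

# Intended use on the affected VM:
#   sudo systemctl restart nvidia-persistenced \
#     && wait_for_socket /run/nvidia-persistenced/socket 30
```

A non-zero exit status means the socket did not appear within the timeout, in which case re-check the daemon status before continuing.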

Verify persistence mode is enabled on every GPU

```
nvidia-smi -q | grep "Persistence Mode"
```
Every line must read 'Enabled'.

Delete the failing pod

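Delete the failing pod with kubectl delete pod. If a controller such as a Deployment or StatefulSet manages the pod, a replacement is created automatically and starts cleanly now that the daemon is running; a standalone pod must be re-created manually.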
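
To avoid deleting a pod that is failing for an unrelated reason, you can first confirm that its description mentions the socket error. This is a sketch that assumes the error text appears in the output of kubectl describe pod; has_socket_error is a hypothetical helper, and POD and NS are placeholders for the failing pod and its namespace.

```shell
#!/usr/bin/env bash
# Sketch: succeed if the text on stdin mentions the missing persistenced socket.
# has_socket_error is a hypothetical helper name.
has_socket_error() {
  grep -qF 'open /run/nvidia-persistenced/socket: no such file or directory'
}

# Intended use (POD and NS are placeholders, set them to your own values):
#   kubectl describe pod "$POD" -n "$NS" | has_socket_error \
#     && kubectl delete pod "$POD" -n "$NS"
```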

Example

After the restart, sudo systemctl status nvidia-persistenced --no-pager should report the daemon as 'active (running)', and nvidia-smi -q | grep "Persistence Mode" should return 'Enabled' for every GPU:

    Persistence Mode                      : Enabled
    Persistence Mode                      : Enabled
[...]
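
On a VM with many GPUs, scanning the list by eye is error-prone. The following sketch turns the check into a pass/fail result; it assumes the 'Persistence Mode : Enabled' line format shown above, and all_persistence_enabled is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read `nvidia-smi -q` output on stdin and succeed only if at least one
# "Persistence Mode" line is present and every such line reads Enabled.
# all_persistence_enabled is a hypothetical helper name.
all_persistence_enabled() {
  awk -F':' '
    /Persistence Mode/ {
      total++
      gsub(/[[:space:]]/, "", $2)   # strip padding around the value
      if ($2 == "Enabled") enabled++
    }
    END { exit !(total > 0 && enabled == total) }
  '
}

# Intended use on the affected VM:
#   nvidia-smi -q | all_persistence_enabled && echo "persistence mode enabled on all GPUs"
```

The function also fails when no 'Persistence Mode' line is found, so an empty or failed nvidia-smi run is not mistaken for success.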

Re-created GPU pods should then move through 'ContainerCreating' and reach 'Running' without the socket error.
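
To spot pods that have not yet reached 'Running', you can filter the pod list. This sketch assumes the default column layout of kubectl get pods (NAME, READY, STATUS, ...) without the --all-namespaces flag, which would shift the columns; pods_not_running is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: read `kubectl get pods --no-headers` output on stdin and print the
# names of pods whose STATUS column (third field) is not Running.
# pods_not_running is a hypothetical helper name.
pods_not_running() {
  awk '$3 != "Running" { print $1 }'
}

# Intended use (NS is a placeholder for your namespace):
#   kubectl get pods -n "$NS" --no-headers | pods_not_running
```

Pods in states such as 'Completed' are listed as well, so treat the output as a list to inspect, not as a list of failures.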

