
How to Resolve 'open /run/nvidia-persistenced/socket: no such file or directory' Errors on GB200 Instances

Rasul Imanov

Introduction

GPU pods on a GB200 VM can fail to start with 'open /run/nvidia-persistenced/socket: no such file or directory' when the nvidia-persistenced daemon is not running on the VM. The Kubernetes NVIDIA device plugin relies on the daemon's socket at /run/nvidia-persistenced/socket to expose GPUs to pods, so when the daemon stops, new GPU pods cannot start. The most common cause is the Linux OOM killer terminating the daemon during a memory-intensive workload. This guide shows you how to confirm the cause and restart the daemon to bring GPU workloads back online.
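To check whether the OOM killer was responsible, you can search the kernel log for a kill record naming the daemon. The following is a minimal sketch, not an official tool: the function name find_daemon_oom_kills is hypothetical, and it assumes the usual kernel message form "Killed process <pid> (<name>)". The kernel truncates process names to 15 characters, so the daemon is expected to appear as "nvidia-persiste".

```shell
# Hypothetical helper: reads kernel log text on stdin and prints any
# OOM-kill records for the nvidia-persistenced daemon.
# The kernel truncates process names to 15 characters, so the pattern
# matches "nvidia-persiste" rather than the full daemon name.
# Returns non-zero (grep's exit status) when no record is found.
find_daemon_oom_kills() {
    grep -E 'Killed process [0-9]+ \(nvidia-persiste'
}

# Possible usage on the affected VM:
#   sudo journalctl -k --no-pager | find_daemon_oom_kills
```

If the function prints nothing, the daemon may have stopped for a different reason, or the relevant kernel messages may no longer be retained.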

Prerequisites

  • SSH access with sudo on the affected GB200 VM. 
  • kubectl access to the cluster.

Instructions

  1. Confirm the daemon is not running

    On the affected VM, run:

    sudo systemctl status nvidia-persistenced --no-pager

    If the output shows 'Active: failed (Result: oom-kill)' or any state other than 'active (running)', continue. If it shows 'active (running)', this article does not apply.

  2. Restart the daemon

    sudo systemctl restart nvidia-persistenced

  3. Verify persistence mode is enabled on every GPU

    nvidia-smi -q | grep "Persistence Mode"

    Every line must read 'Enabled'.

  4. Delete the failing pod

    Delete the failing pod with kubectl. If a controller such as a Deployment or Job manages the pod, a replacement is created and starts cleanly now that the daemon is running.
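Step 4 can be narrowed down with a small filter over kubectl output. This is a sketch under stated assumptions: the function name list_unhealthy_pods is hypothetical, the placeholders in angle brackets must be replaced with your own values, and it treats any pod whose STATUS is neither Running nor Completed as a candidate, which may include pods that are failing for unrelated reasons.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on
# stdin and prints "namespace name" for every pod whose STATUS column
# (field 4) is neither Running nor Completed.
list_unhealthy_pods() {
    awk '$4 != "Running" && $4 != "Completed" { print $1, $2 }'
}

# Possible usage (replace the placeholders first):
#   kubectl get pods -A --no-headers \
#       --field-selector spec.nodeName=<node-name> | list_unhealthy_pods
#   kubectl delete pod <pod-name> -n <namespace>
```

Check each candidate with kubectl describe pod before deleting it, to confirm that it shows the socket error.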

Example

After the restart, sudo systemctl status nvidia-persistenced --no-pager should report the daemon as 'active (running)', and nvidia-smi -q | grep "Persistence Mode" should return 'Enabled' for every GPU:

    Persistence Mode                      : Enabled
    Persistence Mode                      : Enabled
    [...]

Re-created GPU pods should then move through 'ContainerCreating' and reach 'Running' without the socket error.
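The persistence-mode check above can also be scripted so that it yields a single pass or fail result. The sketch below is illustrative only: the function name all_persistence_enabled is hypothetical, and it assumes nvidia-smi -q prints one "Persistence Mode : <value>" line per GPU, as in the sample output above.

```shell
# Hypothetical helper: reads `nvidia-smi -q` output on stdin.
# Exits 0 only if at least one "Persistence Mode" line is present and
# every such line reads "Enabled"; exits 1 otherwise.
all_persistence_enabled() {
    awk -F':' '
        /Persistence Mode/ {
            total++
            gsub(/[ \t]/, "", $2)          # strip padding around the value
            if ($2 == "Enabled") enabled++
        }
        END { exit !(total > 0 && enabled == total) }
    '
}

# Possible usage on the affected VM:
#   nvidia-smi -q | all_persistence_enabled && echo "all GPUs OK"
```

Treating an input with no "Persistence Mode" lines as a failure is a deliberate choice here, so that a missing driver or empty output is not mistaken for success.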
