Overview
After upgrading the NVIDIA driver on a Slurm compute node, you may encounter the error message "We were configured with NVML functionality, but that lib wasn't found on the system" in the Slurm daemon (slurmd) logs. Additionally, sinfo may show the node in an "invalid" state. This issue is typically caused by the absence of the libnvidia-ml.so file, which is essential for NVML functionality.
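Before changing anything, it can help to confirm what the dynamic linker actually sees. A minimal check, assuming a glibc-based system where `ldconfig -p` prints the linker cache:

```shell
# If libnvidia-ml is absent from the linker cache, slurmd's NVML lookup fails.
ldconfig -p | grep -i nvidia-ml || echo "libnvidia-ml not in the linker cache"
```

If the cache lists only the versioned library (libnvidia-ml.so.1) and not the unversioned libnvidia-ml.so, the steps below restore the missing link.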
Prerequisites
- A Slurm cluster
- SSH access to the compute VMs
- Root privileges
Steps
- Locate all libnvidia-ml.so files on the affected Slurm compute node. The output will likely resemble this:
root@slurm-compute-node-0:~# find /usr -name "libnvidia-ml.so*"
/usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.565.57.01
- The error is most likely due to the missing libnvidia-ml.so file; create a symbolic link for it:
root@slurm-compute-node-0:~# ln -s libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
- List the libnvidia-ml files to verify that the symbolic links are properly configured. The expected output should resemble this:
root@slurm-compute-node-0:~# ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
lrwxrwxrwx 1 root root 17 Dec 28 03:13 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> libnvidia-ml.so.1
root@slurm-compute-node-0:~# ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 25 Oct 15 08:24 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.565.57.01
- Restart the Slurm daemon (slurmd):
root@slurm-compute-node-0:~# systemctl restart slurmd
- The NVML error should no longer appear in the slurmd logs, which you can verify with the following command:
root@slurm-compute-node-0:~# journalctl -u slurmd
- As a result, the compute node should now appear in the drain state:
root@slurm-compute-node-0:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 drain slurm-compute-node-[0-1]
login inact infinite 2 idle slurm-login-node-[0-1]
- Resume the nodes (returning them to the idle state) so they can begin accepting jobs:
root@slurm-compute-node-0:~# scontrol update NodeName=<node_name> State=RESUME
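The symlink repair above can be wrapped in a small POSIX shell sketch. The `ensure_nvml_link` function and the scratch-directory demo below are hypothetical, for illustration only; on a real node you would pass `/usr/lib/x86_64-linux-gnu` and then restart slurmd as shown in the steps.

```shell
# Hedged sketch: create the missing libnvidia-ml.so symlink in a library
# directory, but only if the versioned library (.so.1) is actually present.
ensure_nvml_link() {
    libdir="$1"
    if [ ! -e "$libdir/libnvidia-ml.so.1" ]; then
        echo "libnvidia-ml.so.1 not found in $libdir; check the driver install" >&2
        return 1
    fi
    if [ ! -e "$libdir/libnvidia-ml.so" ]; then
        # Relative target, matching the existing .so.1 -> .so.565.57.01 style
        ln -s libnvidia-ml.so.1 "$libdir/libnvidia-ml.so"
    fi
}

# Demo against a scratch directory (on a real node: pass
# /usr/lib/x86_64-linux-gnu, then run `systemctl restart slurmd`):
demo=$(mktemp -d)
touch "$demo/libnvidia-ml.so.1"
ensure_nvml_link "$demo"
ls -l "$demo/libnvidia-ml.so"
```

Using a relative symlink target keeps the link valid even if the library tree is mounted at a different prefix, which matches how the driver's own .so.1 link is laid out.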