Overview
After upgrading the NVIDIA driver on a Slurm compute node, you may encounter the error message "We were configured with NVML functionality, but that lib wasn't found on the system" in the Slurm daemon (slurmd) logs. Additionally, sinfo may show the node in an "invalid" state. This issue is typically caused by the absence of the libnvidia-ml.so file, which is essential for NVML functionality.
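Before changing anything, it can help to confirm what the dynamic linker actually sees. A minimal check, assuming a glibc-based system where `ldconfig -p` prints the linker cache:

```shell
# If libnvidia-ml is absent from the linker cache, slurmd's NVML lookup fails.
ldconfig -p | grep -i nvidia-ml || echo "libnvidia-ml not in the linker cache"
```

If the cache lists only the versioned library (libnvidia-ml.so.1) and not the unversioned libnvidia-ml.so, the steps below restore the missing link.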
Prerequisites
- A Slurm cluster
- SSH access to the compute VMs
- Root privileges
Steps
- Locate all libnvidia-ml.so files on the affected Slurm compute node. The output will likely resemble this:
root@slurm-compute-node-0:~# find /usr -name "libnvidia-ml.so*"
/usr/local/cuda-12.4/targets/x86_64-linux/lib/stubs/libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.565.57.01
- The error is most likely due to the missing libnvidia-ml.so file; create a symbolic link for it:
root@slurm-compute-node-0:~# ln -s libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
- List the libnvidia-ml files to verify that the symbolic links are properly configured. The expected output should resemble this:
root@slurm-compute-node-0:~# ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
lrwxrwxrwx 1 root root 17 Dec 28 03:13 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> libnvidia-ml.so.1
root@slurm-compute-node-0:~# ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 25 Oct 15 08:24 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.565.57.01
- Restart the Slurm daemon (slurmd):
root@slurm-compute-node-0:~# systemctl restart slurmd
- The NVML error should no longer appear in the slurmd logs, which you can verify with the following command:
root@slurm-compute-node-0:~# journalctl -u slurmd
- As a result, the compute node should now appear in the drain state:
root@slurm-compute-node-0:~# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 drain slurm-compute-node-[0-1]
login inact infinite 2 idle slurm-login-node-[0-1]
- Resume the nodes (returning them to the idle state) so they can begin accepting jobs:
root@slurm-compute-node-0:~# scontrol update NodeName=<node_name> State=RESUME
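The symlink repair above can be wrapped in a small POSIX shell sketch. The `ensure_nvml_link` function and the scratch-directory demo below are hypothetical, for illustration only; on a real node you would pass `/usr/lib/x86_64-linux-gnu` and then restart slurmd as shown in the steps.

```shell
# Hedged sketch: create the missing libnvidia-ml.so symlink in a library
# directory, but only if the versioned library (.so.1) is actually present.
ensure_nvml_link() {
    libdir="$1"
    if [ ! -e "$libdir/libnvidia-ml.so.1" ]; then
        echo "libnvidia-ml.so.1 not found in $libdir; check the driver install" >&2
        return 1
    fi
    if [ ! -e "$libdir/libnvidia-ml.so" ]; then
        # Relative target, matching the existing .so.1 -> .so.565.57.01 style
        ln -s libnvidia-ml.so.1 "$libdir/libnvidia-ml.so"
    fi
}

# Demo against a scratch directory (on a real node: pass
# /usr/lib/x86_64-linux-gnu, then run `systemctl restart slurmd`):
demo=$(mktemp -d)
touch "$demo/libnvidia-ml.so.1"
ensure_nvml_link "$demo"
ls -l "$demo/libnvidia-ml.so"
```

Using a relative symlink target keeps the link valid even if the library tree is mounted at a different prefix, which matches how the driver's own .so.1 link is laid out.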