Introduction
If you are running Slurm, either self-managed or through our reference architecture, you have likely seen Slurm automatically drain nodes whose underlying host has failed. Even after the instance has been migrated to a healthy host, Slurm may continue to show the node in a DRAIN state.
ubuntu@slurm-compute-node-2:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 1 drain slurm-compute-node-0
The following command can also be used to query the state of a Slurm node.
# scontrol show node <node-name>
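For example, assuming the drained node is slurm-compute-node-0 (as in the sinfo output above), you can narrow the output down to the node's state and the recorded drain reason:
# scontrol show node slurm-compute-node-0 | grep -E "State|Reason"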
Prerequisites
- SSH access to the Slurm head node and the affected compute node.
Solution
When Slurm puts a node into a DRAIN state, it does not automatically detect when the underlying instance has been migrated to a healthy server. As a result, the node may continue to report as down even though nvidia-smi shows all GPUs as available. To return the node to the Slurm pool, follow the steps below.
1. First, log in to the affected compute node, restart slurmd, and reboot the node
# sudo systemctl restart slurmd
# sudo reboot now
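Once the node comes back up, it can be worth confirming that slurmd started cleanly before moving on. These are standard systemd commands and assume the default slurmd unit name:
# sudo systemctl status slurmd
# sudo journalctl -u slurmd --since "10 minutes ago"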
2. If that does not recover the state, log in to the Slurm head node and restart slurmctld
# sudo systemctl restart slurmctld
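Similarly, you can verify that the controller restarted and is responding before proceeding (scontrol ping reports whether slurmctld is up):
# sudo systemctl status slurmctld
# scontrol ping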
3. Finally, you can use the following command to force Slurm to return the node to the pool
# scontrol update NodeName=<node-name> State=RESUME
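For example, for the node shown in the earlier sinfo output:
# scontrol update NodeName=slurm-compute-node-0 State=RESUME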
4. From there, you can use sinfo to confirm the node is no longer listed as drained, and use srun to check that all GPUs are visible
# sinfo -R
# srun -N 1 --gres=gpu:8 -w slurm-compute-node-0 nvidia-smi -L
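As a quick end-to-end check that the node is accepting work again, you can also submit a trivial test job to it (the node name here matches the example above):
# srun -N 1 -w slurm-compute-node-0 hostname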
If you have any trouble getting your nodes back online to receive jobs, please reach out to Support.