Introduction
If you are running Slurm, either self-managed or through our reference architecture, you have likely seen Slurm automatically drain nodes whose underlying host has failed. Even after the instance has been migrated to a healthy host, Slurm may continue to show the node in a DRAIN state.
ubuntu@slurm-compute-node-2:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 1 drain slurm-compute-node-0
The following command can also be used to query the state of a Slurm node.
# scontrol show node <node-name>
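For example, assuming the drained node is slurm-compute-node-0 (as in the sinfo output above), you can narrow the output down to the node's state and the recorded drain reason:
# scontrol show node slurm-compute-node-0 | grep -E "State|Reason"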
Prerequisites
- SSH access to the Slurm head node and the affected compute node.
Solution
When Slurm puts a node into a DRAIN state, it does not automatically detect when the underlying instance has been migrated to a healthy server. As a result, the node may continue to report as down even though nvidia-smi shows all GPUs as available. To return the node to the Slurm pool, follow the steps below.
1. First, log in to the affected compute node, restart slurmd, and reboot the node
# sudo systemctl restart slurmd
# sudo reboot now
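Once the node comes back up, it can be worth confirming that slurmd started cleanly before moving on. These are standard systemd commands and assume the default slurmd unit name:
# sudo systemctl status slurmd
# sudo journalctl -u slurmd --since "10 minutes ago"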
2. If that does not recover the state, log in to the Slurm head node and restart slurmctld
# sudo systemctl restart slurmctld
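Similarly, you can verify that the controller restarted and is responding before proceeding (scontrol ping reports whether slurmctld is up):
# sudo systemctl status slurmctld
# scontrol ping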
3. Finally, you can use the following command to force Slurm to return the node to the pool
# scontrol update NodeName=<node-name> State=RESUME
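For example, for the node shown in the earlier sinfo output:
# scontrol update NodeName=slurm-compute-node-0 State=RESUME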
4. From there, you can use sinfo to confirm the node is no longer listed as drained, and use srun to check that all GPUs are visible
# sinfo -R
# srun -N 1 --gres=gpu:8 -w slurm-compute-node-0 nvidia-smi -L
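As a quick end-to-end check that the node is accepting work again, you can also submit a trivial test job to it (the node name here matches the example above):
# srun -N 1 -w slurm-compute-node-0 hostname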
If you have any trouble getting your nodes back online to receive jobs, please reach out to Support.