How-To Recover a Slurm Node Drained Due to Full Root Disk

Introduction

When a Slurm node's root disk fills up completely, slurmd automatically drains the node and marks it unavailable with the reason "SlurmdSpoolDir is full". SlurmdSpoolDir is the directory where slurmd stores its state files. When the filesystem it lives on runs out of space, slurmd can no longer write its own state, so it drains the node as a safety measure.

The node shows a DRAIN state in sinfo or scontrol output, and may also appear as down with the reason "Not responding" in sinfo -R output. It stops accepting new jobs until disk space is freed and the node is manually resumed.

This commonly surfaces on nodes that have been running long jobs for extended periods. Temporary files, job output, core dumps, and scratch data written to /tmp or /var/log can accumulate silently over time until the root disk hits capacity.

This is a guest-side issue — the root disk lives inside your VM and Crusoe does not have access to clean it up on your behalf. This guide walks through identifying what is consuming the disk, clearing it safely, and resuming the node.

Prerequisites

SSH Access to the Affected Node
Sudo Privileges on the Node
Access to Run scontrol Commands from the Slurm Head Node

Instructions

Confirm the Drain Reason
- From your head node, run:
```
scontrol show node <node-name>
```
- Look for the Reason field in the output. If it reads "SlurmdSpoolDir is full", a full root disk is the likely cause.
- Also check the disk usage from inside the node:
```
df -h
```
- A root partition at 100% usage confirms the issue.
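- To script this check, a minimal sketch along the following lines may help. It assumes a POSIX shell with df and awk on the node; the function names parse_df_pct and fs_usage_pct are illustrative and not part of Slurm or any standard tool:
```shell
# Read `df -P` output on stdin and print the usage percentage (digits only)
# taken from the first data row. -P keeps each filesystem on a single line,
# so the fifth column is always the capacity figure (for example "97%").
parse_df_pct() {
    awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# Print the usage percentage of the filesystem that holds a path (default: /).
fs_usage_pct() {
    df -P "${1:-/}" | parse_df_pct
}

# Usage (hypothetical): report the root filesystem usage.
#   echo "Root filesystem is $(fs_usage_pct /)% full"
```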

Identify What Is Consuming the Disk
- Check individual top-level directories one at a time. Running du -sh /* can hang on a completely full disk, so check directories separately:
```
sudo du -sh /var/log
sudo du -sh /tmp
sudo du -sh /home
sudo du -sh /opt
sudo du -sh /var/spool
```
- ℹ️ Note: /tmp is a common culprit, especially on nodes that have been running long jobs. Job output files, core dumps, and scratch data can accumulate there over time.
- If du hangs, or the directory totals do not account for the usage that df reports, check for files that were deleted while still open. A process holding a deleted file open keeps its disk space allocated until the process closes the file or exits:
```
sudo lsof +L1
```
- If lsof lists large files, restarting the process that holds them releases the space immediately, with nothing to delete manually.
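- The per-directory checks above can also be wrapped in a small helper that ranks the results. This is a sketch only, assuming a POSIX shell and a du that supports -x; dir_sizes is an illustrative name:
```shell
# Print "<size-in-KiB><TAB><dir>" for each directory that exists, largest
# first. -x keeps du on a single filesystem so separately mounted volumes are
# not counted against the root disk; -k reports KiB so the sizes sort
# numerically. Directories that do not exist are skipped.
dir_sizes() {
    for d in "$@"; do
        [ -d "$d" ] || continue
        du -sxk "$d" 2>/dev/null
    done | sort -rn
}

# Usage (from a root shell, so every directory is readable):
#   dir_sizes /var/log /tmp /home /opt /var/spool
```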

Drill Into the Largest Directory
- Once you have identified the largest directory, drill down to find specific files. For example, if /tmp was found to be the largest:
```
du -sh /tmp/* 2>/dev/null | sort -rh | head -20
```
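- When the space is spread across nested subdirectories, listing individual large files can be quicker than walking the tree by hand. The sketch below assumes GNU find (for the M size suffix), which is typical on Linux nodes; big_files and its 100 MiB default threshold are illustrative choices:
```shell
# List regular files larger than a threshold (in MiB, default 100) under a
# directory (default /tmp), biggest first, without crossing onto other
# filesystems. Output is "<size-in-KiB><TAB><path>", limited to 20 entries.
big_files() {
    find "${1:-/tmp}" -xdev -type f -size +"${2:-100}"M \
        -exec du -k {} + 2>/dev/null | sort -rn | head -n 20
}

# Usage (from a root shell): files over 500 MiB anywhere under /var.
#   big_files /var 500
```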

Clear the Disk
- If the directory is the culprit and no critical processes are actively using those files, it is generally safe to clear. For example, for /tmp:
```
sudo rm -rf /tmp/*
```
- ⚠️ Warning: Only you have visibility into what is running on your node. Verify nothing critical depends on files in the target directory before deleting. Note that lsof +L1 lists only files that are already deleted; to see files in the directory that are still open, run, for example, sudo lsof +D /tmp.
- After clearing, confirm disk space has been freed:
```
df -h
```
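- If clearing the whole directory is too broad, a more conservative option is to remove only files that have not been modified recently. The following is a sketch under stated assumptions: it relies on the -delete action of GNU or BSD find, the function name purge_old_files and the DRY_RUN variable are hypothetical, and the right age threshold depends on your workloads:
```shell
# Remove regular files under directory $1 that were last modified more than
# $2 days ago, staying on one filesystem. With DRY_RUN unset or set to 1 the
# matching files are only listed; nothing is deleted unless DRY_RUN is set
# to another value such as 0.
purge_old_files() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        find "$1" -xdev -type f -mtime +"$2" -print
    else
        find "$1" -xdev -type f -mtime +"$2" -delete
    fi
}

# Usage (from a root shell): review first, then delete.
#   DRY_RUN=1; purge_old_files /tmp 7
#   DRY_RUN=0; purge_old_files /tmp 7
```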

Resume the Node in Slurm
- Once the disk has sufficient free space, resume the node from your head node:
```
scontrol update nodename=<node-name> state=resume
```
- Verify the node is back in a healthy state:
```
scontrol show node <node-name>
```
- The State field should no longer show DRAIN.
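- To check the result from a script rather than by eye, the State field can be extracted from the scontrol output. This sketch assumes the State=... token appears as a space-separated field, as it does in typical scontrol show node output; node_state and node_is_drained are illustrative names:
```shell
# Read `scontrol show node` output on stdin and print the value of the
# State field (for example "IDLE" or "IDLE+DRAIN").
node_state() {
    tr ' ' '\n' | sed -n 's/^State=//p' | head -n 1
}

# Succeed (exit status 0) if the State read from stdin contains DRAIN.
node_is_drained() {
    case "$(node_state)" in
        *DRAIN*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (on the head node):
#   scontrol show node <node-name> | node_state
#   scontrol show node <node-name> | node_is_drained && echo "still drained"
```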

Example

A Slurm node running long training jobs accumulates 107G of temporary files in /tmp over several days, filling the 124G root disk entirely. slurmd detects the full spool directory and drains the node with the reason "SlurmdSpoolDir is full". The user SSHs into the node, confirms with du -sh that /tmp is the culprit, clears it with sudo rm -rf /tmp/*, and resumes the node with scontrol update nodename=<node-name> state=resume. The node returns to a healthy, schedulable state immediately.
