Introduction
When a Slurm node's root disk fills up completely, Slurmd automatically drains the node and marks it unavailable with the reason SlurmdSpoolDir is full. The SlurmdSpoolDir is the directory where Slurmd stores its state files — when the filesystem it lives on runs out of space, Slurmd can no longer write its own state and drains the node as a safety measure.
The node will show a DRAIN state in sinfo or scontrol output, and may also appear as down with the reason "Not responding" in sinfo -R output. It will stop accepting new jobs until the disk is cleared and the node is manually resumed.
This commonly surfaces on nodes that have been running long jobs for extended periods. Temporary files, job output, core dumps, and scratch data written to /tmp or /var/log can accumulate silently over time until the root disk hits capacity.
This is a guest-side issue — the root disk lives inside your VM and Crusoe does not have access to clean it up on your behalf. This guide walks through identifying what is consuming the disk, clearing it safely, and resuming the node.
Prerequisites
- SSH Access to the Affected Node
- Sudo Privileges on the Node
- Access to Run scontrol Commands from the Slurm Head Node
Instructions
- Confirm the Drain Reason
  - From your head node, run:
    scontrol show node <node-name>
  - Look for the Reason field in the output. If it reads SlurmdSpoolDir is full, a full root disk is the cause.
  - Also check the disk usage from inside the node:
    df -h
  - A root partition at 100% usage confirms the issue.
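Both checks in this step can be scripted. The following is a minimal sketch, assuming a POSIX shell with sed and awk available. The helper names drain_reason and usage_pct are hypothetical and are not Slurm commands. Both read command output on stdin, and the Reason= prefix is assumed from the field name that scontrol show node prints.

```shell
# Hypothetical helpers (not part of Slurm); both read command output on stdin.

# Print the text that follows "Reason=" in `scontrol show node` output.
drain_reason() {
    sed -n 's/^[[:space:]]*Reason=//p'
}

# Print the Capacity column (digits only) from `df -P` output.
# -P requests the POSIX format: one header line, then one line per filesystem.
usage_pct() {
    awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage, on the head node:    scontrol show node NODE_NAME | drain_reason
# Usage, on the drained node: df -P / | usage_pct
```

Reading from stdin keeps the helpers independent of where the command runs, so the same functions work on the head node and on the affected node.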
- Identify What Is Consuming the Disk
  - Check individual top-level directories one at a time. Running du -sh /* can hang on a completely full disk, so check directories separately:
    du -sh /var/log
    du -sh /tmp
    du -sh /home
    du -sh /opt
    du -sh /var/spool
    ℹ️ Note: /tmp is a common culprit, especially on nodes that have been running long jobs. Job output files, core dumps, and scratch data can accumulate there over time.
  - If du is still hanging, or the directory totals do not account for the usage that df reports, check for open but deleted files. A process may be holding a large deleted file open, preventing disk space from being reclaimed:
    lsof +L1
  - If lsof shows large files, restarting the associated process will free the space immediately without needing to delete anything manually.
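The per-directory checks above can be wrapped in a small loop that sorts the results. This is a sketch only, assuming a POSIX shell. dir_sizes is a hypothetical helper name, and sizes are printed in KiB rather than the human-readable units that du -sh produces, so that they sort numerically.

```shell
# Hypothetical helper: print the size in KiB of each directory given, largest first.
dir_sizes() {
    for d in "$@"; do
        # Skip paths that do not exist on this node.
        [ -d "$d" ] || continue
        # -s: one total per directory; -x: stay on one filesystem; -k: KiB units.
        # du exits non-zero when it hits unreadable subdirectories; keep going.
        du -sxk "$d" 2>/dev/null || :
    done | sort -rn
}

# Usage: dir_sizes /var/log /tmp /home /opt /var/spool
```

Each directory is still measured in its own du invocation, so a slow directory delays only its own line. Run it from a root shell if du needs to read directories owned by other users.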
- Drill Into the Largest Directory
  - Once you have identified the largest directory, drill down to find specific files. For example, if /tmp was found to be the largest:
    du -sh /tmp/* 2>/dev/null | sort -rh | head -20
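A shell glob such as /tmp/* does not match hidden entries (names that start with a dot), so large dot-directories can be missed. The variant below is a sketch, assuming a du that supports the -d depth option (GNU coreutils and BSD du both do). top_entries is a hypothetical helper name.

```shell
# Hypothetical helper: list the largest entries directly inside a directory (KiB).
# $1 is the directory (default /tmp); $2 is how many lines to show (default 20).
top_entries() {
    # -a: include files as well as directories; -d 1: descend one level only;
    # -x: stay on one filesystem; -k: KiB units so sort -rn orders correctly.
    # The total for the directory itself is included in the listing.
    du -axk -d 1 "${1:-/tmp}" 2>/dev/null | sort -rn | head -n "${2:-20}"
}

# Usage: top_entries /tmp 20
```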
- Clear the Disk
  - If the directory is the culprit and no critical processes are actively using those files, it is generally safe to clear. For example, for /tmp:
    sudo rm -rf /tmp/*
    ⚠️ Warning: Only you have visibility into what is running on your node. Verify nothing critical depends on files in the target directory before deleting. Cross-reference against lsof +L1 output to check for open file handles.
  - After clearing, confirm disk space has been freed:
    df -h
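Where removing everything in the directory is too broad, a more conservative option is to delete only files that have not been modified for a number of days. This is an alternative to the rm -rf command above, not a step this guide prescribes. The helper names and the 7-day cutoff in the usage lines are assumptions for illustration.

```shell
# Hypothetical helpers; the directory and age cutoff are supplied by the caller.

# List regular files under $1 whose last modification is more than $2 days old.
stale_files() {
    find "$1" -xdev -type f -mtime +"$2" -print
}

# Delete the same set of files. Review the stale_files output first.
purge_stale() {
    find "$1" -xdev -type f -mtime +"$2" -exec rm -f {} +
}

# Usage (from a root shell, since sudo cannot see shell functions):
#   stale_files /tmp 7     # dry run: print what would be removed
#   purge_stale /tmp 7     # remove those files
```

Listing before deleting gives a chance to spot files that a running job still needs. Empty directories are left in place because only regular files are matched.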
- Resume the Node in Slurm
  - Once the disk has sufficient free space, resume the node from your head node:
    scontrol update nodename=<node-name> state=resume
  - Verify the node is back in a healthy state:
    scontrol show node <node-name>
  - The State field should no longer show DRAIN.
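The final verification can also be done with a short check instead of reading the output by eye. This is a minimal sketch, assuming the State field appears as State= followed by a value such as IDLE+DRAIN. is_drained is a hypothetical helper name, not a Slurm command.

```shell
# Hypothetical helper: succeed if `scontrol show node` output on stdin
# reports a State value containing DRAIN (for example IDLE+DRAIN).
is_drained() {
    grep -q 'State=[^[:space:]]*DRAIN'
}

# Usage, on the head node:
#   if scontrol show node NODE_NAME | is_drained; then
#       echo "node is still drained"
#   fi
```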
Example
A Slurm node running long training jobs accumulates 107G of temporary files in /tmp over several days, filling the 124G root disk entirely. Slurmd detects the full spool directory and drains the node with the reason SlurmdSpoolDir is full. The user SSHs into the node, confirms /tmp is the culprit using du -sh, clears it with sudo rm -rf /tmp/*, and resumes the node with scontrol update nodename=<node-name> state=resume. The node returns to a healthy schedulable state immediately.