Last Updated: Mar 31, 2026
Overview:
This article covers a scenario where Slurm compute nodes remain stuck in the PLND (planned) state, preventing jobs from starting even though the nodes appear idle and healthy.
In this state, attempts to manually reset nodes (for example, using scontrol update State=IDLE) do not take effect.
Important:
The PLND state is not an error condition. It indicates that Slurm has reserved the node(s) for a future job. While the node may appear idle, it is intentionally withheld from scheduling lower-priority jobs.
This situation is commonly misunderstood as:
- Node failure or unresponsiveness
- Slurm daemon (slurmd) issues
- Infrastructure or VM-level problems
However, in many cases, the root cause is scheduler behavior, specifically future job reservations created by Slurm’s backfill scheduler.
Prerequisites:
- Access to a Slurm cluster
- Permissions to run Slurm commands (sinfo, squeue, scontrol)
- Access to login or head node
- Basic familiarity with Slurm scheduling
Step-by-Step Instructions:
1. Confirm Node State
Check the current state of nodes:
$ sinfo
Example output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 plnd slurm-compute-node-[9,15]
If nodes show plnd, proceed to the next steps.
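On larger clusters, reading the full sinfo table by eye is slow. The sketch below is one possible way to list only the planned nodes. The function name planned_nodes is hypothetical (it is not a Slurm command), and it assumes two-column input of the form produced by sinfo -N -h -o "%N %T" (node name, then state).

```shell
#!/bin/sh
# planned_nodes: hypothetical helper, not part of Slurm.
# Reads "<node> <state>" lines on stdin and prints each node whose state
# contains "planned" or "plnd" (case-insensitive), once per node.
planned_nodes() {
    awk 'tolower($2) ~ /planned|plnd/ && !seen[$1]++ { print $1 }'
}

# Intended use on a login or head node:
#   sinfo -N -h -o "%N %T" | planned_nodes
```

The de-duplication is there because sinfo -N prints a node once for every partition it belongs to.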
2. Attempt Manual Reset (Expected to Fail)
Try setting the node to IDLE:
$ sudo scontrol update NodeName="slurm-compute-node-[9,15]" State=IDLE
If the state does not change, this confirms that PLND is being enforced by the scheduler rather than by a manually set node state.
3. Check for Reservations
$ scontrol show reservations
- If reservations exist → nodes may be reserved for maintenance or admin purposes
- If no reservations are present, continue - this is expected in backfill scenarios
4. Inspect Node Details
$ scontrol show node <node-name>
Example (for the above scenario):
$ scontrol show node slurm-compute-node-9
$ scontrol show node slurm-compute-node-15
Look for:
- State=IDLE+PLANNED
- Any Reason= field (e.g., Not responding)
If nodes appear healthy but still PLANNED, continue.
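The full scontrol show node output is long, and only two fields matter for this check. The following is a minimal sketch that pulls out just those fields. The function name node_state_summary is hypothetical, and it assumes the usual key=value layout of scontrol show node output, with Reason= starting its own line.

```shell
#!/bin/sh
# node_state_summary: hypothetical helper, not part of Slurm.
# Reads `scontrol show node` output on stdin and prints:
#   - every State=... field
#   - the full Reason=... line, if present (it may contain spaces)
node_state_summary() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^State=/) print $i
            }
        }
        /^[ \t]*Reason=/ { sub(/^[ \t]+/, ""); print }
    '
}

# Intended use:
#   scontrol show node slurm-compute-node-9 | node_state_summary
```

A healthy planned node should print only State=IDLE+PLANNED. A Reason= line suggests the node needs a closer look before blaming the scheduler.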
5. Check for Pending Jobs
$ squeue -t PD
Look for jobs with:
- Large node requirements
- Future start times
- High priority
6. Identify Future Reservation (Root Cause)
Inspect specific pending jobs:
$ scontrol show job <job_id>
Look for fields like:
- StartTime=... (future timestamp)
- SchedNodeList=... (includes affected nodes)
Key Insight:
If a job has a future start time and requires specific nodes, Slurm reserves those nodes in advance, placing them in PLND.
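When several jobs are pending, it helps to reduce each job record to the fields that reveal a future reservation. The sketch below is one way to do that. The function name job_plan_summary is hypothetical, and it assumes one job record per line, as produced by scontrol with the -o (one-line) option.

```shell
#!/bin/sh
# job_plan_summary: hypothetical helper, not part of Slurm.
# Reads one-line job records on stdin and prints, for each record,
# only the JobId, StartTime and SchedNodeList fields.
job_plan_summary() {
    awk '
        {
            out = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^(JobId|StartTime|SchedNodeList)=/) {
                    out = out (out == "" ? "" : " ") $i
                }
            }
            if (out != "") print out
        }
    '
}

# Intended use:
#   scontrol -o show job <job_id> | job_plan_summary
```

A job whose SchedNodeList contains the PLND nodes and whose StartTime lies in the future is the likely blocker.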
7. Validate Scheduler Behavior
At this point:
- Nodes are healthy ✅
- No explicit reservations exist ✅
- Nodes are still PLND ✅
This confirms the cause:
👉 Slurm backfill scheduler is reserving nodes for a future job
8. Release Nodes (If Needed)
To make nodes available immediately, modify the blocking job:
Option A: Hold the job
$ scontrol hold <job_id>
Option B: Cancel the job
$ scancel <job_id>
Option C: Adjust job requirements
- Reduce node count
- Reduce time limit
Once the job is held, cancelled, or no longer planned onto these nodes, the PLND state clears, typically after the scheduler's next backfill pass.
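Option C does not name a command. One way to apply it is scontrol update with the NumNodes and TimeLimit job fields. The wrapper below is a sketch under assumptions: adjust_job and the DRY_RUN variable are hypothetical names, and the values in the usage comment are placeholders. By default it only prints the command so it can be reviewed first.

```shell
#!/bin/sh
# adjust_job: hypothetical wrapper, not part of Slurm.
# Usage: adjust_job <job_id> <num_nodes> <time_limit>
# Prints the scontrol command by default; set DRY_RUN=0 to execute it.
adjust_job() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        scontrol update "JobId=$1" "NumNodes=$2" "TimeLimit=$3"
    else
        echo "scontrol update JobId=$1 NumNodes=$2 TimeLimit=$3"
    fi
}

# Example (placeholder values):
#   adjust_job <job_id> 1 01:00:00             # prints the command only
#   DRY_RUN=0 adjust_job <job_id> 1 01:00:00   # runs it
```

If the job was held under Option A, scontrol release <job_id> makes it eligible for scheduling again.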
9. Verify Node State
$ sinfo -N -o "%N %T"
Expected result (lines for the affected nodes; sinfo prints states in lowercase):
NODELIST STATE
slurm-compute-node-9 idle
slurm-compute-node-15 idle
Resolution:
Nodes stuck in the PLND state are typically reserved by Slurm for a job scheduled to start in the future; the cause is not node failure or an infrastructure issue.
In this case:
- A pending job with a future StartTime requires multiple nodes
- Slurm reserved those nodes in advance
- This prevented other jobs from running, even though nodes appeared idle
Fix:
Modify or remove the blocking job (hold, cancel, or adjust constraints) to release the nodes back to the scheduler.
Additional Notes:
- PLND overrides IDLE - manual state changes will not persist
- This behavior is expected in clusters using backfill scheduling
- Misinterpreting PLND as a failure can lead to unnecessary debugging (e.g., restarting slurmd)
- Always check pending jobs before investigating node health
Additional Resources:
- Slurm documentation: Node states and scheduling behavior
- man scontrol (for PLANNED state details)
- squeue, sinfo, scontrol command references