
Slurm Nodes in PLND State: Why Jobs Aren’t Starting and How to Fix It

Tanaya Atmaram Kambli

Last Updated: Mar 31, 2026

Overview:

This article covers a scenario where Slurm compute nodes remain stuck in the PLND (planned) state, preventing jobs from starting even though the nodes appear idle and healthy.

In this state, attempts to manually reset nodes (for example, using scontrol update State=IDLE) do not take effect.

Important:
The PLND state is not an error condition. It indicates that Slurm has reserved the node(s) for a future job. Although the node may appear idle, it is intentionally withheld from lower-priority jobs that would delay that job's planned start.

This situation is commonly misunderstood as:

  • Node failure or unresponsiveness
  • Slurm daemon (slurmd) issues
  • Infrastructure or VM-level problems

However, in many cases, the root cause is scheduler behavior, specifically future job reservations created by Slurm’s backfill scheduler.

Prerequisites:

  • Access to a Slurm cluster
  • Permissions to run Slurm commands (sinfo, squeue, scontrol)
  • Access to a login or head node
  • Basic familiarity with Slurm scheduling

Step-by-Step Instructions:

1. Confirm Node State

Check the current state of nodes:

$ sinfo

Example output:

PARTITION  AVAIL  TIMELIMIT  NODES  STATE  NODELIST
batch*     up     infinite   2      plnd   slurm-compute-node-[9,15]

If nodes show plnd, proceed to the next steps.
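
On a large cluster it can help to list only the affected nodes. The following is a minimal sketch, not an official procedure: planned_nodes is a hypothetical helper name, and it assumes the node-oriented output format "%N %T" (node name, then extended state).

```shell
#!/bin/sh
# Hypothetical helper: reads "<node> <state>" lines on stdin, as produced by
#   sinfo -N -h -o "%N %T"
# and prints the names of nodes whose state mentions "planned" or "plnd".
planned_nodes() {
  awk 'tolower($2) ~ /planned|plnd/ { print $1 }'
}

# Only query the cluster when sinfo is installed on this machine.
if command -v sinfo >/dev/null 2>&1; then
  # -N prints one line per node and partition; sort -u removes duplicates.
  sinfo -N -h -o "%N %T" | planned_nodes | sort -u
fi
```

The parsing is kept in a separate function so it can be checked against saved sinfo output without access to the cluster.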

2. Attempt Manual Reset (Expected to Fail)

Try setting the node to IDLE:

$ sudo scontrol update NodeName="slurm-compute-node-[9,15]" State=IDLE

If the state does not change, this confirms that PLND is being enforced by the scheduler rather than by a manually set node state.

3. Check for Reservations

$ scontrol show reservations
  • If reservations exist → the nodes may be reserved for maintenance or administrative purposes
  • If no reservations are present → continue; this is expected in backfill scenarios

4. Inspect Node Details

$ scontrol show node <node-name>
Example (for the above scenario):
$ scontrol show node slurm-compute-node-9
$ scontrol show node slurm-compute-node-15

Look for:

  • State=IDLE+PLANNED
  • Any Reason= field (e.g., Not responding), which would point to a genuine node problem rather than scheduling

If nodes appear healthy but still PLANNED, continue.
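
To compare several nodes at a glance, the State and Reason fields can be pulled out of the scontrol output. This is a rough sketch under stated assumptions: node_summary is a hypothetical helper, and it assumes each node record begins with NodeName= at the start of a line and that any Reason= text runs to the end of its line.

```shell
#!/bin/sh
# Hypothetical helper: reads `scontrol show node ...` output on stdin and
# prints one line per node: "<name> <state> <reason, or - if none>".
node_summary() {
  awk '
    function emit() {
      if (name != "") print name, state, (reason == "" ? "-" : reason)
    }
    # A new node record starts: print the previous one and reset.
    /^NodeName=/ { emit(); name = state = reason = "" }
    {
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^NodeName=/)   name  = substr($i, 10)
        else if ($i ~ /^State=/) state = substr($i, 7)
      }
      # Reason text may contain spaces, so take the rest of the line.
      if (match($0, /Reason=.*/)) reason = substr($0, RSTART + 7)
    }
    END { emit() }
  '
}

# Only query the cluster when scontrol is installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
  scontrol show node "slurm-compute-node-[9,15]" | node_summary
fi
```

A line ending in "-" with a state such as IDLE+PLANNED suggests the node itself is healthy and the scheduler is holding it.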

5. Check for Pending Jobs

$ squeue -t PD

Look for jobs with:

  • Large node requirements
  • Future start times
  • High priority
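
One way to surface such jobs quickly is to print the job ID, node count, priority, and expected start time for each pending job and sort the result. The sketch below is illustrative only: rank_pending is a hypothetical helper, and the chosen columns (%i, %D, %Q, %S) are an assumption about what is most useful here.

```shell
#!/bin/sh
# Hypothetical helper: reads "<jobid> <nodes> <priority> <start>" lines on
# stdin and sorts by priority (descending), then node count (descending).
rank_pending() {
  sort -k3,3nr -k2,2nr
}

# Only query the cluster when squeue is installed on this machine.
if command -v squeue >/dev/null 2>&1; then
  # %i = job ID, %D = node count, %Q = priority, %S = expected start time
  squeue -h -t PD -o "%i %D %Q %S" | rank_pending
fi
```

Jobs near the top of the list with a concrete start time (rather than N/A) are the likeliest candidates for holding nodes in PLND.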

6. Identify Future Reservation (Root Cause)

Inspect specific pending jobs:

$ scontrol show job <job_id>

Look for fields like:

  • StartTime=... (future timestamp)
  • SchedNodeList=... (includes affected nodes)

Key Insight:
If a pending job has a future StartTime and its SchedNodeList includes the affected nodes, the backfill scheduler has planned that job onto those nodes and is holding them for it, which places them in PLND.
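
When many jobs are pending, checking them one by one is slow. As a hedged sketch, the helper below scans the one-line-per-job output of scontrol and keeps only jobs that carry a SchedNodeList. planned_jobs is a hypothetical name, and the parsing assumes space-separated key=value fields with no spaces inside the values it reads.

```shell
#!/bin/sh
# Hypothetical helper: reads `scontrol -o show job` output (one job per line)
# on stdin and prints "<jobid> <start time> <planned nodes>" for every job
# that has a SchedNodeList.
planned_jobs() {
  awk '
    {
      id = start = nodes = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^JobId=/)              id    = substr($i, 7)
        else if ($i ~ /^StartTime=/)     start = substr($i, 11)
        else if ($i ~ /^SchedNodeList=/) nodes = substr($i, 15)
      }
      if (nodes != "" && nodes != "(null)") print id, start, nodes
    }
  '
}

# Only query the cluster when scontrol is installed on this machine.
if command -v scontrol >/dev/null 2>&1; then
  scontrol -o show job | planned_jobs
fi
```

The node list is printed as Slurm reports it, for example slurm-compute-node-[9,15], so compare it against the PLND nodes from step 1.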

7. Validate Scheduler Behavior

At this point:

  • Nodes are healthy ✅
  • No explicit reservations exist ✅
  • Nodes are still PLND

This confirms the cause:

👉 The Slurm backfill scheduler is reserving the nodes for a future job

8. Release Nodes (If Needed)

To make nodes available immediately, modify the blocking job:

Option A: Hold the job

$ scontrol hold <job_id>

Option B: Cancel the job

$ scancel <job_id>

Option C: Adjust job requirements

  • Reduce node count
  • Reduce time limit

Once the job no longer claims the nodes, the PLND state clears. This may not be immediate; allow for the next scheduler cycle before re-checking.
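
The options above map onto scontrol subcommands. The wrapper below is a cautious sketch rather than a recommended tool: run, hold_job, release_job, and shrink_job are hypothetical names, and by default the script only prints the command it would execute (DRY_RUN=1), so nothing changes until that is explicitly turned off. Changing another user's job typically requires administrator or operator privileges.

```shell
#!/bin/sh
# Hypothetical wrapper: print the command when DRY_RUN=1 (the default),
# otherwise execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Option A: hold a pending job, and release it again later.
hold_job()    { run scontrol hold "$1"; }
release_job() { run scontrol release "$1"; }

# Option C: shrink a pending job.
#   $1 = job ID, $2 = new node count, $3 = new time limit (e.g. 01:00:00)
shrink_job()  { run scontrol update "JobId=$1" "NumNodes=$2" "TimeLimit=$3"; }
```

For example, hold_job 4242 prints "scontrol hold 4242", while DRY_RUN=0 hold_job 4242 actually applies the hold. The job ID 4242 is a placeholder.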

9. Verify Node State

$ sinfo -N -o "%N %T"

Expected result:

slurm-compute-node-9  idle
slurm-compute-node-15 idle

Resolution:

Nodes stuck in PLND state are typically reserved by Slurm for a future scheduled job, not due to node failure or infrastructure issues.

In this case:

  • A pending job with a future StartTime requires multiple nodes
  • Slurm reserved those nodes in advance
  • This prevented other jobs from running, even though nodes appeared idle

Fix:
Modify or remove the blocking job (hold, cancel, or adjust constraints) to release the nodes back to the scheduler.

Additional Notes:

  • PLANNED is a flag the scheduler sets on top of the base state (hence State=IDLE+PLANNED), so manually setting State=IDLE does not remove it
  • This behavior is expected in clusters using backfill scheduling
  • Misinterpreting PLND as a failure can lead to unnecessary debugging (e.g., restarting slurmd)
  • Always check pending jobs before investigating node health
