How-To Customize slurm.conf on Crusoe Managed Slurm

Sagar Lulla

Introduction

Crusoe Managed Slurm renders slurm.conf automatically from a Kubernetes CRD. The Crusoe Slurm Operator and the upstream Slinky operator both regenerate the file on every reconcile, so direct edits on the controller pod (e.g., vim /etc/slurm/slurm.conf) get reverted within seconds.

To change Slurm settings persistently (adding a Prolog/Epilog, raising MaxJobCount, tuning SchedulerParameters, and so on), append directives to the extraConf field on the Slinky Controller CRD. The operator writes them into the rendered slurm.conf above its own marker-delimited block, and a sidecar in the controller pod runs scontrol reconfigure automatically when the ConfigMap changes.

This article shows the supported workflow and the safest mechanics for applying it.

Prerequisites

  • Crusoe Managed Slurm Cluster Already Provisioned
  • Kubeconfig Access to the Underlying CMK Cluster
  • A slurm.conf Directive You Want to Set (e.g., Prolog, MaxJobCount, SchedulerParameters)

Instructions

Step 1: Understand Where the Source of Truth Lives

The rendered slurm.conf on the controller pod is generated by the Slinky operator from the Controller CRD’s spec.extraConf field. Inspect the CRD and the current value:

kubectl get controllers.slinky.slurm.net -n slurm
kubectl get controllers.slinky.slurm.net <name> -n slurm -o jsonpath='{.spec.extraConf}'

The rendered file on disk has a clearly marked operator-injected block:

### EXTRA CONFIG ###

# THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
# ===============================START======================================
SlurmctldDebug=debug5
...
PartitionName=all Nodes=ALL Default=YES MaxTime=UNLIMITED State=UP
# ================================END=======================================

Your custom directives go above the operator's banner comment and START marker; that region is preserved across reconciles. Anything you put between START and END will be overwritten on the next operator pass.

⚠️ Warning: Do not exec into the slurmctld pod and edit /etc/slurm/slurm.conf directly. The file is mounted from a projected ConfigMap, and the operator regenerates it on every reconcile. Edits made on disk will disappear.

Step 2: Choose Your Edit Mechanism

kubectl edit controllers.slinky.slurm.net ... works but is fragile for extraConf:

  • The CRD often stores extraConf as a double-quoted YAML scalar ("...\n...") rather than a literal block scalar (|), which makes manual editing error-prone: every newline escape and quote must be exactly right or the rendered file fails to parse.
  • Shell and editor setups with bracketed paste or auto-indent (for example zsh with oh-my-zsh) can corrupt heredocs and multi-line YAML when copy-pasting.

The reliable pattern is get → fix → patch:

  1. Export the current extraConf to a plain text file.
  2. Edit the text file with whatever editor handles your terminal cleanly.
  3. Patch the resource using kubectl patch --type=merge --patch-file, which sends the file contents byte-for-byte to the API server.
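To illustrate why a patch that carries only spec.extraConf leaves the rest of the Controller spec untouched, here is a minimal Python sketch of JSON merge patch semantics (RFC 7386), which is what --type=merge requests. The sample resource and its replicas field are hypothetical placeholders rather than the real Controller schema, and this is not the API server's implementation.

```python
import json


def merge_patch(target, patch):
    """Apply a JSON merge patch (RFC 7386) to target and return the result."""
    if not isinstance(patch, dict):
        # A non-object patch replaces the target wholesale.
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            # null in a merge patch means "delete this key".
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


# Hypothetical resource: only extraConf mirrors the article; "replicas" is invented.
resource = {"spec": {"replicas": 1, "extraConf": "MaxJobCount=10000\n"}}
patch = {"spec": {"extraConf": "MaxJobCount=20000\n"}}

patched = merge_patch(resource, patch)
print(json.dumps(patched, indent=2))
```

Note that a merge patch replaces a string field in full, so the patch must carry the complete extraConf text, including the operator block exported in step 1.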

Step 3: Apply a Worked Example - Raise MaxJobCount

Export the current value to a file:

kubectl get controllers.slinky.slurm.net <name> -n slurm -o jsonpath='{.spec.extraConf}' > /tmp/current.txt
cat /tmp/current.txt

Edit /tmp/current.txt with any editor. Add your directive above the operator’s START marker:

# ===== USER SECTION =====
MaxJobCount=20000
# ===== END USER SECTION =====

# THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
# ===============================START======================================
... (leave this block alone) ...
# ================================END=======================================
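If you would rather script this edit than do it by hand, the placement rule can be sketched in Python as follows. This is an illustrative helper, not part of any Crusoe tooling: the function name add_user_section is hypothetical, and it assumes the exported text contains the operator banner or START marker exactly as shown above.

```python
USER_BEGIN = "# ===== USER SECTION ====="
USER_END = "# ===== END USER SECTION ====="
BANNER = "AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR"


def add_user_section(extra_conf, directives):
    """Return extra_conf with a user section placed above the operator block."""
    lines = extra_conf.splitlines()
    if USER_BEGIN in lines:
        raise ValueError("a USER SECTION already exists; edit it by hand instead")
    for index, line in enumerate(lines):
        # Stop at the banner comment, or at the START marker if no banner is present.
        if BANNER in line or "=START=" in line:
            break
    else:
        raise ValueError("operator block not found; refusing to guess where to insert")
    block = [USER_BEGIN, *directives, USER_END, ""]
    return "\n".join(lines[:index] + block + lines[index:]) + "\n"


# Shortened sample of an exported extraConf value.
sample = (
    "# THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR\n"
    "# ===============================START======================================\n"
    "SlurmctldDebug=debug5\n"
    "# ================================END=======================================\n"
)
print(add_user_section(sample, ["MaxJobCount=20000"]))
```

The helper refuses to run when a USER SECTION already exists, so it never duplicates the block or reorders directives you added earlier.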

Build a JSON merge patch from the edited file:

python3 -c "import json;print(json.dumps({'spec':{'extraConf':open('/tmp/current.txt').read()}}))" > /tmp/patch.json

Apply it:

kubectl patch controllers.slinky.slurm.net <name> -n slurm --type=merge --patch-file=/tmp/patch.json

⚠️ Warning: Each Slurm directive must be on its own line. If two directives end up on the same line (e.g., MaxJobCount=20000 Prolog=/...), the rendered slurm.conf will fail to parse, the reconfigure sidecar will keep retrying, and the cluster will silently continue running the previous good config.
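A quick pre-flight check can catch this before you patch. The sketch below is a heuristic, not a Slurm parser: it flags any line that contains more than one Name=value token. It is meant to be run on your USER SECTION text only, since operator-generated lines such as PartitionName=... legitimately carry several fields. The function name crowded_lines is hypothetical.

```python
import re

# A token that looks like the start of a directive, e.g. "Prolog=/path".
DIRECTIVE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*=")


def crowded_lines(user_section):
    """Yield (line_number, text) for lines that appear to hold several directives."""
    for number, raw in enumerate(user_section.splitlines(), start=1):
        text = raw.split("#", 1)[0].strip()  # ignore comments
        hits = [token for token in text.split() if DIRECTIVE.match(token)]
        if len(hits) > 1:
            yield number, raw


good = "MaxJobCount=20000\nProlog=/home/scripts/prolog.sh\n"
bad = "MaxJobCount=20000 Prolog=/home/scripts/prolog.sh\n"

print(list(crowded_lines(good)))
print(list(crowded_lines(bad)))
```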

Step 4: Verify the Change Took Effect

Within roughly 10 seconds the operator regenerates the ConfigMap, kubelet swaps the projected file atomically inside the controller pod, and the reconfigure sidecar runs scontrol reconfigure. Confirm each step:

# Sidecar fired successfully
kubectl logs <controller-pod> -n slurm -c reconfigure --tail=10

# Rendered file picked up your change
kubectl exec <controller-pod> -n slurm -c slurmctld -- cat /etc/slurm/slurm.conf | grep -A1 "USER SECTION"

# Live slurmctld config sees it
kubectl exec <controller-pod> -n slurm -c slurmctld -- scontrol show config | grep -i MaxJobCount

You should see:

  • A fresh Reconfiguring Slurm... / SUCCESS entry in the sidecar log
  • MaxJobCount=20000 in /etc/slurm/slurm.conf
  • MaxJobCount = 20000 in scontrol show config
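If you verify several directives at once, the last check can be automated. The following Python sketch parses text in the Key = Value shape that scontrol show config prints and reports any expected value that is not live. The sample text is hand-written for illustration, not captured from a real cluster, and the function names are hypothetical. It compares strings exactly, so it assumes scontrol prints the value in the same form you wrote it.

```python
def parse_show_config(output):
    """Parse 'Key = Value' lines from scontrol show config into a dict."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip():
            settings[key.strip()] = value.strip()
    return settings


def mismatches(output, expected):
    """Return {key: (wanted, found)} for every expected value that is not live."""
    live = parse_show_config(output)
    return {
        key: (wanted, live.get(key))
        for key, wanted in expected.items()
        if live.get(key) != wanted
    }


# Hand-written sample in the 'Key = Value' shape; not real cluster output.
sample_output = """\
MaxArraySize            = 1001
MaxJobCount             = 20000
"""
print(mismatches(sample_output, {"MaxJobCount": "20000"}))
```

An empty result means every expected value matched. In practice you would feed it the text captured from the kubectl exec ... scontrol show config command above.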

💡 Tip: If the sidecar log shows repeated Failed to reconfigure / Parse error entries, your edited slurm.conf has a syntax error. The error message includes the exact line number and content that failed to parse. The most common cause is two directives on one line.

Step 5: Know What You Can and Cannot Set Through extraConf

The Slinky operator owns a set of directives generated from the Controller and NodeSet specs. Anything in those categories will be overwritten on reconcile. Other Slurm directives — anything documented in the slurm.conf reference and not in the operator-owned list below — can be safely added via the USER SECTION.

Operator-Owned (Do Not Override From extraConf)

These directives are generated by the operator. Setting them in the USER SECTION may appear to work briefly, but will conflict with the operator-injected values and produce parse errors or undefined behavior:

  • ClusterName, SlurmUser, SlurmctldHost, SlurmctldPort, SlurmdPort
  • StateSaveLocation, SlurmdSpoolDir
  • AuthType, CredType, AuthAltTypes (auth and JWT keys are sourced from Kubernetes Secrets)
  • NodeSet=..., PartitionName=all Nodes=ALL ... (operator generates these from NodeSet CRDs)
  • TopologyPlugin, TopologyParam (managed by Topograph)
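As a guard against accidental overrides, you could scan your USER SECTION for these names before patching. The Python sketch below copies the names from the list above into a set and compares case-insensitively, since slurm.conf parameter names are not case sensitive. It is an illustrative check, not an official validator, and the function name is hypothetical. PartitionName is deliberately left out because this article treats additional partitions as a separate case.

```python
# Names copied from the operator-owned list; extend if your cluster differs.
OPERATOR_OWNED = {
    "clustername", "slurmuser", "slurmctldhost", "slurmctldport", "slurmdport",
    "statesavelocation", "slurmdspooldir",
    "authtype", "credtype", "authalttypes",
    "nodeset",
    "topologyplugin", "topologyparam",
}


def operator_owned_conflicts(user_section):
    """Return (line_number, directive) pairs that set an operator-owned name."""
    found = []
    for number, raw in enumerate(user_section.splitlines(), start=1):
        text = raw.split("#", 1)[0].strip()  # ignore comments
        name, sep, _ = text.partition("=")
        if sep and name.strip().lower() in OPERATOR_OWNED:
            found.append((number, name.strip()))
    return found


candidate = "MaxJobCount=20000\nClusterName=demo\n# SlurmUser=slurm\nslurmctldport=6817\n"
print(operator_owned_conflicts(candidate))
```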

Common Safe-to-Set Directives

The following list is illustrative, not exhaustive — most Slurm directives outside the operator-owned set above can be configured here:

  • MaxJobCount, MaxArraySize, MaxStepCount
  • Prolog, Epilog, TaskProlog, TaskEpilog, SrunProlog, SrunEpilog
  • PrologFlags, EpilogMsgTime
  • SchedulerParameters, SchedulerType, DefMemPerNode, DefMemPerCPU
  • JobAcctGatherFrequency, additional GresTypes entries
  • DebugFlags

ℹ️ Note: Adding a second PartitionName=... line in your USER SECTION to create additional partitions alongside the operator-managed all partition is supported by Slurm in principle, since the existing NodeSet can be referenced by multiple partitions. ⚠️ Before relying on this in production, verify that the operator neither overwrites nor conflicts with the user-added partition on reconcile.

Example

A common real-world use case is wiring in Prolog and Epilog hooks for job-lifecycle logging. The USER SECTION ends up looking like this:

# ===== USER SECTION =====
MaxJobCount=20000
Prolog=/home/scripts/prolog.sh
Epilog=/home/scripts/epilog.sh
SchedulerParameters=defer,bf_continue
# ===== END USER SECTION =====

After patching, scontrol show config will show all four values as live. 
