Introduction
Crusoe Managed Slurm renders slurm.conf automatically from a Kubernetes CRD. The Crusoe Slurm Operator and the upstream Slinky operator both regenerate the file on every reconcile, so direct edits on the controller pod (e.g., vim /etc/slurm/slurm.conf) get reverted within seconds.
To persistently change Slurm settings - adding a Prolog/Epilog, raising MaxJobCount, tuning SchedulerParameters, etc. - you append directives to the extraConf field on the Slinky Controller CRD. The operator writes them into the rendered slurm.conf between marker comments, and a sidecar in the controller pod runs scontrol reconfigure automatically when the ConfigMap changes.
This article shows the supported workflow and the safest mechanics for applying it.
Prerequisites
- Crusoe Managed Slurm Cluster Already Provisioned
- Kubeconfig Access to the Underlying CMK Cluster
- A slurm.conf Directive You Want to Set (e.g., Prolog, MaxJobCount, SchedulerParameters)
Instructions
Step 1: Understand Where the Source of Truth Lives
The rendered slurm.conf on the controller pod is generated by the Slinky operator from the Controller CRD’s spec.extraConf field. Inspect the CRD and the current value:
kubectl get controllers.slinky.slurm.net -n slurm
kubectl get controllers.slinky.slurm.net <name> -n slurm -o jsonpath='{.spec.extraConf}'
The rendered file on disk has a clearly marked operator-injected block:
### EXTRA CONFIG ###
# THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
# ===============================START======================================
SlurmctldDebug=debug5
...
PartitionName=all Nodes=ALL Default=YES MaxTime=UNLIMITED State=UP
# ================================END=======================================
Your custom directives go above the START marker; that area is preserved across reconciles. Anything you put between START and END will be overwritten on the next operator pass.
⚠️ Warning: Do not exec into the slurmctld pod and edit /etc/slurm/slurm.conf directly. The file is mounted from a projected ConfigMap, and the operator regenerates it on every reconcile. Edits made on disk will disappear.
Step 2: Choose Your Edit Mechanism
kubectl edit controllers.slinky.slurm.net ... works but is fragile for extraConf:
- The CRD often stores extraConf as a double-quoted YAML scalar ("...\n...") rather than a literal block (|), which makes manual editing error-prone: every newline and quote must be exactly right or the rendered file fails to parse.
- Terminals with bracketed-paste or auto-indent (zsh, oh-my-zsh) often corrupt heredocs and multi-line YAML when copy-pasting.
The reliable pattern is get → fix → patch:
- Export the current extraConf to a plain text file.
- Edit the text file with any editor that works cleanly in your terminal.
- Patch the resource using kubectl patch --type=merge --patch-file, which reads the patch from a file so the contents never pass through shell quoting or terminal paste handling.
Step 3: Apply a Worked Example - Raise MaxJobCount
Export the current value to a file:
kubectl get controllers.slinky.slurm.net <name> -n slurm -o jsonpath='{.spec.extraConf}' > /tmp/current.txt
cat /tmp/current.txt
Edit /tmp/current.txt with any editor. Add your directive above the operator’s START marker:
# ===== USER SECTION =====
MaxJobCount=20000
# ===== END USER SECTION =====
# THE FOLLOWING SETTINGS ARE AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR
# ===============================START======================================
... (leave this block alone) ...
# ================================END=======================================
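If you would rather script the edit than make it by hand, the placement rule (user directives above the operator block) can be sketched in Python. This is a minimal sketch, not part of the operator: add_user_section is a hypothetical helper, and it assumes the exported text has no USER SECTION yet and that the operator header comment appears exactly as shown above.

```python
USER_BEGIN = "# ===== USER SECTION ====="
USER_END = "# ===== END USER SECTION ====="
# Assumption: this header text matches what the operator renders.
OPERATOR_HEADER = "AUTOMATICALLY INJECTED BY CRUSOE SLURM OPERATOR"


def add_user_section(extra_conf: str, directives: list[str]) -> str:
    """Insert a USER SECTION, one directive per line, above the operator block."""
    lines = extra_conf.splitlines()
    # Locate the operator header; if it is absent, append at the end instead.
    insert_at = next(
        (i for i, line in enumerate(lines) if OPERATOR_HEADER in line),
        len(lines),
    )
    block = [USER_BEGIN, *directives, USER_END]
    return "\n".join(lines[:insert_at] + block + lines[insert_at:]) + "\n"
```

Calling add_user_section(open('/tmp/current.txt').read(), ['MaxJobCount=20000']) returns the edited text, which you can write back to /tmp/current.txt before building the patch.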
Build a JSON merge patch from the edited file:
python3 -c "import json;print(json.dumps({'spec':{'extraConf':open('/tmp/current.txt').read()}}))" > /tmp/patch.json
Apply it:
kubectl patch controllers.slinky.slurm.net <name> -n slurm --type=merge --patch-file=/tmp/patch.json
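The patch-building one-liner can also be written as a small function, which is easier to extend. The sketch below produces the same {"spec": {"extraConf": ...}} merge patch; build_merge_patch is a hypothetical name, and the line-ending normalization is an added assumption (useful if the file was edited on Windows), not something the operator requires.

```python
import json


def build_merge_patch(extra_conf: str) -> str:
    """Return a JSON merge patch that sets spec.extraConf to the given text."""
    # Assumption: convert CRLF to LF so each directive ends in a plain newline.
    normalized = extra_conf.replace("\r\n", "\n")
    return json.dumps({"spec": {"extraConf": normalized}})


# Example usage (paths as in the steps above):
#   with open("/tmp/current.txt") as src, open("/tmp/patch.json", "w") as dst:
#       dst.write(build_merge_patch(src.read()))
```

Because json.dumps escapes every newline and quote, the resulting file is safe to pass to kubectl patch --patch-file regardless of what the directives contain.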
⚠️ Warning: Each Slurm directive must be on its own line. If two directives end up on the same line (e.g., MaxJobCount=20000 Prolog=/...), the rendered slurm.conf will fail to parse, the reconfigure sidecar will keep retrying, and the cluster will silently continue running the previous good config.
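A quick pre-flight check can catch merged lines before you patch. The sketch below is a heuristic, not a slurm.conf parser: find_merged_directives is a hypothetical helper, and TOP_LEVEL_KEYS is an illustrative subset of directive names that you would extend for your own settings. Lines such as PartitionName=... legitimately carry several key=value pairs, so only keys from that set are flagged.

```python
# Illustrative subset of top-level slurm.conf keys (lowercase); extend as needed.
# Keys that are also valid partition options (e.g. DefMemPerNode) are left out
# on purpose to avoid false positives on PartitionName lines.
TOP_LEVEL_KEYS = {
    "maxjobcount", "maxarraysize", "maxstepcount",
    "prolog", "epilog", "taskprolog", "taskepilog",
    "srunprolog", "srunepilog", "prologflags", "epilogmsgtime",
    "schedulerparameters", "schedulertype",
    "jobacctgatherfrequency", "debugflags",
}


def find_merged_directives(conf_text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs where a second top-level key shares a line."""
    problems = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        # Inspect every token after the first for a known top-level key.
        for token in stripped.split()[1:]:
            key, sep, _ = token.partition("=")
            if sep and key.lower() in TOP_LEVEL_KEYS:
                problems.append((number, stripped))
                break
    return problems
```

Run it against /tmp/current.txt before building the patch; an empty list means no merged lines were detected among the keys it knows about.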
Step 4: Verify the Change Took Effect
Within roughly 10 seconds the operator regenerates the ConfigMap, kubelet swaps the projected file atomically inside the controller pod, and the reconfigure sidecar runs scontrol reconfigure. Confirm each step:
# Sidecar fired successfully
kubectl logs <controller-pod> -n slurm -c reconfigure --tail=10
# Rendered file picked up your change
kubectl exec <controller-pod> -n slurm -c slurmctld -- cat /etc/slurm/slurm.conf | grep -A1 "USER SECTION"
# Live slurmctld config sees it
kubectl exec <controller-pod> -n slurm -c slurmctld -- scontrol show config | grep -i MaxJobCount
You should see:
- A fresh Reconfiguring Slurm... / SUCCESS entry in the sidecar log
- MaxJobCount=20000 in /etc/slurm/slurm.conf
- MaxJobCount = 20000 in scontrol show config
💡 Tip: If the sidecar log shows repeated Failed to reconfigure / Parse error entries, your edited slurm.conf has a syntax error. The error message includes the exact line number and content that failed to parse. The most common cause is two directives on one line.
Step 5: Know What You Can and Cannot Set Through extraConf
The Slinky operator owns a set of directives generated from the Controller and NodeSet specs. Anything in those categories will be overwritten on reconcile. Other Slurm directives — anything documented in the slurm.conf reference and not in the operator-owned list below — can be safely added via the USER SECTION.
Operator-Owned (Do Not Override From extraConf)
These directives are generated by the operator. Setting them in the USER SECTION may appear to work briefly, but will conflict with the operator-injected values and produce parse errors or undefined behavior:
- ClusterName, SlurmUser, SlurmctldHost, SlurmctldPort, SlurmdPort
- StateSaveLocation, SlurmdSpoolDir
- AuthType, CredType, AuthAltTypes (auth and JWT keys are sourced from Kubernetes Secrets)
- NodeSet=..., PartitionName=all Nodes=ALL ... (operator generates these from NodeSet CRDs)
- TopologyPlugin, TopologyParam (managed by Topograph)
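To avoid setting one of these by accident, you can scan your USER SECTION before patching. The sketch below is a minimal check under stated assumptions: find_operator_owned is a hypothetical helper, OPERATOR_OWNED simply mirrors the list above, and PartitionName is flagged only when it names the operator-managed all partition.

```python
# Lowercase names mirroring the operator-owned list in this article.
OPERATOR_OWNED = {
    "clustername", "slurmuser", "slurmctldhost", "slurmctldport", "slurmdport",
    "statesavelocation", "slurmdspooldir",
    "authtype", "credtype", "authalttypes",
    "nodeset", "topologyplugin", "topologyparam",
}


def find_operator_owned(user_section: str) -> list[str]:
    """Return the lines of a USER SECTION that set operator-owned directives."""
    flagged = []
    for line in user_section.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, rest = stripped.partition("=")
        if not sep:
            continue
        name = key.strip().lower()
        values = rest.split()
        first_value = values[0].lower() if values else ""
        # PartitionName is only operator-owned for the partition named "all".
        if name in OPERATOR_OWNED or (name == "partitionname" and first_value == "all"):
            flagged.append(stripped)
    return flagged
```

An empty result means none of the listed directives appear in the text you are about to patch in.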
Common Safe-to-Set Directives
The following list is illustrative, not exhaustive — most Slurm directives outside the operator-owned set above can be configured here:
- MaxJobCount, MaxArraySize, MaxStepCount
- Prolog, Epilog, TaskProlog, TaskEpilog, SrunProlog, SrunEpilog
- PrologFlags, EpilogMsgTime
- SchedulerParameters, SchedulerType, DefMemPerNode, DefMemPerCPU
- JobAcctGatherFrequency, additional GresTypes entries, DebugFlags
ℹ️ Note: Adding a second PartitionName=... line in your USER SECTION to create additional partitions alongside the operator-managed all partition is theoretically supported by Slurm, since the existing NodeSet can be referenced by multiple partitions. ⚠️ Verify that the operator does not overwrite or conflict with the user-added partition on reconcile before relying on it in production.
Example
A common real-world use case is wiring in Prolog and Epilog hooks for job-lifecycle logging. The USER SECTION ends up looking like this:
# ===== USER SECTION =====
MaxJobCount=20000
Prolog=/home/scripts/prolog.sh
Epilog=/home/scripts/epilog.sh
SchedulerParameters=defer,bf_continue
# ===== END USER SECTION =====
After patching, scontrol show config will show all four values as live.