One of the easiest ways to get started with submitting AI training jobs on Crusoe Cloud is by using our published SLURM solution. SLURM is a widely adopted workload manager used for job scheduling, resource allocation, and queueing in high-performance computing environments.
This solution provides a step-by-step guide to deploying a SLURM cluster on Crusoe using Terraform to provision infrastructure and Ansible playbooks to configure the environment. It also includes an optional Prometheus + Grafana stack for monitoring and observability.
Prerequisites
- Crusoe CLI configured
- Terraform installed
- Ansible installed locally
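Before starting, you can sanity-check that each tool is available. Output will vary by version, and crusoe whoami assumes your CLI is already authenticated and that your CLI version provides that subcommand:
# crusoe whoami
# terraform version
# ansible --version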
Getting Started
1. Clone the SLURM GitHub repo at https://github.com/crusoecloud/slurm
# git clone https://github.com/crusoecloud/slurm.git
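Then change into the cloned directory (assuming the default directory name from the clone) so the Terraform commands in later steps run against the repo's main.tf:
# cd slurm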
2. In the same directory as the main.tf file, create a terraform.tfvars file with the relevant metadata for your Crusoe project and deployment. You can gather the necessary information from either the UI or the CLI. For example, to list available locations, subnets, and VM types, run the following commands:
# crusoe locations list
# crusoe networking vpc-subnets list
# crusoe compute vms types
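As with the shared-disk example below, the list commands can emit JSON that you can filter with jq. For instance, to print just the subnet IDs (a sketch, assuming the -f json flag works for subnets the same way it does for disks):
# crusoe networking vpc-subnets list -f json | jq -r '.[].id'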
You can also optionally attach a pre-created shared disk to provide additional petabyte-scale storage alongside the default persistent /home directory, which is backed by NFS. To find the ID of your shared disk, you can use the CLI together with jq:
# crusoe storage disks list -f json | jq -r '.[] |select(.type |contains("shared")).id'
Once you have all the variables you need, fill out the terraform.tfvars file using the following template. Note that in this example we are provisioning two compute nodes of type h200-141gb-sxm-ib.8x. You can configure the head nodes and login nodes to be highly available, or increase their CPU size if desired. You can also add more SLURM users by filling out their details in the slurm_users variable.
# common configuration
location = "<crusoe-region>"
project_id = "<project-id>"
ssh_public_key_path = "~/path/to/public_key"
vpc_subnet_id = "<vpc-subnet-id>"
# head node
slurm_head_node_count = 1
slurm_head_node_type = "c1a.8x"
# login node
slurm_login_node_count = 1
slurm_login_node_type = "c1a.8x"
# nfs node
slurm_nfs_node_type = "s1a.80x"
slurm_nfs_home_size = "10240GiB"
# slurm-compute-node configuration
slurm_compute_node_type = "h200-141gb-sxm-ib.8x"
slurm_compute_node_ib_partition_id = "<ib-partition-id>"
slurm_compute_node_count = 2
slurm_shared_volumes = [{
  id          = "<shared-disk-id>"
  name        = "<name-of-shared-disk>"
  mode        = "read-write"
  mount_point = "/data"
}]
# slurm users configuration
slurm_users = [{
  name       = "user1"
  uid        = 1001
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDQD5doToJjyyq0BH8TDlHZqqVy+kZpuGgJP5gbDanpF user1@example.com"
}]
# observability
# enable_observability = true
# grafana_admin_password = "admin123"
Note: If you want to enable observability, uncomment the last two lines of the template. You will also need to create four firewall rules for metrics and dashboard access (a CLI sketch follows the list):
- Inbound TCP port 3000 for Grafana dashboard access
- Inbound TCP port 9400 for the NVIDIA DCGM Exporter
- Inbound TCP port 9090 for Prometheus access
- Inbound TCP port 9100 for the Node Exporter
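You can create these rules in the UI, or with the CLI's crusoe networking firewall-rules create subcommand. The flag names in this sketch are assumptions for illustration, so confirm them with crusoe networking firewall-rules create --help before running it:
# Loop over the four observability ports; the flag names below are assumed, not verified.
for port in 3000 9400 9090 9100; do
  crusoe networking firewall-rules create \
    --name "slurm-obs-${port}" \
    --direction ingress \
    --protocols tcp \
    --destination-ports "${port}"
done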
3. Once terraform.tfvars is in place, run the standard Terraform workflow to provision the resources and set up your cluster:
# terraform init
# terraform plan
# terraform apply
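When you need the login node's public IP for the next step, one option (assuming you don't capture it from the Terraform outputs) is to list the project's VMs with the CLI:
# crusoe compute vms list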
4. It will take a few minutes for Terraform and Ansible to complete. Once finished, SSH to the login node and run sinfo to verify the environment is up; you should see your compute nodes in an idle state, ready to receive jobs via srun or sbatch.
ubuntu@slurm-login-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 idle slurm-compute-node-[0-1]
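You can also submit a minimal batch job to confirm end-to-end scheduling (a quick sketch; the batch partition name is taken from the sinfo output above, and hello.sbatch is just an illustrative filename):
# cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
# sbatch hello.sbatch
The job should print one hostname per compute node to hello_<jobid>.out.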
5. If you are using InfiniBand, you can run a simple two-node NCCL test to validate inter-node RDMA performance:
srun -N 2 --ntasks-per-node=8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
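In the test output, the busbw (bus bandwidth) column is the key number: it should grow with message size and, at the larger sizes, approach a healthy fraction of the node's InfiniBand line rate, while the #wrong column (present when data validation is enabled) should stay at 0. Persistently low bandwidth or validation errors point to an RDMA or fabric issue worth investigating.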
Conclusion
With Terraform and Ansible complete, you now have a working SLURM cluster on Crusoe Cloud: an NFS-backed /home directory, optional petabyte-scale shared storage, and optional Prometheus + Grafana observability, ready to accept AI training jobs via srun and sbatch.
Additional Resources
How to recover SLURM nodes from a drain state
How to recover ssh access when SLURM NFS is down