One of the easiest ways to get started with submitting AI training jobs on Crusoe Cloud is by using our published SLURM solution. SLURM is a widely adopted workload manager used for job scheduling, resource allocation, and queueing in high-performance computing environments.
This solution provides a step-by-step guide to deploying a SLURM cluster on Crusoe using Terraform to provision infrastructure and Ansible playbooks to configure the environment. It also includes an optional Prometheus + Grafana stack for monitoring and observability.
Prerequisites
- Crusoe CLI configured
- Terraform installed
- Ansible installed locally
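Before starting, you can sanity-check that each tool is available. Output will vary by version, and crusoe whoami assumes your CLI is already authenticated and that your CLI version provides that subcommand:
# crusoe whoami
# terraform version
# ansible --version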
Getting Started
1. Clone the SLURM GitHub repo at https://github.com/crusoecloud/slurm
# git clone https://github.com/crusoecloud/slurm.git
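Then change into the cloned directory (assuming the default directory name from the clone) so the Terraform commands in later steps run against the repo's main.tf:
# cd slurm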
2. In the same directory as the main.tf file, create a terraform.tfvars file with the relevant metadata for your Crusoe project and deployment. You can gather the necessary information from either the UI or the CLI. For example, to list available locations, subnets, and VM types, run the following commands:
# crusoe locations list
# crusoe networking vpc-subnets list
# crusoe compute vms types
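As with the shared-disk example below, the list commands can emit JSON that you can filter with jq. For instance, to print just the subnet IDs (a sketch, assuming the -f json flag works for subnets the same way it does for disks):
# crusoe networking vpc-subnets list -f json | jq -r '.[].id'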
You can also optionally attach a pre-created shared disk to provide additional petabyte-scale storage alongside the default persistent /home directory, which is backed by NFS. To find the ID of your shared disk, you can use the CLI together with jq:
# crusoe storage disks list -f json | jq -r '.[] |select(.type |contains("shared")).id'
Once you have all the variables you need, fill out the terraform.tfvars file using the following template. Note that in this example we are provisioning two compute nodes of type h200-141gb-sxm-ib.8x. You can configure the head nodes and login nodes to be highly available, or increase their CPU size if desired. You can also add more SLURM users by filling out their details in the slurm_users variable.
# common configuration
location = "<crusoe-region>"
project_id = "<project-id>"
ssh_public_key_path = "~/path/to/public_key"
vpc_subnet_id = "<vpc-subnet-id>"
# head node
slurm_head_node_count = 1
slurm_head_node_type = "c1a.8x"
# login node
slurm_login_node_count = 1
slurm_login_node_type = "c1a.8x"
# nfs node
slurm_nfs_node_type = "s1a.80x"
slurm_nfs_home_size = "10240GiB"
# slurm-compute-node configuration
slurm_compute_node_type = "h200-141gb-sxm-ib.8x"
slurm_compute_node_ib_partition_id = "<ib-partition-id>"
slurm_compute_node_count = 2
slurm_shared_volumes = [{
  id          = "<shared-disk-id>"
  name        = "<name-of-shared-disk>"
  mode        = "read-write"
  mount_point = "/data"
}]
# slurm users configuration
slurm_users = [{
  name       = "user1"
  uid        = 1001
  ssh_pubkey = "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIDQD5doToJjyyq0BH8TDlHZqqVy+kZpuGgJP5gbDanpF user1@example.com"
}]
# observability
# enable_observability = true
# grafana_admin_password = "admin123"
Note: If you want to enable observability, uncomment the last two lines of the template. You will also need to create four firewall rules for metrics and dashboard access (a CLI sketch follows the list):
- Inbound TCP port 3000 for Grafana dashboard access
- Inbound TCP port 9400 for the NVIDIA DCGM Exporter
- Inbound TCP port 9090 for Prometheus access
- Inbound TCP port 9100 for the Node Exporter
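You can create these rules in the UI, or with the CLI's crusoe networking firewall-rules create subcommand. The flag names in this sketch are assumptions for illustration, so confirm them with crusoe networking firewall-rules create --help before running it:
# Loop over the four observability ports; the flag names below are assumed, not verified.
for port in 3000 9400 9090 9100; do
  crusoe networking firewall-rules create \
    --name "slurm-obs-${port}" \
    --direction ingress \
    --protocols tcp \
    --destination-ports "${port}"
done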
3. Once terraform.tfvars is in place, run the standard Terraform workflow to provision the resources and set up your cluster:
# terraform init
# terraform plan
# terraform apply
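When you need the login node's public IP for the next step, one option (assuming you don't capture it from the Terraform outputs) is to list the project's VMs with the CLI:
# crusoe compute vms list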
4. It will take a few minutes for Terraform and Ansible to complete. Once finished, SSH to the login node and run sinfo to verify the environment is up; you should see your compute nodes in an idle state, ready to receive jobs via srun or sbatch.
ubuntu@slurm-login-node-0:~$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
batch* up infinite 2 idle slurm-compute-node-[0-1]
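You can also submit a minimal batch job to confirm end-to-end scheduling (a quick sketch; the batch partition name is taken from the sinfo output above, and hello.sbatch is just an illustrative filename):
# cat > hello.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=batch
#SBATCH --nodes=2
#SBATCH --output=hello_%j.out
srun hostname
EOF
# sbatch hello.sbatch
The job should print one hostname per compute node to hello_<jobid>.out.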
5. If you are using InfiniBand, you can run a simple two-node NCCL test to validate inter-node RDMA performance:
srun -N 2 --ntasks-per-node=8 --cpus-per-task=22 --mpi=pmix /opt/nccl-tests/build/all_reduce_perf -b 1M -e 1G -f 2 -g 1
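In the test output, the busbw (bus bandwidth) column is the key number: it should grow with message size and, at the larger sizes, approach a healthy fraction of the node's InfiniBand line rate, while the #wrong column (present when data validation is enabled) should stay at 0. Persistently low bandwidth or validation errors point to an RDMA or fabric issue worth investigating.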
Conclusion
With Terraform and Ansible complete, you now have a working SLURM cluster on Crusoe Cloud: an NFS-backed /home directory, optional petabyte-scale shared storage, and optional Prometheus + Grafana observability, ready to accept AI training jobs via srun and sbatch.
Additional Resources
How to recover SLURM nodes from a drain state
How to recover ssh access when SLURM NFS is down