
Deprecation notice

As of May 14, 2024, we have deprecated this solution in favor of github.com/crusoecloud/slurm. It will remain available in read-only mode; however, we are no longer able to provide ongoing maintenance for it.

Create an Autoscale-enabled SLURM Cluster on Crusoe

This is a reference design implementation of SLURM on Crusoe Cloud. It supports multiple partitions and specific nodegroups within those partitions, as well as a cluster autoscaler that provisions instances on Crusoe based on demand. The Terraform script main.tf is the main entry point: it provisions only the headnode and, using the SLURM power saving plugin, starts additional compute nodes in response to jobs submitted to the headnode.
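As an illustration of the autoscaling behavior, submitting a job to the headnode is enough to trigger provisioning; the partition name and node counts below are placeholders, not names defined by this repository.

# Hypothetical example: submit a two-node job to an idle partition. The power
# saving hooks resume (provision) the required Crusoe instances, run the job,
# and suspend (delete) them again after the idle timeout.
sbatch --partition=batch --nodes=2 --wrap "srun hostname"

# Watch nodes move from idle~ (powered down) to allocated, and the job start.
sinfo --partition=batch
squeue -u $USER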

Description of the Architecture

The Terraform script simply provisions a headnode; the headnode-bootstrap.sh script then performs the following:

  1. Scans the instance for local ephemeral NVMe drives. If more than one drive is present, they are assembled into a RAID0 array mounted at /raid0; instances with a single local NVMe ephemeral drive mount it at /nvme. The scratch directory lives inside that mount point (see the verification sketch after this list).
  2. Sets up an NFS server exporting /nfs/slurm, which provides the SLURM binaries, libraries, and helper code to the ephemeral compute nodes.
  3. Downloads and installs the SLURM source tree. The SLURM version is pinned by the bootstrap script to ensure it is supported on Crusoe. Changing the version in the repo is NOT supported unless it is validated by Crusoe.
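After the headnode finishes bootstrapping, these mounts can be sanity-checked with standard tools; this is only an illustrative sketch, not part of the bootstrap script itself.

# Inspect the local ephemeral storage layout (RAID0 at /raid0, or a single
# NVMe drive at /nvme, depending on the instance type).
lsblk
cat /proc/mdstat            # shows the md RAID0 array, if one was assembled
df -h /raid0 /nvme 2>/dev/null

# Confirm the NFS export that serves SLURM to the compute nodes.
df -h /nfs/slurm
sudo exportfs -v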

Support for NVIDIA Enroot/Pyxis

The deployment includes support for enroot and Pyxis, which are purpose-built to provide native container orchestration within SLURM so container images can be run across the cluster. All enroot images are stored on the /scratch directory of each node in the cluster. Credentials for accessing various registries can be added by editing a $HOME/enroot/.credentials file.
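As a usage sketch (the image name and registry below are placeholders, not defaults shipped by this repository), Pyxis adds container flags to srun, and enroot reads registry credentials in netrc format:

# Run a command inside a container image pulled by enroot via Pyxis.
srun --container-image=ubuntu:22.04 cat /etc/os-release

# Mount the node-local scratch space into the container.
srun --container-image=ubuntu:22.04 --container-mounts=/scratch:/scratch ls /scratch

# Append netrc-style registry credentials (values here are placeholders).
cat <<'EOF' >> $HOME/enroot/.credentials
machine nvcr.io login $oauthtoken password <NGC_API_KEY>
EOF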

Monitoring

The headnode hosts a Telegraf-Prometheus-Grafana (TPG) stack; each worker runs Telegraf and exposes a /metrics endpoint that the headnode's Prometheus scrapes.

[Screenshots: Grafana heatmap and metrics dashboards]
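To spot-check the pipeline, the worker metrics endpoint and the Grafana UI can be queried directly; the ports below are the Telegraf and Grafana defaults and are assumptions, not values confirmed against this deployment.

# Telegraf's prometheus_client output listens on port 9273 by default.
curl -s http://<worker-ip>:9273/metrics | head

# Grafana defaults to port 3000 on the headnode.
curl -sI http://<headnode-ip>:3000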

Deployment

Step 1. Install Terraform. On the client machine from which you will deploy the cluster headnode, install Terraform following the official HashiCorp instructions.
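For example, on macOS the HashiCorp Homebrew tap can be used (other platforms follow the same official instructions); the version check at the end confirms the install.

brew tap hashicorp/tap
brew install hashicorp/tap/terraform
terraform -version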

Step 2. Install the Crusoe Cloud CLI. Install the Crusoe Cloud CLI following the Crusoe documentation, then set up authentication by creating SSH keys and API tokens.
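SSH key creation uses standard OpenSSH tooling; the key path and comment below are examples, and the API token itself is created via the Crusoe console or CLI as described in the linked instructions.

# Generate an ed25519 keypair to register with Crusoe (path/comment are examples).
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -C "you@example.com"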

Step 3. Clone the repo and create a variables.tf file

git clone https://github.com/crusoecloud/crusoe-hpc-slurm.git
cd crusoe-hpc-slurm

Your variables.tf should contain the following:

variable "access_key" {
   description = "Crusoe API Access Key"
   type        = string
   default     = "<ACCESS_KEY>"
 }
variable "secret_key" {
   description = "Crusoe API Secret Key"
   type        = string
   default     = "<SECRET_KEY>"
 }
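As an alternative to hardcoding credentials in default values, Terraform reads variables from TF_VAR_-prefixed environment variables; this is standard Terraform behavior rather than anything specific to this repository.

export TF_VAR_access_key="<ACCESS_KEY>"
export TF_VAR_secret_key="<SECRET_KEY>"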

Step 4. In the main.tf file, replace the local values: provide a path to your private SSH key and the string of your public key, and choose an instance type for the headnode.

locals {
  my_ssh_privkey_path="/Users/amrragab/.ssh/id_ed25519"
  my_ssh_pubkey="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIIdc3Aaj8RP7ru1oSxUuehTRkpYfvxTxpvyJEZqlqyze amrragab@MBP-Amr-Ragab.local"
  headnode_instance_type="a100-80gb.1x"
}

Step 5. Execute the Terraform script:

terraform init
terraform plan
terraform apply
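When the cluster is no longer needed, the usual Terraform teardown applies; note that compute nodes started by the autoscaler outside of Terraform's state may need to be removed separately (this caveat is an assumption, not documented behavior of this repository).

terraform destroy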
