
Simple Linux Utility for Resource Management (SLURM)

  • Introduction to Slurm
  • Connecting to the cluster
  • sinfo command (nodes and partition information)
  • Interaction with Slurm
    • batch jobs (job scripts)
    • interactive jobs
  • Other Slurm commands (managing jobs)
  • Examples
  • Slurm job arrays

Slurm

Slurm is a job scheduler and resource manager: it allocates a cluster's resources and schedules jobs onto them.

Slurm manages:

  • Time
  • CPU cores
  • Memory
  • GPUs
  • Nodes
  • Jobs

A user asks Slurm for certain resources and describes the work to be done. Slurm schedules the job, allocates the resources, and reports the job status back to the user.
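
As a minimal sketch of that exchange (the partition name main is an assumption; use sinfo, shown below, to list the real ones):

# ask for 1 CPU core, 1 GB of memory and 10 minutes, and hand over `hostname` as the work
sbatch --partition=main --cpus-per-task=1 --mem=1G --time=00:10:00 --wrap="hostname"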

Connecting to the cluster

You can connect to the cluster using ssh in a terminal, but this limits what you can do in terms of writing code.

ssh username@ozerlabs
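
If you connect often, an entry in ~/.ssh/config saves typing. A sketch with hypothetical values; substitute your site's host name and your username:

# ~/.ssh/config -- HostName and User below are placeholders
Host ozerlabs
    HostName ozerlabs.example.edu
    User username

After that, a plain `ssh ozerlabs` is enough.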

sinfo command

# Get information about partitions, their nodes, and resources
# PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
sinfo

## all nodes & partitions
sinfo --all

## all nodes & partitions with more details
sinfo --all --long

# specific partition
sinfo --partition=main

# specific node
sinfo --node=nova101

# useful info
sinfo -Nel

# custom output format (-o takes a format string)
sinfo -a -o "%P %D %N %G"

# $partition, $hostlist and $statelist are placeholders for your own filters
sinfo -h -N --partition=$partition --nodes=$hostlist --states=$statelist -O "NodeList,Partition,CPUs,Memory,Gres,GresUsed,Cluster,User"


# get more help
sinfo --help

man sinfo # useful

Interaction with Slurm

There are two ways to interact with Slurm:

  • batch jobs, submitted with the sbatch command
  • interactive jobs, started with salloc

sbatch command

The sbatch command submits a job to Slurm. The job is described in a job script.

sbatch [options] myJobScript.sh

In the job script we describe the resources we need using #SBATCH directives. Refer to the Script generator for a quick start.

#!/bin/bash

#SBATCH --job-name=jobName
#SBATCH --account=users
#SBATCH --partition=main
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
##SBATCH --ntasks=4
# (a leading ## disables a directive)
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --time=00:01:00
#SBATCH --output=jobName.out
#SBATCH --error=jobName.err

# load modules
module load python/3.7.3

# environment variables
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

# run the program
srun python3 myProgram.py

# program with arguments
srun python3 myProgram.py arg1 arg2 arg3
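
Submitting the script and following it in the queue:

sbatch myJobScript.sh   # Slurm replies: Submitted batch job <jobID>
squeue -u $USER         # check the job's state in the queue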

salloc

salloc allocates resources for an interactive session (a shell).

# salloc: request an allocation; once granted, a shell opens inside it
salloc -n 1 -t 3:00:00 --mem-per-cpu=3G

# sinteractive (site-specific wrapper, where available)
sinteractive -N 1 -n 1 --nodelist=nodename --gres=gpu:1 -J int_jobs_name
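
srun with --pty is another standard way to get an interactive shell; a short sketch of a session:

# request a shell on a compute node with the same resources as above
srun -n 1 -t 3:00:00 --mem-per-cpu=3G --pty bash

hostname   # now runs on the compute node
exit       # leaving the shell releases the allocation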

Other Slurm commands (managing jobs)

# show all jobs
squeue

# user specific
squeue -u username

# jobs in a specific partition
squeue -p main

# jobs in a specific state
squeue -t running

## cancelling jobs
scancel jobID

# cancel all of a user's jobs
scancel -u username

# cancel a user's jobs in a specific state
scancel -u username -t running

## get help
man scancel
man squeue
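
# keep an eye on your own jobs without retyping (refresh every 10 s)
watch -n 10 squeue -u $USER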

## sacct
# show job accounting information
sacct

# user specific
sacct -u username

# in a specific time range
sacct -u username -S 2020-01-01 -E 2020-01-31
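
# pick specific fields for a finished job (jobID is a placeholder);
# MaxRSS is handy for sizing --mem on the next run
sacct -j jobID --format=JobID,JobName,Partition,Elapsed,MaxRSS,State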


## admin info

## scontrol
# show details of nodes, partitions, or a job
scontrol show nodes|partition|job

# sacctmgr
sacctmgr list user

# sreport
sreport cluster AccountUtilizationByUser start=2020-01-01 end=2020-01-31
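
scontrol can also manage jobs in the queue; a short sketch (jobID is a placeholder, and raising a time limit typically requires administrator rights):

# hold a pending job, then release it
scontrol hold jobID
scontrol release jobID

# change a job's time limit
scontrol update JobId=jobID TimeLimit=02:00:00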

Examples

A single-CPU job:

#!/bin/bash

#SBATCH --job-name=singlecpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# Your script goes here
sleep 30
echo "hello"

Two single-CPU tasks:

#!/bin/bash

#SBATCH --job-name=singlecputasks
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1

# Your script goes here
srun --ntasks=1 echo "I'm task 1"
srun --ntasks=1 echo "I'm task 2"

A multithreaded job (one task, many cores):

#!/bin/bash

#SBATCH --job-name=multithreaded
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Your script goes here
mycommand --threads 8

Several multithreaded tasks across nodes:

#!/bin/bash

#SBATCH --job-name=multithreadedtasks
#SBATCH --nodes=4
#SBATCH --ntasks=4
#SBATCH --cpus-per-task=4

# Your script goes here
srun --ntasks=1 mycommand1 --threads 4
srun --ntasks=1 mycommand2 --threads 4
srun --ntasks=1 mycommand3 --threads 4
srun --ntasks=1 mycommand4 --threads 4

A simple MPI job:

#!/bin/bash

#SBATCH --job-name=simplempi
#SBATCH --ntasks=16

# Your script goes here
mpirun myscript

An MPI job spread over nodes:

#!/bin/bash

#SBATCH --job-name=nodempi
#SBATCH --ntasks=16
#SBATCH --ntasks-per-node=8

# Your script goes here
mpirun myscript
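
Slurm job arrays

A job array submits many similar jobs from a single script; each task gets its own index in SLURM_ARRAY_TASK_ID. A minimal sketch (the input file names are hypothetical):

#!/bin/bash

#SBATCH --job-name=arrayjob
#SBATCH --array=0-9
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

# each of the 10 array tasks processes its own (hypothetical) input file
srun python3 myProgram.py input_${SLURM_ARRAY_TASK_ID}.txt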

Most Used Commands

sinfo

sinfo

sinfo -Nel

sinfo -a -o "%P %D %N %G"

sinfo -h -N --partition=$partition --nodes=$hostlist --states=$statelist -O "NodeList,Partition,CPUs,Memory,Gres,GresUsed"

squeue

squeue

sacct

sacct

scancel

scancel jobID
