Writing slurm scripts
Uppmax uses the job scheduler SLURM to distribute jobs for execution on a high-performance computing cluster. Submitting scripts to SLURM on Uppmax is the same regardless of which host you are connected to, e.g. Rackham or Bianca.
Code snippets are given on this wiki to aid users in running tools, and many are written specifically with SLURM in mind. They will often contain the variables $SLURM_NPROCS or $SLURM_ARRAY_TASK_ID.
For example:
#!/usr/bin/env bash
# Load the FastQC module on Uppmax
module load bioinfo-tools FastQC
# Number of cores allocated by SLURM (falls back to 6 if unset)
CPUS="${SLURM_NPROCS:-6}"
# Index of this array task, set by SLURM when the job is submitted with -a
JOB=$SLURM_ARRAY_TASK_ID
# Directory containing the compressed fastq files
DATA_DIR=/path/to/reads
# Array of all R1 fastq files in the data directory, indexed from 0
FILES=( "$DATA_DIR"/*_R1.fastq.gz )
# Select the R1 file for this array task
FASTQ=${FILES[$JOB]}
# Run FastQC on the R1 file and its matching R2 file
fastqc -t "$CPUS" "$FASTQ" "${FASTQ/_R1./_R2.}"
They are often written with the intention of creating one task per set of files to be analysed. For example, say you want to use fastqc on each pair of fastq files in your rawdata directory.
This example has three sample pairs to analyse (SampleX, SampleY, SampleZ).
$ ls /proj/myproject/rawdata/
SampleX_R1.fastq.gz SampleX_R2.fastq.gz SampleY_R1.fastq.gz SampleY_R2.fastq.gz
SampleZ_R1.fastq.gz SampleZ_R2.fastq.gz
The code snippet above treats each sample as a separate task to be submitted. Specifically, the line FILES=( "$DATA_DIR"/*_R1.fastq.gz ) makes an array containing the paths to all the R1 fastq files, indexed from 0. Therefore, ${FILES[0]} contains the string /proj/myproject/rawdata/SampleX_R1.fastq.gz, ${FILES[1]} contains the string /proj/myproject/rawdata/SampleY_R1.fastq.gz, and so on.
In order to run the code snippet above via the SLURM job scheduler, the code should be copied and pasted into a plain text file and saved as something like my_script.sh. For the code snippet above, a sensible name would be run_fastqc.sh, for example.
The code in the script is then modified to suit your data. For example, the line DATA_DIR=/path/to/reads should be changed to point to the directory where the compressed fastq files to be analysed are stored.
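For the example rawdata directory above, that line would become:
DATA_DIR=/proj/myproject/rawdata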
The next step is to make sure your script runs through. You can do this by testing on the first file pair, for example SampleX (better practice is to make a short test dataset, for example by subsampling SampleX; see the sketch below).
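As a minimal sketch of the latter approach (assuming the rawdata directory from the example above, and that the first 10 000 read pairs are enough for a test; the test_data directory name is just an illustration), a small test pair can be made like this:
mkdir -p /proj/myproject/test_data
# A fastq record is 4 lines, so 40 000 lines = 10 000 reads from each file of the pair
zcat /proj/myproject/rawdata/SampleX_R1.fastq.gz | head -n 40000 | gzip > /proj/myproject/test_data/SampleTest_R1.fastq.gz
zcat /proj/myproject/rawdata/SampleX_R2.fastq.gz | head -n 40000 | gzip > /proj/myproject/test_data/SampleTest_R2.fastq.gz
Note that taking the first reads is not a random subsample; a tool such as seqtk sample can be used if a random subset is needed.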
Submit script to SLURM:
sbatch -A snic2019ABC -n 2 -t 02:00:00 -a 0 run_fastqc.sh
# -A is the name of the SNIC project (required).
# -n is the number of cores to use - check the cluster-specific documentation to see the maximum.
# -t is the maximum time the task should run for (the default is short, so set this appropriately).
# -a is the array index of the file to run.
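Alternatively, the same options can be set as #SBATCH directives at the top of the script itself, so the job can be submitted with just sbatch run_fastqc.sh. A minimal sketch, reusing the placeholder project name snic2019ABC from above (options given on the command line override these directives):
#!/usr/bin/env bash
#SBATCH -A snic2019ABC
#SBATCH -n 2
#SBATCH -t 02:00:00
#SBATCH -a 0
module load bioinfo-tools FastQC
# ... rest of the script as above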
This runs a SLURM task for SampleX only, since -a 0 sets $SLURM_ARRAY_TASK_ID to 0, and the line FASTQ=${FILES[$JOB]} therefore selects the file at position 0 in the FILES array, i.e. SampleX_R1.fastq.gz.
This snippet, when run on the command line (after setting DATA_DIR), will show you the array index of each file you're submitting:
FILES=( "$DATA_DIR"/*_R1.fastq.gz )
paste <( printf "Index: %d\n" "${!FILES[@]}" ) <( printf "%s\n" "${FILES[@]}" )
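For the three-sample example above, the output would look something like this:
Index: 0    /proj/myproject/rawdata/SampleX_R1.fastq.gz
Index: 1    /proj/myproject/rawdata/SampleY_R1.fastq.gz
Index: 2    /proj/myproject/rawdata/SampleZ_R1.fastq.gz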
Once you're happy your code runs smoothly for one sample (or test data), you can submit the rest of the files.
sbatch -A snic2019ABC -n 2 -t 02:00:00 -a 1-2 run_fastqc.sh
# -a is changed to 1-2 since only the files at indices 1-2 are left to run, i.e. SampleY and SampleZ
# -a can take ranges and/or comma separated values, e.g. it can look something like -a 0-4,9-11,13,15
You can check whether your jobs are queued or running using the command:
squeue -u $USER
On Uppmax clusters specifically, a helper script is installed that gives similar information:
jobinfo -u $USER