This repository demonstrates the submission of Slurm jobs with an `*.sbatch` file or the Submitit Python tool. To this end, we provide a PyTorch Lightning MNIST example.
To submit the example in `src/train.py`, follow these steps:
- Install the Python packages with `pip3 install -r requirements.txt`
- Submit the job to Slurm with `sbatch slurm_submit.sbatch`
The submission configuration is located at the top of the `*.sbatch` script. Further configuration options can be found in the official `sbatch` documentation.
The Submitit tool provides a Python interface for Slurm job submission and is particularly convenient for multi-GPU jobs. The example from this tutorial can be submitted with the following command:
```bash
python src/slurm_submit.py \
    --ngpus 1 \
    --cluster slurm \
    --output_dir logs/mnist_example
```
Setting `--cluster debug` starts the job locally and allows you to test and debug your code. The log files are written to `/storage/slurm/{USER}/runs`. Submitit creates `*.sbatch` scripts as well as log and error files for each GPU.
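For reference, submitting through Submitit from Python typically follows the pattern sketched below. This is not the actual content of `src/slurm_submit.py`; the training function, partition name, and resource values are assumptions for illustration:

```python
import submitit


def train():
    # Stand-in for the real training entry point (e.g. the code in src/train.py).
    print("training ...")


# The executor writes the generated *.sbatch scripts, log and error files here.
executor = submitit.AutoExecutor(folder="logs/mnist_example")

# Resource settings that Submitit translates into #SBATCH directives.
executor.update_parameters(
    slurm_partition="gpu",  # assumed partition name
    gpus_per_node=1,
    tasks_per_node=1,
    cpus_per_task=4,
    timeout_min=60,
)

job = executor.submit(train)  # submits the function as a Slurm job
print(job.job_id)             # job.result() would block until the job finishes
```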
Slurm supports job preemption, i.e., stopping one or more low-priority jobs to let a high-priority job run. When a high-priority job is allocated resources that are already in use by one or more preemptable low-priority jobs, those jobs are preempted (see the official Slurm webpage).
The PyTorch Lightning framework already handles most of the bookkeeping (model saving, logging, and resuming) on its own. For this example, however, we briefly demonstrate how a codebase can be adapted to check for an existing checkpoint and reload it after preemption, i.e., on restart.
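As an illustration only (not the exact code of this repository), a minimal sketch of such a check with PyTorch Lightning could look as follows; the checkpoint directory and trainer settings are assumptions:

```python
import os
import pytorch_lightning as pl

CKPT_DIR = "logs/mnist_example"                  # assumed checkpoint directory
LAST_CKPT = os.path.join(CKPT_DIR, "last.ckpt")  # written by save_last=True


def fit(model, datamodule):
    # Always keep a "last" checkpoint so a requeued job can pick up where it stopped.
    checkpoint_cb = pl.callbacks.ModelCheckpoint(dirpath=CKPT_DIR, save_last=True)
    trainer = pl.Trainer(max_epochs=10, callbacks=[checkpoint_cb])

    # If the job was preempted and requeued, resume from the existing checkpoint.
    resume_path = LAST_CKPT if os.path.exists(LAST_CKPT) else None
    trainer.fit(model, datamodule=datamodule, ckpt_path=resume_path)
```

With recent Lightning versions the resume path is passed via `ckpt_path`; older versions used the `resume_from_checkpoint` argument of the `Trainer`.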
To manually trigger a preemption, execute `scontrol requeue job_id`. The job will be requeued and, once it restarts, should resume training. Use this to test preemption and to familiarize yourself with the example code provided with this tutorial.
Each project has a unique structure with different frameworks, visualization tools (Visdom vs. TensorBoard), and ways of loading configuration parameters. You therefore need to adapt your code accordingly. For a full resumption of your training, you should consider restoring the following:
- Model state
- Optimizer and scheduler states
- Number of epochs
- Visualization
We might add examples for specific resumption scenarios.
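If your project does not rely on Lightning's built-in checkpointing, a minimal, framework-agnostic sketch along the following lines can serve as a starting point; the file name and the set of saved objects are placeholders:

```python
import os
import torch

CKPT = "checkpoint.pth"  # hypothetical checkpoint file


def save_state(model, optimizer, scheduler, epoch):
    # Bundle every state needed for a full resumption into a single file.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "epoch": epoch,
        },
        CKPT,
    )


def load_state(model, optimizer, scheduler):
    # Returns the epoch to continue from (0 if no checkpoint exists yet).
    if not os.path.exists(CKPT):
        return 0
    state = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    scheduler.load_state_dict(state["scheduler"])
    return state["epoch"] + 1
```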
- Show the GPUs in use per node (each of our three partitions lists every node separately): `sinfo -N -O nodelist,gresused:100`
- Direct node/GPU access (for debugging only) with the `--pty` flag, for example `srun --pty --nodelist=node13 --gres=gpu:1 zsh`.