This repository has been archived by the owner on Oct 19, 2024. It is now read-only.

Commit

Added distributed training docs
EricMarcus-ai committed Feb 14, 2024
1 parent b10feba commit 001614e
Showing 3 changed files with 57 additions and 2 deletions.
5 changes: 3 additions & 2 deletions docs/conf.py
@@ -77,7 +77,7 @@

# General information about the project.
project = "ahcore"
copyright = "2023, ahcore contributors"
copyright = "2024, ahcore contributors"
author = "AI for Oncology"

# The version info for the project you're documenting, acts as replacement for
@@ -143,7 +143,8 @@
# documentation.
#
html_theme_options = {
"repository_url": "https://github.com/NKI-AI/ahcore.git",
"repository_url": "https://github.com/NKI-AI/ahcore",
"path_to_docs": "/docs",
"repository_branch": "main",
"use_issues_button": True,
"use_edit_page_button": True,
53 changes: 53 additions & 0 deletions docs/distributed.rst
@@ -0,0 +1,53 @@
Distributed Training / Inference
================================

Ahcore is fully compatible with distributed training. Here we show the basic commands to get started with multi-GPU training.
Since ahcore is built on Lightning, distributed behavior is controlled through the ``Trainer`` configuration. For example, we can use a setup similar to the ``default_ddp.yaml`` provided in the standard ahcore configs:

.. code-block:: yaml

    _target_: pytorch_lightning.Trainer
    accelerator: gpu
    devices: 2
    num_nodes: 1
    max_epochs: 1000
    strategy: ddp
    precision: 32

This configuration executes on 2 GPUs (``devices: 2``) on a single node using Distributed Data Parallel (DDP). Launching from the command line can be done with, e.g., the torch distributed launcher:

.. code-block:: bash

    python -m torch.distributed.launch --nproc_per_node=2 --use_env /.../ahcore/tools/train.py data_description=something lit_module=your_module trainer=default_ddp

Note that a plain ``python`` invocation without a distributed launcher may only detect a single GPU!
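On recent PyTorch releases, ``torch.distributed.launch`` has been deprecated in favor of ``torchrun`` (available since PyTorch 1.10), which exports the rank and world-size environment variables itself, so the ``--use_env`` flag is no longer needed. A sketch of the equivalent invocation:

.. code-block:: bash

    # torchrun replaces torch.distributed.launch and exports
    # RANK/LOCAL_RANK/WORLD_SIZE itself, so --use_env is dropped.
    torchrun --nproc_per_node=2 /.../ahcore/tools/train.py \
        data_description=something lit_module=your_module trainer=default_ddp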

More commonly, distributed ahcore jobs are launched through SLURM with an sbatch file, for instance:

.. code-block:: bash

    #!/bin/bash
    #SBATCH --job-name=train_ahcore_distributed
    #SBATCH --output=%x_%j.out
    #SBATCH --error=%x_%j.err
    #SBATCH --partition=your_partition
    #SBATCH --qos=your_qos
    #SBATCH --tasks-per-node=2   # Set equal to the number of GPUs; see comments below
    #SBATCH --gres=gpu:2
    #SBATCH --cpus-per-task=16   # Will be multiplied by tasks-per-node
    #SBATCH --mem=100G           # Adjust memory to your requirements
    #SBATCH --time=12:00:00      # Adjust the maximum time (HH:MM:SS)

    # Activate your virtual environment if needed
    source activate /path/to/your/env/

    # Run the training script using srun -- see comments below
    srun python /.../ahcore/tools/train.py \
        data_description=something \
        lit_module=your_module \
        trainer=default_ddp

A few subtleties here: ``--tasks-per-node`` is required for proper communication between Lightning and SLURM, and it must be set equal to the number of GPUs per node; see the `Lightning SLURM environment plugin <https://github.com/Lightning-AI/pytorch-lightning/blob/1d04c10e2d26c6097794379f44426cfd78bbd1f1/src/lightning/fabric/plugins/environments/slurm.py#L165/>`_.
Furthermore, the ``python`` command is preceded by ``srun``, which ensures that the distributed environment is set up properly; without it, the code may hang (deadlock) while initializing the different processes.
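To catch this misconfiguration early, a small sanity check can be added to the sbatch script just before the ``srun`` line. This is a sketch that assumes SLURM exports ``SLURM_NTASKS_PER_NODE`` and ``SLURM_GPUS_ON_NODE`` for the allocation (both are treated as optional below, so outside a SLURM job the check passes trivially):

.. code-block:: bash

    # Fail fast if --tasks-per-node does not match the number of GPUs per node:
    # Lightning's SLURM plugin expects exactly one task (process) per GPU.
    ntasks="${SLURM_NTASKS_PER_NODE:-1}"
    ngpus="${SLURM_GPUS_ON_NODE:-$ntasks}"   # falls back to ntasks outside SLURM
    if [ "$ntasks" -ne "$ngpus" ]; then
        echo "tasks-per-node ($ntasks) != GPUs per node ($ngpus)" >&2
        exit 1
    fi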
1 change: 1 addition & 0 deletions docs/index.rst
@@ -11,6 +11,7 @@ AI for Oncology Core for Computational Pathology

cli
configuration
distributed
model_zoo
contributing
modules
