[docs] Add docs for non-SLURM cluster setup (#5754)
* Add docs for non-slurm cluster setup

* Apply suggestions from code review

Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>

* Update docs/source/cluster.rst

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>

* Update docs/source/cluster.rst

Co-authored-by: Alexander <alexander@reshytko.com>
Co-authored-by: chaton <thomas@grid.ai>
Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Co-authored-by: Adrian Wälchli <aedu.waelchli@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Rohit Gupta <rohitgr1998@gmail.com>
7 people authored Feb 4, 2021
1 parent 90a813f commit 0742443
Showing 3 changed files with 61 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/source/accelerators.rst
@@ -1,3 +1,6 @@

.. _accelerators:

############
Accelerators
############
57 changes: 57 additions & 0 deletions docs/source/cluster.rst
@@ -0,0 +1,57 @@

.. _non-slurm:

Computing cluster
=================

With Lightning it is easy to run your training script on a computing cluster with almost no modifications to the script.
This guide shows how to run a training job on a general-purpose cluster.

Also, check :ref:`accelerators` for a newer and more general approach to cluster setup.

--------


Cluster setup
-------------

To set up a multi-node computing cluster you need:

1) Multiple computers with PyTorch Lightning installed
2) Network connectivity between them, with firewall rules that allow traffic flow on a specified *MASTER_PORT*
3) The environment variables required for PyTorch Lightning multi-node distributed training defined on each node

PyTorch Lightning follows the design of the `PyTorch distributed communication package <https://pytorch.org/docs/stable/distributed.html#environment-variable-initialization>`_ and requires the following environment variables to be defined on each node:

- *MASTER_PORT* - required; must be a free port on the machine with NODE_RANK 0
- *MASTER_ADDR* - required (except for NODE_RANK 0); address of the node with NODE_RANK 0
- *WORLD_SIZE* - required; the number of nodes in the cluster
- *NODE_RANK* - required; the ID of the node within the cluster
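
As an illustration only (this helper is not part of Lightning), a small check like the hypothetical ``check_cluster_env`` below can be placed at the top of the training script so that a node fails early with a clear message when one of these variables is missing:

.. code-block:: python

    import os

    REQUIRED = ["MASTER_PORT", "WORLD_SIZE", "NODE_RANK"]


    def check_cluster_env():
        """Raise early if the distributed environment variables are missing."""
        missing = [name for name in REQUIRED if name not in os.environ]
        # MASTER_ADDR is only strictly required on nodes other than NODE_RANK 0
        if os.environ.get("NODE_RANK", "0") != "0" and "MASTER_ADDR" not in os.environ:
            missing.append("MASTER_ADDR")
        if missing:
            raise RuntimeError(f"Missing environment variables: {missing}")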


Training script design
----------------------

To train a model using multiple nodes, do the following:

1. Design your :ref:`lightning_module` (no need to add anything specific here).

2. Enable DDP in the trainer:

.. code-block:: python

    # train on 32 GPUs across 4 nodes
    trainer = Trainer(gpus=8, num_nodes=4, accelerator='ddp')

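Putting the two steps together, a minimal script each node could run might look like the following sketch (the ``LitModel`` and random dataset are illustrative placeholders, not part of the original example):

.. code-block:: python

    # train.py - a minimal multi-node training script sketch
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    import pytorch_lightning as pl


    class LitModel(pl.LightningModule):
        def __init__(self):
            super().__init__()
            self.layer = torch.nn.Linear(32, 2)

        def forward(self, x):
            return self.layer(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            return torch.nn.functional.cross_entropy(self(x), y)

        def configure_optimizers(self):
            return torch.optim.SGD(self.parameters(), lr=0.1)


    if __name__ == "__main__":
        dataset = TensorDataset(torch.randn(640, 32), torch.randint(0, 2, (640,)))
        train_loader = DataLoader(dataset, batch_size=32)

        # 8 GPUs per node across 4 nodes -> 32 GPUs in total
        trainer = pl.Trainer(gpus=8, num_nodes=4, accelerator="ddp")
        trainer.fit(LitModel(), train_loader)
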
Submit a job to the cluster
---------------------------

To submit a training job to the cluster, you need to run the same training script on each node of the cluster.
This means that you need to:

1. Copy all third-party libraries to each node (usually this means distributing a ``requirements.txt`` file and installing it on each node).

2. Copy all your import dependencies and the script itself to each node.

3. Run the script on each node (a hypothetical per-node launch sketch follows).
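
For example, assuming the training script is saved as ``train.py``, a hypothetical per-node launcher like the one below (not part of Lightning) sets the shared variables, takes the node's rank as a command-line argument, and starts the identical script:

.. code-block:: python

    # launch_node.py - run as ``python launch_node.py <NODE_RANK>`` on every node
    import os
    import subprocess
    import sys

    # Values that must be identical on every node (example values only)
    os.environ.setdefault("MASTER_ADDR", "10.10.10.1")  # address of the NODE_RANK 0 machine
    os.environ.setdefault("MASTER_PORT", "29500")       # a free port on the NODE_RANK 0 machine
    os.environ.setdefault("WORLD_SIZE", "4")            # number of nodes in the cluster

    # The only value that differs between nodes
    os.environ["NODE_RANK"] = sys.argv[1]

    # Run the same training script on this node
    subprocess.run([sys.executable, "train.py"], check=True)

On the first machine this would be invoked as ``python launch_node.py 0``, on the second as ``python launch_node.py 1``, and so on.
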
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -96,6 +96,7 @@ PyTorch Lightning Documentation
cloud_training
amp
slurm
cluster
child_modules
debugging
loggers
