diff --git a/intermediate_source/ddp_tutorial.rst b/intermediate_source/ddp_tutorial.rst index cff5105fa54..8de44a9c892 100644 --- a/intermediate_source/ddp_tutorial.rst +++ b/intermediate_source/ddp_tutorial.rst @@ -2,7 +2,7 @@ Getting Started with Distributed Data Parallel ================================================= **Author**: `Shen Li `_ -**Edited by**: `Joe Zhu `_ +**Edited by**: `Joe Zhu `_, `Chirag Pandya `__ .. note:: |edit| View and edit this tutorial in `github `__. @@ -15,24 +15,30 @@ Prerequisites: `DistributedDataParallel `__ -(DDP) implements data parallelism at the module level which can run across -multiple machines. Applications using DDP should spawn multiple processes and -create a single DDP instance per process. DDP uses collective communications in the +(DDP) is a powerful module in PyTorch that allows you to parallelize your model training across +multiple machines, making it well suited for large-scale deep learning applications. +To use DDP, you'll need to spawn multiple processes and create a single instance of DDP per process. + +But how does it work? DDP uses collective communications from the `torch.distributed `__ -package to synchronize gradients and buffers. More specifically, DDP registers -an autograd hook for each parameter given by ``model.parameters()`` and the -hook will fire when the corresponding gradient is computed in the backward -pass. Then DDP uses that signal to trigger gradient synchronization across -processes. Please refer to -`DDP design note `__ for more details. +package to synchronize gradients and buffers across all processes. This means that each process will have +its own copy of the model, but they'll all work together to train the model as if it were on a single machine. + +To make this happen, DDP registers an autograd hook for each parameter in the model. +When the backward pass is run, this hook fires and triggers gradient synchronization across all processes. +This ensures that each process has the same gradients, which are then used to update the model. + +For more information on how DDP works and how to use it effectively, be sure to check out the +`DDP design note `__. +With DDP, you can scale training across multiple GPUs and machines with only a few extra lines of code. +The recommended way to use DDP is to spawn one process for each model replica. The model replica can span +multiple devices. DDP processes can be placed on the same machine or across machines. Note that GPU devices +cannot be shared across DDP processes (i.e., one GPU per DDP process). -The recommended way to use DDP is to spawn one process for each model replica, -where a model replica can span multiple devices. DDP processes can be -placed on the same machine or across machines, but GPU devices cannot be -shared across processes. This tutorial starts from a basic DDP use case and -then demonstrates more advanced use cases including checkpointing models and -combining DDP with model parallel. + +In this tutorial, we'll start with a basic DDP use case and then demonstrate more advanced use cases, +including checkpointing models and combining DDP with model parallel. .. note:: @@ -43,25 +49,22 @@ combining DDP with model parallel.
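The flow described above (one process per model replica, a DDP wrapper around the local model, and gradient synchronization during the backward pass) can be condensed into a short sketch. The snippet below is only an illustration under simplifying assumptions, not the tutorial's own example: the worker function name, the toy ``nn.Linear`` model, and the CPU-friendly ``gloo`` backend are placeholders chosen for brevity. The tutorial's full walkthrough appears in the Basic Use Case section below.

.. code:: python

    # Minimal sketch: one process per model replica, each wrapping its replica in DDP.
    import os
    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def example_worker(rank, world_size):
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "29500"
        # Join the process group; every process is one member (one model replica).
        dist.init_process_group("gloo", rank=rank, world_size=world_size)

        model = nn.Linear(10, 5)   # placeholder model; each process builds its own replica
        ddp_model = DDP(model)     # DDP registers the per-parameter autograd hooks here

        loss_fn = nn.MSELoss()
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)

        outputs = ddp_model(torch.randn(20, 10))
        loss_fn(outputs, torch.randn(20, 5)).backward()  # hooks fire; gradients are all-reduced
        optimizer.step()  # every replica applies the same synchronized gradients

        dist.destroy_process_group()

    if __name__ == "__main__":
        world_size = 2
        mp.spawn(example_worker, args=(world_size,), nprocs=world_size, join=True)

Running this script starts ``world_size`` processes on one machine; the same structure extends to one process per GPU (with the ``nccl`` backend) and to multiple machines.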
Comparison between ``DataParallel`` and ``DistributedDataParallel`` ------------------------------------------------------------------- -Before we dive in, let's clarify why, despite the added complexity, you would -consider using ``DistributedDataParallel`` over ``DataParallel``: +Before we dive in, let's clarify why you would consider using ``DistributedDataParallel`` +over ``DataParallel``, despite its added complexity: -- First, ``DataParallel`` is single-process, multi-thread, and only works on a - single machine, while ``DistributedDataParallel`` is multi-process and works - for both single- and multi- machine training. ``DataParallel`` is usually - slower than ``DistributedDataParallel`` even on a single machine due to GIL - contention across threads, per-iteration replicated model, and additional - overhead introduced by scattering inputs and gathering outputs. +- First, ``DataParallel`` is single-process, multi-threaded, but it only works on a + single machine. In contrast, ``DistributedDataParallel`` is multi-process and supports + both single- and multi- machine training. + Due to GIL contention across threads, per-iteration replicated model, and additional overhead introduced by + scattering inputs and gathering outputs, ``DataParallel`` is usually + slower than ``DistributedDataParallel`` even on a single machine. - Recall from the `prior tutorial `__ that if your model is too large to fit on a single GPU, you must use **model parallel** to split it across multiple GPUs. ``DistributedDataParallel`` works with - **model parallel**; ``DataParallel`` does not at this time. When DDP is combined + **model parallel**, while ``DataParallel`` does not at this time. When DDP is combined with model parallel, each DDP process would use model parallel, and all processes collectively would use data parallel. -- If your model needs to span multiple machines or if your use case does not fit - into data parallelism paradigm, please see `the RPC API `__ - for more generic distributed training support. Basic Use Case -------------- @@ -141,6 +144,7 @@ different DDP processes starting from different initial model parameter values. optimizer.step() cleanup() + print(f"Finished running basic DDP example on rank {rank}.") def run_demo(demo_fn, world_size): @@ -149,12 +153,13 @@ different DDP processes starting from different initial model parameter values. nprocs=world_size, join=True) + As you can see, DDP wraps lower-level distributed communication details and provides a clean API as if it were a local model. Gradient synchronization communications take place during the backward pass and overlap with the backward computation. When the ``backward()`` returns, ``param.grad`` already contains the synchronized gradient tensor. For basic use cases, DDP only -requires a few more LoCs to set up the process group. When applying DDP to more +requires a few more lines of code to set up the process group. When applying DDP to more advanced use cases, some caveats require caution. Skewed Processing Speeds @@ -179,13 +184,14 @@ It's common to use ``torch.save`` and ``torch.load`` to checkpoint modules during training and recover from checkpoints. See `SAVING AND LOADING MODELS `__ for more details. When using DDP, one optimization is to save the model in -only one process and then load it to all processes, reducing write overhead. -This is correct because all processes start from the same parameters and +only one process and then load it on all processes, reducing write overhead. 
+This works because all processes start from the same parameters and gradients are synchronized in backward passes, and hence optimizers should keep -setting parameters to the same values. If you use this optimization, make sure no process starts +setting parameters to the same values. +If you use this optimization (i.e., save on one process but restore on all), make sure no process starts loading before the saving is finished. Additionally, when loading the module, you need to provide an appropriate ``map_location`` -argument to prevent a process from stepping into others' devices. If ``map_location`` +argument to prevent processes from stepping into others' devices. If ``map_location`` is missing, ``torch.load`` will first load the module to CPU and then copy each parameter to where it was saved, which would result in all processes on the same machine using the same set of devices. For more advanced failure recovery @@ -218,7 +224,7 @@ and elasticity support, please refer to `TorchElastic `__ command +One can then run a `torch elastic/torchrun `__ command on all nodes to initialize the DDP job created above: .. code:: bash torchrun --nnodes=2 --nproc_per_node=8 --rdzv_id=100 --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:29400 elastic_ddp.py -We are running the DDP script on two hosts, and each host we run with 8 processes, aka, we -are running it on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes. +In the example above, we are running the DDP script on two hosts, with 8 processes on each host. That is, we +are running this job on 16 GPUs. Note that ``$MASTER_ADDR`` must be the same across all nodes. -Here torchrun will launch 8 process and invoke ``elastic_ddp.py`` -on each process on the node it is launched on, but user also needs to apply cluster +Here ``torchrun`` will launch 8 processes and invoke ``elastic_ddp.py`` +in each process on the node it is launched on, but the user also needs to apply cluster management tools like slurm to actually run this command on 2 nodes. For example, on a SLURM enabled cluster, we can write a script to run the command above @@ -368,8 +377,8 @@ and set ``MASTER_ADDR`` as: Then we can just run this script using the SLURM command: ``srun --nodes=2 ./torchrun_script.sh``. -Of course, this is just an example; you can choose your own cluster scheduling tools -to initiate the torchrun job. -For more information about Elastic run, one can check this -`quick start document `__ to learn more. +This is just an example; you can choose your own cluster scheduling tools to initiate the ``torchrun`` job. + +For more information about Elastic run, please see the +`quick start document `__.
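To make the launch flow concrete, here is a rough sketch of the kind of script ``torchrun`` expects. It is an illustrative stand-in rather than the tutorial's ``elastic_ddp.py``: the placeholder ``nn.Linear`` model and the single training step are assumptions made for brevity. The key point is that ``torchrun`` sets ``MASTER_ADDR``, ``MASTER_PORT``, ``RANK``, ``WORLD_SIZE``, and ``LOCAL_RANK`` in the environment, so the script can call ``init_process_group`` without passing an explicit rank or world size.

.. code:: python

    # Sketch of a training script launched by torchrun (one process per GPU).
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # Rank, world size, and rendezvous info come from torchrun's environment variables.
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = nn.Linear(10, 5).to(local_rank)          # placeholder model
        ddp_model = DDP(model, device_ids=[local_rank])

        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.001)
        outputs = ddp_model(torch.randn(20, 10).to(local_rank))
        labels = torch.randn(20, 5).to(local_rank)
        nn.MSELoss()(outputs, labels).backward()          # gradients synchronized across all processes in the job
        optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launching this script with the ``torchrun`` command shown above starts 8 such processes on each of the two nodes, with each process driving one GPU.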