Add DeepSpeed Stage 1 + doc improvements for model parallel (#8974)
* Add stage 1 support + small doc improvements

* Add CHANGELOG.md
Sean Naren authored Aug 18, 2021
1 parent 38ceb89 commit c6b6888
Showing 4 changed files with 30 additions and 5 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -54,6 +54,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
- Added `CheckpointIO` to expose checkpoint IO from training type plugin ([#8743](https://github.com/PyTorchLightning/pytorch-lightning/pull/8743))


- Added DeepSpeed Stage 1 support ([#8974](https://github.com/PyTorchLightning/pytorch-lightning/pull/8974))


### Changed

- Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))
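
For illustration, a minimal sketch of the new parsing behaviour described in that entry (the device count is arbitrary):

    from pytorch_lightning import Trainer

    # With this change, a string value is parsed as a device count:
    # this requests the first 3 GPUs (previously "3" selected the GPU with index 3).
    trainer = Trainer(gpus="3")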
30 changes: 25 additions & 5 deletions docs/source/advanced/advanced_gpu.rst
@@ -202,13 +202,15 @@ DeepSpeed also offers lower level training optimizations, and efficient optimize

Below is a summary of all the configurations of DeepSpeed.

-* :ref:`deepspeed-zero-stage-2` - **Shard optimizer states and gradients**, remains at parity with DDP with memory improvement
+* :ref:`deepspeed-zero-stage-1` - **Shard optimizer states**, remains at speed parity with DDP whilst providing memory improvement

-* :ref:`deepspeed-zero-stage-2-offload` - **Offload optimizer states and gradients to CPU**. Increases communication, but significant memory improvement
+* :ref:`deepspeed-zero-stage-2` - **Shard optimizer states and gradients**, remains at speed parity with DDP whilst providing even more memory improvement

-* :ref:`deepspeed-zero-stage-3` - **Shard optimizer states, gradients, (Optional) activations and parameters**. Increases communication volume, but even more memory improvement
+* :ref:`deepspeed-zero-stage-2-offload` - **Offload optimizer states and gradients to CPU**. Increases distributed communication volume and GPU-CPU device transfer, but provides significant memory improvement

-* :ref:`deepspeed-zero-stage-3-offload` - **Offload optimizer states, gradients, (Optional) activations and parameters to CPU**. Increases communication, but even more significant memory improvement.
+* :ref:`deepspeed-zero-stage-3` - **Shard optimizer states, gradients, parameters and optionally activations**. Increases distributed communication volume, but provides even more memory improvement

+* :ref:`deepspeed-zero-stage-3-offload` - **Offload optimizer states, gradients, parameters and optionally activations to CPU**. Increases distributed communication volume and GPU-CPU device transfer, but even more significant memory improvement.

* :ref:`deepspeed-activation-checkpointing` - **Free activations after forward pass**. Increases computation, but provides memory improvement for all stages.

@@ -227,12 +229,30 @@ If you run into an issue with the install or later in training, ensure that the
When saving a checkpoint we rely on DeepSpeed which saves a directory containing the model and various components.


.. _deepspeed-zero-stage-1:

DeepSpeed ZeRO Stage 1
""""""""""""""""""""""

`DeepSpeed ZeRO Stage 1 <https://www.deepspeed.ai/tutorials/zero/#zero-overview>`_ partitions your optimizer states (Stage 1) across your GPUs to reduce memory.

It is recommended to skip Stage 1 and use Stage 2, which comes with larger memory improvements and still remains efficient. Stage 1 is useful to pair with certain optimizations such as `Torch ORT <https://github.com/pytorch/ort>`__.

.. code-block:: python

    from pytorch_lightning import Trainer

    model = MyModel()
    trainer = Trainer(gpus=4, plugins="deepspeed_stage_1", precision=16)
    trainer.fit(model)
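
The ``"deepspeed_stage_1"`` string above is a registered alias (see the plugin registry change later in this commit, which maps it to ``stage=1``). For completeness, a minimal sketch of the equivalent explicit form, assuming ``DeepSpeedPlugin`` is importable from ``pytorch_lightning.plugins`` as in this release:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    model = MyModel()
    # Equivalent to plugins="deepspeed_stage_1": the registry maps that alias to stage=1.
    trainer = Trainer(gpus=4, plugins=DeepSpeedPlugin(stage=1), precision=16)
    trainer.fit(model)
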
.. _deepspeed-zero-stage-2:

DeepSpeed ZeRO Stage 2
""""""""""""""""""""""

-By default, we enable `DeepSpeed ZeRO Stage 2 <https://www.deepspeed.ai/tutorials/zero/#zero-overview>`_, which partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team.
+`DeepSpeed ZeRO Stage 2 <https://www.deepspeed.ai/tutorials/zero/#zero-overview>`_ partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team.
As a result, benefits can also be seen on a single GPU. Do note that the default bucket sizes allocate around ``3.6GB`` of VRAM to use during distributed communications, which can be tweaked when instantiating the plugin described in a few sections below.
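
A minimal sketch of tweaking those bucket sizes when constructing the plugin explicitly; this assumes the ``DeepSpeedPlugin`` constructor exposes ``allgather_bucket_size`` and ``reduce_bucket_size`` arguments, and the values shown are purely illustrative:

.. code-block:: python

    from pytorch_lightning import Trainer
    from pytorch_lightning.plugins import DeepSpeedPlugin

    # Smaller buckets reserve less VRAM for communication buffers at the cost of
    # more (smaller) collective operations.
    trainer = Trainer(
        gpus=4,
        plugins=DeepSpeedPlugin(stage=2, allgather_bucket_size=2e8, reduce_bucket_size=2e8),
        precision=16,
    )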

.. code-block:: python
1 change: 1 addition & 0 deletions pytorch_lightning/plugins/training_type/deepspeed.py
@@ -795,6 +795,7 @@ def update_global_step(self, total_batch_idx: int, current_global_step: int) ->
@classmethod
def register_plugins(cls, plugin_registry: Dict) -> None:
plugin_registry.register("deepspeed", cls, description="Default DeepSpeed Plugin")
plugin_registry.register("deepspeed_stage_1", cls, description="DeepSpeed with ZeRO Stage 1 enabled", stage=1)
plugin_registry.register("deepspeed_stage_2", cls, description="DeepSpeed with ZeRO Stage 2 enabled", stage=2)
plugin_registry.register(
"deepspeed_stage_2_offload",
1 change: 1 addition & 0 deletions tests/plugins/test_plugins_registry.py
@@ -56,6 +56,7 @@ def __init__(self, param1, param2):
"plugin_name, init_params",
[
("deepspeed", {}),
("deepspeed_stage_1", {"stage": 1}),
("deepspeed_stage_2", {"stage": 2}),
("deepspeed_stage_2_offload", {"stage": 2, "offload_optimizer": True}),
("deepspeed_stage_3", {"stage": 3}),
