diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9ff44b7d30770..99996c6281938 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -54,6 +54,9 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
 - Added `CheckpointIO` to expose checkpoint IO from training type plugin ([#8743](https://github.com/PyTorchLightning/pytorch-lightning/pull/8743))
 
 
+- Added DeepSpeed Stage 1 support ([#8974](https://github.com/PyTorchLightning/pytorch-lightning/pull/8974))
+
+
 ### Changed
 
 - Parsing of the `gpus` Trainer argument has changed: `gpus="n"` (str) no longer selects the GPU index n and instead selects the first n devices. ([#8770](https://github.com/PyTorchLightning/pytorch-lightning/pull/8770))
diff --git a/docs/source/advanced/advanced_gpu.rst b/docs/source/advanced/advanced_gpu.rst
index 77e27457aa16b..9415ad2e16948 100644
--- a/docs/source/advanced/advanced_gpu.rst
+++ b/docs/source/advanced/advanced_gpu.rst
@@ -202,13 +202,15 @@ DeepSpeed also offers lower level training optimizations, and efficient optimize
 
 Below is a summary of all the configurations of DeepSpeed.
 
-* :ref:`deepspeed-zero-stage-2` - **Shard optimizer states and gradients**, remains at parity with DDP with memory improvement
+* :ref:`deepspeed-zero-stage-1` - **Shard optimizer states**, remains at speed parity with DDP whilst providing memory improvement
 
-* :ref:`deepspeed-zero-stage-2-offload` - **Offload optimizer states and gradients to CPU**. Increases communication, but significant memory improvement
+* :ref:`deepspeed-zero-stage-2` - **Shard optimizer states and gradients**, remains at speed parity with DDP whilst providing even more memory improvement
 
-* :ref:`deepspeed-zero-stage-3` - **Shard optimizer states, gradients, (Optional) activations and parameters**. Increases communication volume, but even more memory improvement
+* :ref:`deepspeed-zero-stage-2-offload` - **Offload optimizer states and gradients to CPU**. Increases distributed communication volume and GPU-CPU device transfer, but provides significant memory improvement
 
-* :ref:`deepspeed-zero-stage-3-offload` - **Offload optimizer states, gradients, (Optional) activations and parameters to CPU**. Increases communication, but even more signficant memory improvement.
+* :ref:`deepspeed-zero-stage-3` - **Shard optimizer states, gradients, parameters and optionally activations**. Increases distributed communication volume, but provides even more memory improvement
+
+* :ref:`deepspeed-zero-stage-3-offload` - **Offload optimizer states, gradients, parameters and optionally activations to CPU**. Increases distributed communication volume and GPU-CPU device transfer, but provides even more significant memory improvement.
 
 * :ref:`deepspeed-activation-checkpointing` - **Free activations after forward pass**. Increases computation, but provides memory improvement for all stages.
 
@@ -227,12 +229,30 @@ If you run into an issue with the install or later in training, ensure that the
 
 When saving a checkpoint we rely on DeepSpeed which saves a directory containing the model and various components.
 
+.. _deepspeed-zero-stage-1:
+
+DeepSpeed ZeRO Stage 1
+""""""""""""""""""""""
+
+`DeepSpeed ZeRO Stage 1 `_ partitions your optimizer states (Stage 1) across your GPUs to reduce memory.
+
+It is recommended to skip Stage 1 and use Stage 2, which comes with larger memory improvements and still remains efficient. Stage 1 is useful to pair with certain optimizations such as `Torch ORT `__.
+
+.. code-block:: python
+
+    from pytorch_lightning import Trainer
+
+    model = MyModel()
+    trainer = Trainer(gpus=4, plugins="deepspeed_stage_1", precision=16)
+    trainer.fit(model)
+
+
 .. _deepspeed-zero-stage-2:
 
 DeepSpeed ZeRO Stage 2
 """"""""""""""""""""""
 
-By default, we enable `DeepSpeed ZeRO Stage 2 `_, which partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team.
+`DeepSpeed ZeRO Stage 2 `_ partitions your optimizer states (Stage 1) and your gradients (Stage 2) across your GPUs to reduce memory. In most cases, this is more efficient or at parity with DDP, primarily due to the optimized custom communications written by the DeepSpeed team.
 As a result, benefits can also be seen on a single GPU. Do note that the default bucket sizes allocate around ``3.6GB`` of VRAM to use during distributed communications, which can be tweaked when instantiating the plugin described in a few sections below.
 
 .. code-block:: python
diff --git a/pytorch_lightning/plugins/training_type/deepspeed.py b/pytorch_lightning/plugins/training_type/deepspeed.py
index 940fe6cf4032e..31fdfde234462 100644
--- a/pytorch_lightning/plugins/training_type/deepspeed.py
+++ b/pytorch_lightning/plugins/training_type/deepspeed.py
@@ -795,6 +795,7 @@ def update_global_step(self, total_batch_idx: int, current_global_step: int) ->
     @classmethod
     def register_plugins(cls, plugin_registry: Dict) -> None:
         plugin_registry.register("deepspeed", cls, description="Default DeepSpeed Plugin")
+        plugin_registry.register("deepspeed_stage_1", cls, description="DeepSpeed with ZeRO Stage 1 enabled", stage=1)
         plugin_registry.register("deepspeed_stage_2", cls, description="DeepSpeed with ZeRO Stage 2 enabled", stage=2)
         plugin_registry.register(
             "deepspeed_stage_2_offload",
diff --git a/tests/plugins/test_plugins_registry.py b/tests/plugins/test_plugins_registry.py
index 9295c8a757b08..0cbf9bdd7827e 100644
--- a/tests/plugins/test_plugins_registry.py
+++ b/tests/plugins/test_plugins_registry.py
@@ -56,6 +56,7 @@ def __init__(self, param1, param2):
     "plugin_name, init_params",
     [
         ("deepspeed", {}),
+        ("deepspeed_stage_1", {"stage": 1}),
         ("deepspeed_stage_2", {"stage": 2}),
         ("deepspeed_stage_2_offload", {"stage": 2, "offload_optimizer": True}),
         ("deepspeed_stage_3", {"stage": 3}),
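
As a brief illustration of the mechanism this PR extends: `register_plugins` maps a string name to the plugin class plus the keyword arguments it should later be constructed with, which is what the parametrized test checks through `init_params`. The sketch below is a minimal, self-contained approximation of that registry pattern; `PluginRegistry` and the `DeepSpeedPlugin` stub are simplified stand-ins rather than Lightning's actual implementations, and only the registered names, descriptions, and `stage` values are taken from the diff.

```python
from typing import Any, Dict, Type


class PluginRegistry:
    """Simplified stand-in for Lightning's plugin registry (illustration only)."""

    def __init__(self) -> None:
        self._registry: Dict[str, Dict[str, Any]] = {}

    def register(self, name: str, plugin_cls: Type, description: str = "", **init_params: Any) -> None:
        # Store the class together with the kwargs it should later be built with.
        self._registry[name] = {"cls": plugin_cls, "description": description, "init_params": init_params}

    def get(self, name: str) -> Any:
        # Resolve a registered name into a configured plugin instance.
        entry = self._registry[name]
        return entry["cls"](**entry["init_params"])


class DeepSpeedPlugin:
    """Stub plugin; only the `stage` semantics mirror the diff."""

    def __init__(self, stage: int = 2, offload_optimizer: bool = False) -> None:
        self.stage = stage
        self.offload_optimizer = offload_optimizer


registry = PluginRegistry()
# Mirrors the registrations in deepspeed.py, including the new Stage 1 entry.
registry.register("deepspeed_stage_1", DeepSpeedPlugin, description="DeepSpeed with ZeRO Stage 1 enabled", stage=1)
registry.register("deepspeed_stage_2", DeepSpeedPlugin, description="DeepSpeed with ZeRO Stage 2 enabled", stage=2)

plugin = registry.get("deepspeed_stage_1")
assert plugin.stage == 1  # matches the expected init_params in the new test case
```

With this pattern, passing `plugins="deepspeed_stage_1"` to the `Trainer` (as shown in the docs snippet above) resolves to a DeepSpeed plugin configured with `stage=1`, which is what the new `("deepspeed_stage_1", {"stage": 1})` test case asserts.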