Add LightningLite documentation #10043

Merged: 49 commits, Oct 22, 2021

Changes from all commits

4f51ade  update (tchaton, Oct 20, 2021)
a9025d5  add supported mechanism (tchaton, Oct 20, 2021)
7fe1189  add changelog (tchaton, Oct 20, 2021)
447fc68  add warning (tchaton, Oct 20, 2021)
2e43fa1  better title (tchaton, Oct 20, 2021)
d4d2d7b  update (tchaton, Oct 20, 2021)
eec67aa  add deepspeed (tchaton, Oct 20, 2021)
6d27edf  update (tchaton, Oct 20, 2021)
3e12242  update (tchaton, Oct 20, 2021)
8c1def5  add gif (tchaton, Oct 21, 2021)
96cb3b6  update docs (tchaton, Oct 21, 2021)
646272c  Merge branch 'lite-poc' into lite_doc (tchaton, Oct 21, 2021)
5526905  Merge branch 'lite_doc' of https://github.com/PyTorchLightning/pytorc… (tchaton, Oct 21, 2021)
52a853d  update (tchaton, Oct 21, 2021)
8481d45  rename lite (tchaton, Oct 21, 2021)
ab63c95  remove duplicat (tchaton, Oct 21, 2021)
9c2bbad  Merge branch 'lite-poc' into lite_doc (tchaton, Oct 21, 2021)
7a2c09e  update on comments (tchaton, Oct 21, 2021)
b20e3b7  update (tchaton, Oct 21, 2021)
f1f0ad0  update (tchaton, Oct 21, 2021)
5f70bbe  update (tchaton, Oct 21, 2021)
77d8222  update (tchaton, Oct 21, 2021)
1bc4eb7  typo (tchaton, Oct 21, 2021)
b5223bb  update (tchaton, Oct 21, 2021)
c328c57  update (tchaton, Oct 21, 2021)
6af4d9a  update (tchaton, Oct 21, 2021)
53aa9e6  update (tchaton, Oct 21, 2021)
7a81557  update (tchaton, Oct 21, 2021)
7219b47  update (tchaton, Oct 21, 2021)
b07f90c  apply comments (tchaton, Oct 21, 2021)
f97b075  update (tchaton, Oct 21, 2021)
c468ba2  update (tchaton, Oct 21, 2021)
6062834  update (tchaton, Oct 21, 2021)
7355f6e  update (tchaton, Oct 21, 2021)
a708f29  update (tchaton, Oct 21, 2021)
56305fa  update (tchaton, Oct 22, 2021)
a1fbb95  update (tchaton, Oct 22, 2021)
c34377e  typo (tchaton, Oct 22, 2021)
04d69e8  update (tchaton, Oct 22, 2021)
a7afbb1  update (tchaton, Oct 22, 2021)
123ea7a  update (tchaton, Oct 22, 2021)
5cc2ccd  update header in table (awaelchli, Oct 22, 2021)
6a0f89c  updarte (tchaton, Oct 22, 2021)
54871d7  update (tchaton, Oct 22, 2021)
dc77035  update (tchaton, Oct 22, 2021)
5e320a8  Merge remote-tracking branch 'origin/lite_doc' into lite_doc (awaelchli, Oct 22, 2021)
1c41f6b  update link (awaelchli, Oct 22, 2021)
f2d0c18  loss (rohitgr7, Oct 22, 2021)
c81a04c  transpose table (awaelchli, Oct 22, 2021)

2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -215,7 +215,7 @@ The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/).
* Implemented `DeepSpeedPlugin._setup_models_and_optimizers` ([#10009](https://github.com/PyTorchLightning/pytorch-lightning/pull/10009))
* Implemented `{DDPShardedPlugin,DDPShardedSpawnPlugin}._setup_models_and_optimizers` ([#10028](https://github.com/PyTorchLightning/pytorch-lightning/pull/10028))
* Added optional `model` argument to the `optimizer_step` methods in accelerators and plugins ([#10023](https://github.com/PyTorchLightning/pytorch-lightning/pull/10023))

* Add `LightningLite` documentation ([#10043](https://github.com/PyTorchLightning/pytorch-lightning/pull/10043))


- Added `XLACheckpointIO` plugin ([#9972](https://github.com/PyTorchLightning/pytorch-lightning/pull/9972))
11 changes: 11 additions & 0 deletions docs/source/api_references.rst
@@ -243,6 +243,17 @@ Trainer API

trainer

LightningLite API
-----------------

.. currentmodule:: pytorch_lightning.lite

.. autosummary::
:toctree: api
:nosignatures:

LightningLite

Tuner API
---------

10 changes: 7 additions & 3 deletions docs/source/conf.py
@@ -16,6 +16,7 @@
import os
import shutil
import sys
import warnings
from importlib.util import module_from_spec, spec_from_file_location

import pt_lightning_sphinx_theme
@@ -26,10 +27,13 @@
sys.path.insert(0, os.path.abspath(PATH_ROOT))
sys.path.append(os.path.join(PATH_RAW_NB, ".actions"))

_SHOULD_COPY_NOTEBOOKS = True

try:
from helpers import HelperCLI
except Exception:
raise ModuleNotFoundError("To build the code, please run: `git submodule update --init --recursive`")
_SHOULD_COPY_NOTEBOOKS = False
warnings.warn("To build the code, please run: `git submodule update --init --recursive`", stacklevel=2)

FOLDER_GENERATED = "generated"
SPHINX_MOCK_REQUIREMENTS = int(os.environ.get("SPHINX_MOCK_REQUIREMENTS", True))
@@ -41,8 +45,8 @@
spec.loader.exec_module(about)

# -- Project documents -------------------------------------------------------

HelperCLI.copy_notebooks(PATH_RAW_NB, PATH_HERE, "notebooks")
if _SHOULD_COPY_NOTEBOOKS:
HelperCLI.copy_notebooks(PATH_RAW_NB, PATH_HERE, "notebooks")


def _transform_changelog(path_in: str, path_out: str) -> None:
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -21,6 +21,7 @@ PyTorch Lightning
starter/new-project
starter/converting
starter/rapid_prototyping_templates
starter/lightning_lite

.. toctree::
:maxdepth: 1
@@ -33,7 +34,6 @@ PyTorch Lightning
Lightning project template<https://github.com/PyTorchLightning/pytorch-lightning-conference-seed>
benchmarking/benchmarks


.. toctree::
:maxdepth: 2
:name: pl_docs
284 changes: 284 additions & 0 deletions docs/source/starter/lightning_lite.rst
@@ -0,0 +1,284 @@
###########################################
LightningLite - Stepping Stone to Lightning
###########################################


.. image:: https://pl-public-data.s3.amazonaws.com/docs/static/images/lite/lightning_lite.gif
:alt: Animation showing how to convert a standard training loop to a Lightning loop



:class:`~pytorch_lightning.lite.LightningLite` enables pure PyTorch users to scale their existing code
on any kind of device while retaining full control over their own loops and optimization logic.

:class:`~pytorch_lightning.lite.LightningLite` is the right tool for you if you match one of the two following descriptions:

- I want to quickly scale my existing code to multiple devices with minimal code changes.

- I would like to convert my existing code to the Lightning API, but a full path to Lightning transition might be too complex. I am looking for a stepping stone to ensure reproducibility during the transition.

Supported Integrations
======================

:class:`~pytorch_lightning.lite.LightningLite` supports single and multiple models and optimizers.

.. list-table::
:widths: 50 50
:header-rows: 1

* - LightningLite arguments
- Possible choices
* - ``accelerator``
- ``cpu``, ``gpu``, ``tpu``, ``auto``
* - ``strategy``
- ``dp``, ``ddp``, ``ddp_spawn``, ``ddp_sharded``, ``ddp_sharded_spawn``, ``deepspeed``
* - ``precision``
- ``16``, ``bf16``, ``32``, ``64``
* - ``clusters``
- ``TorchElastic``, ``SLURM``, ``Kubeflow``, ``LSF``


Coming soon: IPU accelerator, support for Horovod as a strategy and fully sharded training.
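
To make the table above concrete, here is a minimal sketch of how these arguments combine when instantiating :class:`~pytorch_lightning.lite.LightningLite`. The ``Lite`` subclass and the ``num_epochs`` argument are the hypothetical ones defined in the conversion example further below.

.. code-block:: python

    # Run the same training code on 4 GPUs with DDP and 16-bit precision;
    # no changes to the training loop itself are required.
    Lite(accelerator="gpu", strategy="ddp", devices=4, precision=16).run(num_epochs=10)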


################
Learn by example
################

My existing PyTorch code
========================

The ``run`` function contains a custom training loop used to train ``MyModel`` on ``MyDataset`` for ``num_epochs`` epochs.

.. code-block:: python

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset


class MyModel(nn.Module):
...


class MyDataset(Dataset):
...


def run(num_epochs: int):

device = "cuda" if torch.cuda.is_available() else "cpu"

model = MyModel(...).to(device)
optimizer = torch.optim.SGD(model.parameters(), ...)

dataloader = DataLoader(MyDataset(...), ...)

model.train()
for epoch in range(num_epochs):
for batch in dataloader:
batch = batch.to(device)
optimizer.zero_grad()
loss = model(batch)
loss.backward()
optimizer.step()


run(10)

Convert to LightningLite
========================

Here are the four steps required to convert your code to :class:`~pytorch_lightning.lite.LightningLite`:

1. Subclass :class:`~pytorch_lightning.lite.LightningLite` and override its :meth:`~pytorch_lightning.lite.LightningLite.run` method.
2. Move the body of your existing ``run`` function into it.
3. Apply :meth:`~pytorch_lightning.lite.LightningLite.setup` to each model and optimizer pair, apply :meth:`~pytorch_lightning.lite.LightningLite.setup_dataloaders` to all your dataloaders, and replace ``loss.backward()`` with ``self.backward(loss)``.
4. Instantiate your :class:`~pytorch_lightning.lite.LightningLite` and call its :meth:`~pytorch_lightning.lite.LightningLite.run` method.


.. code-block:: python

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning.lite import LightningLite


class MyModel(nn.Module):
...


class MyDataset(Dataset):
...


class Lite(LightningLite):
def run(self, num_epochs: int):

model = MyModel(...)
optimizer = torch.optim.SGD(model.parameters(), ...)

model, optimizer = self.setup(model, optimizer)

dataloader = DataLoader(MyDataset(...), ...)
dataloader = self.setup_dataloaders(dataloader)

model.train()
for epoch in range(num_epochs):
for batch in dataloader:
optimizer.zero_grad()
loss = model(batch)
self.backward(loss)
optimizer.step()


Lite(...).run(10)


That's all. You can now train on any kind of device and scale your training.

The :class:`~pytorch_lightning.lite.LightningLite` takes care of device management, so you don't have to.

You can remove any device-specific logic from your code.

Here is how to train on 8 GPUs with `torch.bfloat16 <https://pytorch.org/docs/1.10.0/generated/torch.Tensor.bfloat16.html>`_ precision:

.. code-block:: python

Lite(strategy="ddp", devices=8, accelerator="gpu", precision="bf16").run(10)

Here is how to use `DeepSpeed Zero3 <https://www.deepspeed.ai/news/2021/03/07/zero3-offload.html>`_ with 8 GPUs and precision 16:

.. code-block:: python

Lite(strategy="deepspeed", devices=8, accelerator="gpu", precision=16).run(10)

Lightning can also figure it out automatically for you!

.. code-block:: python

Lite(devices="auto", accelerator="auto", precision=16).run(10)


You can also easily use distributed collectives if required.
Here is an example running on 256 GPUs (8 GPUs on each of 32 nodes).

.. code-block:: python

class Lite(LightningLite):
def run(self):

# Transfer and concatenate tensors across processes
self.all_gather(...)

# Transfer an object from one process to all the others
self.broadcast(..., src=...)

# The total number of processes running across all devices and nodes.
self.world_size

# The global index of the current process across all devices and nodes.
self.global_rank

# The index of the current process among the processes running on the local node.
self.local_rank

# The index of the current node.
self.node_rank

# Whether this global rank is rank zero.
if self.is_global_zero:
# do something on rank 0
...

# Wait for all processes to enter this call.
self.barrier()

# Reduce a boolean decision across processes.
self.reduce_decision(...)


Lite(strategy="ddp", gpus=8, num_nodes=32, accelerator="gpu").run()


.. note:: We recommend instantiating the models within the :meth:`~pytorch_lightning.lite.LightningLite.run` method, as large models could otherwise cause an out-of-memory error.
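
As a minimal sketch of that recommendation (reusing the hypothetical ``MyModel`` from the examples above), the model is created inside :meth:`~pytorch_lightning.lite.LightningLite.run` rather than in the launching script:

.. code-block:: python

    class Lite(LightningLite):
        def run(self):
            # Creating the (potentially very large) model here, after the processes and
            # strategy have been set up, avoids the out-of-memory errors mentioned above.
            model = MyModel(...)
            optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
            model, optimizer = self.setup(model, optimizer)
            ...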


Distributed Training Pitfalls
=============================
Comment on lines +208 to +209

Contributor: Maybe a better title to this is "When to convert to Lightning: Distributed Training".

Contributor (author): @awaelchli Any suggestions? I prefer the current version.


The :class:`~pytorch_lightning.lite.LightningLite` provides you only with the tools to scale your training,
but several major challenges still lie ahead of you:


.. list-table::
:widths: 50 50
:header-rows: 0

* - Processes divergence
- This happens when processes execute different sections of the code due to different if/else conditions, race conditions on existing files, etc., resulting in hangs (see the sketch after this table).
* - Cross-process reduction
- Wrongly reported metrics or gradients due to faulty reduction across processes.
* - Large sharded models
- Instantiation, materialization and state management of large models.
* - Rank 0 only actions
- Logging, profiling, etc.
* - Checkpointing / Early stopping / Callbacks
- Ability to easily customize your training behaviour and make it stateful.
* - Batch-level fault tolerance training
- Ability to resume from a failure as if it never happened.
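
For instance, here is a minimal sketch of a process-divergence hang (assuming the ``Lite`` subclass from above): only rank 0 enters a branch containing a collective call, so the other ranks never reach it and the job stalls.

.. code-block:: python

    class Lite(LightningLite):
        def run(self):
            if self.is_global_zero:
                # Only rank 0 reaches this barrier; the other ranks never call it,
                # so rank 0 waits here forever and the whole job hangs.
                self.barrier()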


If you are facing any of these challenges, you have reached the limits of :class:`~pytorch_lightning.lite.LightningLite`.
We recommend converting to :doc:`Lightning <../starter/new-project>`, so you never have to worry about them.

Convert to Lightning
====================

The :class:`~pytorch_lightning.lite.LightningLite` is a stepping stone to transition fully to the Lightning API and benefit
from its hundreds of features.

.. code-block:: python

from pytorch_lightning import LightningDataModule, LightningModule, Trainer, seed_everything


class LiftModel(LightningModule):
def __init__(self, module: nn.Module):
super().__init__()
self.module = module

def forward(self, x):
return self.module(x)

def training_step(self, batch, batch_idx):
loss = self(batch)
self.log("train_loss", loss)
return loss

def validation_step(self, batch, batch_idx):
loss = self(batch)
self.log("val_loss", loss)
return loss

def configure_optimizers(self):
return torch.optim.SGD(self.parameters(), lr=0.001)


class BoringDataModule(LightningDataModule):
def __init__(self, dataset: Dataset):
super().__init__()
self.dataset = dataset

def train_dataloader(self):
return DataLoader(self.dataset)


seed_everything(42)
model = MyModel(...)
lightning_module = LiftModel(model)
dataset = MyDataset(...)
datamodule = BoringDataModule(dataset)
trainer = Trainer(max_epochs=10)
trainer.fit(lightning_module, datamodule=datamodule)
2 changes: 1 addition & 1 deletion pytorch_lightning/lite/lite.py
@@ -316,7 +316,7 @@ def print(self, *args: Any, **kwargs: Any) -> None:

def barrier(self) -> None:
"""Wait for all processes to enter this call. Use this to synchronize all parallel processes, but only if
necessary, otherwhise the overhead of synchronization will cause your program to slow down.
necessary, otherwise the overhead of synchronization will cause your program to slow down.

Example::
