Commit

[doc][c10d] Fixes to FSDP tutorial
Summary:
Fix up the actual FSDP tutorial to get it running again.
https://github.com/pytorch/examples/pull/1297/files
This tutorial is referred to in this document.

In addition to this, minor fixups in this document.
1. Fix typo in link.
2. Add a grid card with pre-requisites and what you will learn.
3. Add more links to actual FSDP paper.
4. Stop referring to PyTorch nightly; instead, tell the reader to get the
   latest PyTorch, as FSDP has been released for a while.

Test Plan:
Render and examine.

Reviewers:

Subscribers:

Tasks:

Tags:
c-p-i-o committed Oct 31, 2024
1 parent c7e1819 commit 70dc763
Showing 3 changed files with 57 additions and 42 deletions.
2 changes: 1 addition & 1 deletion distributed/home.rst
@@ -77,7 +77,7 @@ Learn FSDP

.. grid-item-card:: :octicon:`file-code;1em`
FSDP Advanced
-   :link: https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
+   :link: https://pytorch.org/tutorials/intermediate/FSDP_advanced_tutorial.html?utm_source=distr_landing&utm_medium=FSDP_advanced
:link-type: url

In this tutorial, you will learn how to fine-tune a HuggingFace (HF) T5
4 changes: 2 additions & 2 deletions index.rst
@@ -763,7 +763,7 @@ Welcome to PyTorch Tutorials
:header: Advanced Model Training with Fully Sharded Data Parallel (FSDP)
:card_description: Explore advanced model training with Fully Sharded Data Parallel package.
:image: _static/img/thumbnails/cropped/Getting-Started-with-FSDP.png
-   :link: intermediate/FSDP_adavnced_tutorial.html
+   :link: intermediate/FSDP_advanced_tutorial.html
:tags: Parallel-and-Distributed-Training

.. customcarditem::
@@ -1115,7 +1115,7 @@ Additional Resources
intermediate/ddp_tutorial
intermediate/dist_tuto
intermediate/FSDP_tutorial
-   intermediate/FSDP_adavnced_tutorial
+   intermediate/FSDP_advanced_tutorial
intermediate/TCPStore_libuv_backend
intermediate/TP_tutorial
intermediate/pipelining_tutorial
@@ -6,25 +6,44 @@ Wright <https://github.com/lessw2020>`__, `Rohan Varma
<https://github.com/rohan-varma/>`__, `Yanli Zhao
<https://github.com/zhaojuanmao>`__

.. grid:: 2

   .. grid-item-card:: :octicon:`mortar-board;1em;` What you will learn
      :class-card: card-prerequisites

      * PyTorch's Fully Sharded Data Parallel Module: A wrapper for sharding module parameters across
        data parallel workers.

   .. grid-item-card:: :octicon:`list-unordered;1em;` Prerequisites
      :class-card: card-prerequisites

      * PyTorch 1.12 or later
      * Read about the `FSDP API <https://pytorch.org/docs/main/fsdp.html>`__.

This tutorial introduces more advanced features of Fully Sharded Data Parallel
(FSDP) as part of the PyTorch 1.12 release. To get familiar with FSDP, please
refer to the `FSDP getting started tutorial
<https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html>`__.

In this tutorial, we fine-tune a HuggingFace (HF) T5 model with FSDP for text
summarization as a working example.

The example uses Wikihow and for simplicity, we will showcase the training on a
-single node, P4dn instance with 8 A100 GPUs. We will soon have a blog post on
-large scale FSDP training on a multi-node cluster, please stay tuned for that on
-the PyTorch medium channel.
+single node, P4dn instance with 8 A100 GPUs. We now have several blog posts (
+`(link1), <https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/>`__
+`(link2) <https://engineering.fb.com/2021/07/15/open-source/fsdp/>`__)
+and a `paper <https://arxiv.org/abs/2304.11277>`__ on
+large scale FSDP training on a multi-node cluster.

FSDP is a production-ready package with a focus on ease of use, performance, and
long-term support. One of the main benefits of FSDP is reducing the memory
footprint on each GPU. This enables training of larger models with lower total
memory vs DDP, and leverages the overlap of computation and communication to
train models efficiently.
This reduced memory pressure can be leveraged to either train larger models or
increase batch size, potentially helping overall training throughput. You can
read more about PyTorch FSDP `here
@@ -47,21 +66,21 @@ Recap on How FSDP Works

At a high level, FSDP works as follows (a minimal code sketch follows this list):

-*In constructor*
+*In the constructor*

* Shard model parameters and each rank only keeps its own shard

-*In forward pass*
+*In the forward pass*

* Run `all_gather` to collect all shards from all ranks to recover the full
-  parameter for this FSDP unit Run forward computation
-* Discard non-owned parameter shards it has just collected to free memory
+  parameter for this FSDP unit and run the forward computation
+* Discard the non-owned parameter shards it has just collected to free memory

*In backward pass*
*In the backward pass*

* Run `all_gather` to collect all shards from all ranks to recover the full
-  parameter in this FSDP unit Run backward computation
+  parameter in this FSDP unit and run backward computation
* Discard non-owned parameters to free memory.
* Run reduce_scatter to sync gradients
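
To make this flow concrete, here is a minimal sketch of wrapping a toy model in
FSDP and running a single training step under `torchrun` (the toy model,
optimizer, and hyperparameters are illustrative and not part of the T5 example
that follows):

.. code-block:: python

    import os
    import torch
    import torch.nn as nn
    import torch.distributed as dist
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).to(local_rank)  # toy model for illustration
    model = FSDP(model)  # parameters are sharded across data parallel workers

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    x = torch.randn(8, 1024, device=local_rank)

    out = model(x)    # all_gather full params for the FSDP unit, run forward, free shards
    loss = out.sum()
    loss.backward()   # all_gather again for backward, then reduce_scatter gradients
    optimizer.step()

    dist.destroy_process_group()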


@@ -80,15 +99,11 @@ examples

*Setup*

-1.1 Install PyTorch Nightlies
-
-We will install PyTorch nightlies, as some of the features such as activation
-checkpointing is available in nightlies and will be added in next PyTorch
-release after 1.12.
+1.1 Install the latest PyTorch

.. code-block:: bash

-   pip3 install --pre torch torchvision torchaudio -f https://download.pytorch.org/whl/nightly/cu113/torch_nightly.html
+   pip3 install torch torchvision torchaudio
1.2 Dataset Setup

@@ -154,7 +169,7 @@ Next, we add the following code snippets to a Python script “T5_training.py”
import tqdm
from datetime import datetime
1.4 Distributed training setup.
Here we use two helper functions to initialize the processes for distributed
training, and then to clean up after training completion. In this tutorial, we
are going to use torch elastic, using `torchrun
@@ -191,13 +206,13 @@ metrics.
    date_of_run = datetime.now().strftime("%Y-%m-%d-%I:%M:%S_%p")
    print(f"--> current date and time of run = {date_of_run}")
    return date_of_run

def format_metrics_to_gb(item):
    """quick function to format numbers to gigabyte and round to 4 digit precision"""
    metric_num = item / g_gigabyte
    metric_num = round(metric_num, ndigits=4)
    return metric_num
2.2 Define a train function:

@@ -275,7 +290,7 @@ metrics.

.. code-block:: python

    def fsdp_main(args):
        model, tokenizer = setup_model("t5-base")
@@ -292,7 +307,7 @@ metrics.
        #wikihow(tokenizer, type_path, num_samples, input_length, output_length, print_text=False)
        train_dataset = wikihow(tokenizer, 'train', 1500, 512, 150, False)
        val_dataset = wikihow(tokenizer, 'validation', 300, 512, 150, False)

        sampler1 = DistributedSampler(train_dataset, rank=rank, num_replicas=world_size, shuffle=True)
@@ -430,7 +445,7 @@ metrics.

.. code-block:: python

    if __name__ == '__main__':
        # Training settings
        parser = argparse.ArgumentParser(description='PyTorch T5 FSDP Example')
@@ -463,7 +478,7 @@ metrics.
To run the training using torchrun:

.. code-block:: bash

    torchrun --nnodes 1 --nproc_per_node 4 T5_training.py
@@ -487,7 +502,7 @@ communication efficient. In PyTorch 1.12, FSDP added this support and now we
have a wrapping policy for transformers.

It can be created as follows, where the T5Block represents the T5 transformer
layer class (holding MHSA and FFN).


.. code-block:: python
@@ -499,7 +514,7 @@ layer class (holding MHSA and FFN).
        },
    )
    torch.cuda.set_device(local_rank)

    model = FSDP(model,
        auto_wrap_policy=t5_auto_wrap_policy)
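
Put together, the construction of this policy looks roughly as follows (a
sketch; `T5Block` comes from the HF `transformers` T5 implementation and the
variable name matches the snippet above):

.. code-block:: python

    import functools
    from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
    from transformers.models.t5.modeling_t5 import T5Block

    t5_auto_wrap_policy = functools.partial(
        transformer_auto_wrap_policy,
        transformer_layer_cls={
            T5Block,  # each T5 transformer layer becomes its own FSDP unit
        },
    )
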
@@ -513,22 +528,22 @@ Mixed Precision
FSDP supports flexible mixed precision training allowing for arbitrary reduced
precision types (such as fp16 or bfloat16). Currently BFloat16 is only available
on Ampere GPUs, so you need to confirm native support before you use it. On
-V100s for example, BFloat16 can still be run but due to it running non-natively,
+V100s for example, BFloat16 can still be run but because it runs non-natively,
it can result in significant slowdowns.

To check if BFloat16 is natively supported, you can use the following:

.. code-block:: python

    bf16_ready = (
        torch.version.cuda
        and torch.cuda.is_bf16_supported()
        and LooseVersion(torch.version.cuda) >= "11.0"
        and dist.is_nccl_available()
        and nccl.version() >= (2, 10)
    )
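
The result of this check can then be used to decide which mixed precision
policy to pass to FSDP, for example (a sketch; `bfSixteen` refers to the
bfloat16 policy used with the FSDP wrapper below):

.. code-block:: python

    mp_policy = bfSixteen if bf16_ready else None  # None falls back to full fp32
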
-One of the advantages of mixed percision in FSDP is providing granular control
+One of the advantages of mixed precision in FSDP is providing granular control
over different precision levels for parameters, gradients, and buffers as
follows:

@@ -571,7 +586,7 @@ with the following policy:
.. code-block:: python

    grad_bf16 = MixedPrecision(reduce_dtype=torch.bfloat16)
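
For comparison, a bfloat16 policy covering parameters, gradient communication,
and buffers could be sketched as follows (the name `bfSixteen` mirrors the
policy passed to the FSDP wrapper below; its exact definition in the tutorial
may differ):

.. code-block:: python

    import torch
    from torch.distributed.fsdp import MixedPrecision

    bfSixteen = MixedPrecision(
        param_dtype=torch.bfloat16,    # parameters are gathered and used in bf16
        reduce_dtype=torch.bfloat16,   # gradient reduce_scatter runs in bf16
        buffer_dtype=torch.bfloat16,   # buffers are kept in bf16
    )
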
In 2.4 we just add the relevant mixed precision policy to the FSDP wrapper:

@@ -604,9 +619,9 @@ CPU-based initialization:
        auto_wrap_policy=t5_auto_wrap_policy,
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device())
Sharding Strategy
-----------------
FSDP sharding strategy by default is set to fully shard the model parameters,
@@ -627,7 +642,7 @@ instead of "ShardingStrategy.FULL_SHARD" to the FSDP initialization as follows:
        sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)  # ZERO2
This will reduce the communication overhead in FSDP; in this case, it holds full
parameters after forward and through the backwards pass.
This saves an all_gather during backwards so there is less communication at the
cost of a higher memory footprint. Note that full model params are freed at the
@@ -652,12 +667,12 @@ wrapper in 2.4 as follows:
        mixed_precision=bfSixteen,
        device_id=torch.cuda.current_device(),
        backward_prefetch=BackwardPrefetch.BACKWARD_PRE)
`backward_prefetch` has two modes, `BACKWARD_PRE` and `BACKWARD_POST`.
`BACKWARD_POST` means that the next FSDP unit's params will not be requested
until the current FSDP unit processing is complete, thus minimizing memory
overhead. In some cases, using `BACKWARD_PRE` can increase model training speed
up to 2-10%, with even higher speed improvements noted for larger models.
Model Checkpoint Saving, by streaming to the Rank0 CPU
------------------------------------------------------
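
A minimal sketch of this technique (assuming `model` is the FSDP-wrapped model
and `rank` holds the global rank; the checkpoint filename is only illustrative):

.. code-block:: python

    import torch
    from torch.distributed.fsdp import (
        FullyShardedDataParallel as FSDP,
        FullStateDictConfig,
        StateDictType,
    )

    # Gather a full (unsharded) state_dict, offloaded to CPU and only materialized on rank 0.
    save_policy = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, save_policy):
        cpu_state = model.state_dict()

    if rank == 0:
        torch.save(cpu_state, "t5_checkpoint.pt")
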
@@ -696,7 +711,7 @@ Pytorch 1.12 and used HF T5 as the running example. Using the proper wrapping
policy especially for transformer models, along with mixed precision and
backward prefetch should speed up your training runs. Also, features such as
initializing the model on device, and checkpoint saving via streaming to CPU
should help to avoid OOM errors when dealing with large models.
We are actively working to add new features to FSDP for the next release. If
you have feedback, feature requests, questions or are encountering issues
