
[2D Parallelism] Tracking feasibility #9931

Open · 8 of 20 tasks

stas00 opened this issue Feb 1, 2021 · 2 comments

stas00 commented Feb 1, 2021

Background

ZeRO-DP (ZeRO Data Parallel) and PP (Pipeline Parallelism) each provide great memory savings across multiple GPUs. Each 1D approach allows for a much more efficient utilization of GPU memory, but it's still not enough for very big models - sometimes not even feasible with any existing hardware. e.g. a model whose parameters alone take 45GB (t5-11b) can't fit even on a 40GB GPU.

The next stage in Model Parallelism that can enable loading bigger models onto smaller hardware is 2D Parallelism. That's combining Pipeline Parallelism (PP) with ZeRO-DP.

3D Parallelism is possible too, and it requires adding a horizontal MP (à la Megatron-LM), but we don't quite have any way to implement that yet. We need to study Megatron-LM first, so we are starting with the relatively low-hanging fruit of 2D.


Tracking

We have 3 implementations that provide the required components to build 2D Parallelism:

  1. DeepSpeed (DS)
  2. FairScale (FS)
  3. PyTorch (native) (PT)

and the purpose of this issue is to track the feasibility/status/inter-operability of each of them, and also which parts have been back-ported to PyTorch core.

It also tracks where transformers models stand with regard to the above 3 implementations.

The 2 main questions are:

  1. native 2D: how do we integrate a native PP with a native ZeRO-DP (sharded)? (e.g. can fairscale PP work with fairscale ZeRO-DP?)
  2. inter-operability 2D: is there a chance one implementation's PP/ZeRO-DP could work with one or both of the others' ZeRO-DP/PP? (e.g. can fairscale PP work with DeepSpeed ZeRO-DP?)

Notes

  • MPU = Model Parallel Unit - a little helper module that lets each 1D know which GPU groups it can use for PP, which for MP and which for DP, so that one 1D doesn't interfere with another. e.g. in the case of 4 GPUs and PP+DP, one may want:

          pp
    dp0 [0, 1]
    dp1 [2, 3] 
    

    So here there are 2 pipelines: 0-1 and 2-3, and DP sees GPUs 0 and 2 as the entry points (see the process-group sketch right after this list).
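
For illustration, here is a minimal sketch of how such groups could be built with plain torch.distributed for the 4-GPU layout above; the helper name build_2d_groups_4gpu and the hard-coded ranks are made up for this example, and this is not any particular library's MPU:

```python
import torch.distributed as dist

def build_2d_groups_4gpu():
    """Illustrative 2x2 layout: pipelines [0, 1] and [2, 3], DP across pipelines.

    Note: every rank must call new_group() for every group, even for groups it
    doesn't belong to, since group creation is a collective operation.
    """
    rank = dist.get_rank()
    pp_group = dp_group = None

    # pipeline-parallel groups: each pipeline spans 2 consecutive GPUs
    for ranks in ([0, 1], [2, 3]):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            pp_group = group

    # data-parallel groups: the ranks that hold the same pipeline stage
    for ranks in ([0, 2], [1, 3]):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            dp_group = group

    return pp_group, dp_group
```

An MPU then typically just exposes accessors over these groups (e.g. get_data_parallel_group() / get_pipeline_parallel_group()) so that each 1D engine communicates over its own group instead of the world group.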


TLDR

ZeRO-DP / PP inter-operability status

       DS    FS    PT
DS     ✔️
FS
PT

1. DeepSpeed

1D status:

2D native status:

  • ❓ native PP + ZeRO-DP - not tested yet, as it requires porting transformers to native PP first (a rough sketch of the DeepSpeed API involved follows below)
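
To make the DeepSpeed side concrete, here is a rough sketch of what combining its pipeline engine with a ZeRO config looks like; the stand-in layers, config values and 2-stage split are made up, the exact deepspeed.initialize signature may differ by version, and as far as I understand only ZeRO stage 1 combines with DeepSpeed's PP:

```python
# Rough sketch only - assumes the `deepspeed` launcher (or another distributed launch)
# and stand-in nn.Linear layers instead of a real transformers model.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

layers = [nn.Linear(1024, 1024) for _ in range(8)]       # stand-ins for transformer blocks
model = PipelineModule(layers=layers,
                       num_stages=2,                     # split the layer list into 2 pipeline stages
                       loss_fn=nn.MSELoss())             # placeholder loss

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,                    # micro-batches pushed through the pipe per step
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},                   # ZeRO-DP sharding of optimizer states
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
# engine.train_batch(data_iter) then drives the pipeline schedule over the micro-batches
```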

2D inter-operability status:

Important components:


2. FairScale

Just started gathering information on this one - will update once I have it.

1D status:

2D native status:

2D inter-operability status:

  • ❓ pytorch PP + fairscale ZeRO-DP - gathering info
  • ❓ DeepSpeed PP + fairscale ZeRO-DP - gathering info (a sketch of the fairscale ZeRO-DP pieces follows below)
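
While gathering that info, here is a minimal sketch of fairscale's ZeRO-DP building blocks on their own (OSS optimizer-state sharding wrapped by ShardedDataParallel); the placeholder model and hyper-parameters are made up, and whether this composes with an external PP engine is exactly the open question above:

```python
# Minimal sketch, assuming a standard torch.distributed launch (e.g. torchrun / the launch utility).
import torch
import torch.nn as nn
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

dist.init_process_group(backend="nccl")        # MASTER_ADDR / RANK / etc. come from the launcher
torch.cuda.set_device(dist.get_rank())         # single-node assumption for this sketch

model = nn.Linear(1024, 1024).cuda()           # placeholder model

# OSS shards the wrapped optimizer's state across the data-parallel ranks (ZeRO stage 1)
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)

# ShardedDDP reduces each gradient only to the rank that owns its optimizer shard (~stage 2)
model = ShardedDDP(model, optimizer)

# training then proceeds as usual: forward, loss.backward(), optimizer.step()
```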

Important components:


3. PyTorch

From what I understand, pytorch has been integrating primarily the fairscale versions of these components into its core.
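
For example, fairscale's Pipe (itself derived from torchgpipe) was upstreamed as torch.distributed.pipeline.sync.Pipe around PyTorch 1.8; here is a minimal single-host, 2-GPU sketch of that API with made-up shapes:

```python
# Minimal sketch of the upstreamed Pipe API (single process, 2 GPUs); shapes are made up.
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe relies on the RPC framework even in the single-process case
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024)).to("cuda:1")

model = Pipe(nn.Sequential(stage0, stage1), chunks=8)     # 8 micro-batches per mini-batch

out_rref = model(torch.randn(32, 1024, device="cuda:0"))  # forward returns an RRef
out = out_rref.local_value()                               # the actual tensor lives on cuda:1
```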

1D status:

2D native status:

  • ❕ native PP + ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)

2D inter-operability status:

  • ❕ DeepSpeed PP + PyTorch ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)
  • ❕ fairscale PP + PyTorch ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)

Important components:

  • MPU: ?

Ported components:

Issues to track:


Transformers

To make 2D Parallelism work we of course need to support all these stages in transformers, so here is the status of what we have working or what is a work in progress. Some components (like bart-mp) work but are unmerged, since we are still unsure how to move forward project-wide.

LifeIsStrange commented Mar 13, 2021

ZeRO-3 has recently been announced:
https://news.ycombinator.com/item?id=26447018

ZeRO-3 Offload goes beyond the state-of-the-art hybrid 3D-parallelism (data, model and pipeline parallelism combined). While 3D Parallelism is limited by the aggregate GPU memory, ZeRO-3 Offload can exploit both GPU and CPU memory, the latter of which is much larger and cheaper compared to GPU memory. This allows ZeRO-3 Offload to train larger model sizes with the given GPU and CPU resources than any other currently available technology.

stas00 commented Mar 14, 2021

Thank you for the heads up, @LifeIsStrange

This particular issue collects notes on something quite orthogonal to ZeRO-3; see #9766 for a more suitable discussion.

And yes, we are working on integrating ZeRO-3 from fairscale and DeepSpeed into transformers. There are still some rough edges, but hopefully it'll be ready really soon now.
