
[2D Parallelism] Tracking feasibility #9931

Open · 8 of 20 tasks

stas00 opened this issue Feb 1, 2021 · 2 comments

stas00 commented Feb 1, 2021

Background

ZeRO-DP (ZeRO Data Parallel) and PP (Pipeline Parallelism) each provide great memory savings across multiple GPUs. Each 1D approach allows for a much more efficient utilization of GPU memory, but it's still not enough for very big models - sometimes not even feasible with any existing hardware. e.g. a model whose parameters alone take 45GB (t5-11b) can't fit even on a 40GB GPU.

The next stage in Model Parallelism that can enable loading bigger models onto smaller hardware is 2D Parallelism. That's combining Pipeline Parallelism (PP) with ZeRO-DP.

3D Parallelism is possible too, and it requires adding a horizontal MP (à la Megatron-LM), but we don't quite have any way to implement that yet. We need to study Megatron-LM first, so we are starting with the relatively low-hanging fruit of 2D.


Tracking

We have 3 implementations that provide the required components to build 2D Parallelism:

  1. DeepSpeed (DS)
  2. FairScale (FS)
  3. PyTorch (native) (PT)

and the purpose of this issue is to track the feasibility/status/inter-operability of each of them, and also which parts have been back-ported to PyTorch core.

It also tracks where transformers models stand with regard to the above 3 implementations.

The 2 main questions are:

  1. native 2D: how do we integrate a native PP with a native ZeRO-DP (sharded)? (e.g. can fairscale PP work with fairscale ZeRO-DP?)
  2. inter-operability 2D: is there a chance one implementation's PP/ZeRO-DP could work with one or both of the others' ZeRO-DP/PP? (e.g. can fairscale PP work with DeepSpeed ZeRO-DP?)

Notes

  • MPU = Model Parallel Unit - a little helper module that lets each 1D know which GPU groups it can use for PP, which for MP and which for DP, so that one 1D doesn't interfere with another. e.g. in the case of 4 GPUs and PP+DP, one may want:

          pp
    dp0 [0, 1]
    dp1 [2, 3] 
    

    So here there are 2 pipelines: 0-1 and 2-3, and DP sees GPUs 0 and 2 as the entry points (see the process-group sketch right after this list).
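
For illustration, here is a minimal sketch of how such groups could be built with plain torch.distributed for the 4-GPU layout above; the helper name build_2d_groups_4gpu and the hard-coded ranks are made up for this example, and this is not any particular library's MPU:

```python
import torch.distributed as dist

def build_2d_groups_4gpu():
    """Illustrative 2x2 layout: pipelines [0, 1] and [2, 3], DP across pipelines.

    Note: every rank must call new_group() for every group, even for groups it
    doesn't belong to, since group creation is a collective operation.
    """
    rank = dist.get_rank()
    pp_group = dp_group = None

    # pipeline-parallel groups: each pipeline spans 2 consecutive GPUs
    for ranks in ([0, 1], [2, 3]):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            pp_group = group

    # data-parallel groups: the ranks that hold the same pipeline stage
    for ranks in ([0, 2], [1, 3]):
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            dp_group = group

    return pp_group, dp_group
```

An MPU then typically just exposes accessors over these groups (e.g. get_data_parallel_group() / get_pipeline_parallel_group()) so that each 1D engine communicates over its own group instead of the world group.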


TLDR

ZeRO-DP / PP inter-operability status

       DS    FS    PT
DS     ✔️
FS
PT

1. DeepSpeed

1D status:

2D native status:

  • ❓ native PP + ZeRO-DP - not tested yet, as it requires porting transformers to native PP first (a rough sketch of the DeepSpeed API involved follows below)
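
To make the DeepSpeed side concrete, here is a rough sketch of what combining its pipeline engine with a ZeRO config looks like; the stand-in layers, config values and 2-stage split are made up, the exact deepspeed.initialize signature may differ by version, and as far as I understand only ZeRO stage 1 combines with DeepSpeed's PP:

```python
# Rough sketch only - assumes the `deepspeed` launcher (or another distributed launch)
# and stand-in nn.Linear layers instead of a real transformers model.
import torch.nn as nn
import deepspeed
from deepspeed.pipe import PipelineModule

deepspeed.init_distributed()

layers = [nn.Linear(1024, 1024) for _ in range(8)]       # stand-ins for transformer blocks
model = PipelineModule(layers=layers,
                       num_stages=2,                     # split the layer list into 2 pipeline stages
                       loss_fn=nn.MSELoss())             # placeholder loss

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,                    # micro-batches pushed through the pipe per step
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},                   # ZeRO-DP sharding of optimizer states
}

engine, _, _, _ = deepspeed.initialize(model=model,
                                       model_parameters=model.parameters(),
                                       config=ds_config)
# engine.train_batch(data_iter) then drives the pipeline schedule over the micro-batches
```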

2D inter-operability status:

Important components:


2. FairScale

Just started gathering information on this one - will update once I have it.

1D status:

2D native status:

2D inter-operability status:

  • ❓ pytorch PP + fairscale ZeRO-DP - gathering info
  • ❓ DeepSpeed PP + fairscale ZeRO-DP - gathering info (a sketch of the fairscale ZeRO-DP pieces follows below)
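
While gathering that info, here is a minimal sketch of fairscale's ZeRO-DP building blocks on their own (OSS optimizer-state sharding wrapped by ShardedDataParallel); the placeholder model and hyper-parameters are made up, and whether this composes with an external PP engine is exactly the open question above:

```python
# Minimal sketch, assuming a standard torch.distributed launch (e.g. torchrun / the launch utility).
import torch
import torch.nn as nn
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel as ShardedDDP

dist.init_process_group(backend="nccl")        # MASTER_ADDR / RANK / etc. come from the launcher
torch.cuda.set_device(dist.get_rank())         # single-node assumption for this sketch

model = nn.Linear(1024, 1024).cuda()           # placeholder model

# OSS shards the wrapped optimizer's state across the data-parallel ranks (ZeRO stage 1)
optimizer = OSS(params=model.parameters(), optim=torch.optim.Adam, lr=1e-4)

# ShardedDDP reduces each gradient only to the rank that owns its optimizer shard (~stage 2)
model = ShardedDDP(model, optimizer)

# training then proceeds as usual: forward, loss.backward(), optimizer.step()
```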

Important components:


3. PyTorch

From what I understand, pytorch has been integrating primarily the fairscale versions of these components into its core.
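
For example, fairscale's Pipe (itself derived from torchgpipe) was upstreamed as torch.distributed.pipeline.sync.Pipe around PyTorch 1.8; here is a minimal single-host, 2-GPU sketch of that API with made-up shapes:

```python
# Minimal sketch of the upstreamed Pipe API (single process, 2 GPUs); shapes are made up.
import os
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe relies on the RPC framework even in the single-process case
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
rpc.init_rpc("worker", rank=0, world_size=1)

stage0 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(1024, 1024)).to("cuda:1")

model = Pipe(nn.Sequential(stage0, stage1), chunks=8)     # 8 micro-batches per mini-batch

out_rref = model(torch.randn(32, 1024, device="cuda:0"))  # forward returns an RRef
out = out_rref.local_value()                               # the actual tensor lives on cuda:1
```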

1D status:

2D native status:

  • ❕ native PP + ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)

2D inter-operability status:

  • ❕ DeepSpeed PP + PyTorch ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)
  • ❕ fairscale PP + PyTorch ZeRO-DP (PyTorch ZeRO-DP doesn't exist yet)

Important components:

  • MPU: ?

Ported components:

Issues to track:


Transformers

To make 2D Parallelism work we of course need to support all these stages in transformers, so here is the status of what we have working or what is a work in progress. Some components (like bart-mp) work but are unmerged, since we are still unsure how to move forward project-wide.

LifeIsStrange commented Mar 13, 2021

ZeRO-3 has recently been announced:
https://news.ycombinator.com/item?id=26447018

ZeRO-3 Offload goes beyond the state-of-the-art hybrid 3D-parallelism (data, model and pipeline parallelism combined). While 3D Parallelism is limited by the aggregate GPU memory, ZeRO-3 Offload can exploit both GPU and CPU memory, the latter of which is much larger and cheaper compared to GPU memory. This allows ZeRO-3 Offload to train larger model sizes with the given GPU and CPU resources than any other currently available technology.

stas00 commented Mar 14, 2021

Thank you for the heads up, @LifeIsStrange

This particular issue collects notes on something quite orthogonal to ZeRO-3; see #9766 for a more suitable discussion.

And yes, we are working on integrating ZeRO-3 from fairscale and DeepSpeed into transformers. There are still some rough edges, but hopefully it'll be ready really soon now.
