[2D Parallelism] Tracking feasibility #9931
Background
ZeRO-DP (ZeRO Data Parallelism) and PP (Pipeline Parallelism) each provide great memory savings over multiple GPUs. Each 1D technique on its own allows for a much more efficient utilization of gpu memory, but it's still not enough for very big models - sometimes not even feasible on any existing hardware. e.g. a model whose parameters alone take ~45GB (t5-11b) can't fit even on a 40GB GPU.
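A rough back-of-the-envelope check of that number (an illustrative calculation only, assuming ~11B parameters for t5-11b and fp32 weights at 4 bytes per parameter):

```python
# rough size of t5-11b's weights alone, assuming fp32 (4 bytes/param)
params = 11e9          # ~11B parameters
bytes_per_param = 4    # fp32
size_gib = params * bytes_per_param / 2**30
print(f"~{size_gib:.0f} GiB")  # ~41 GiB - already too big for a single 40GB gpu,
                               # before counting gradients, optimizer states and activations
```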
The next stage in Model Parallelism that can enable loading bigger models onto smaller hardware is 2D Parallelism. That's combining Pipeline Parallelism (PP) with ZeRO-DP.
Tracking
We have 3 implementations that provide the required components to build 2D Parallelism:

1. DeepSpeed
2. FairScale
3. PyTorch

The purpose of this issue is to track the feasibility/status/inter-operability of each one of them, and also which parts have been back-ported to PyTorch core. Plus it tracks the status of where transformers models are at with regards to the above 3 implementations.
The 2 main questions are:
Notes
3D Parallelism is possible too; it requires adding a horizontal MP (a la Megatron-LM), but we don't quite have any way to implement that yet and need to study Megatron-LM first. So we're starting with the relatively low-hanging fruit of 2D.
MPU = Model Parallel Unit - a little helper module that tells each 1D which gpu groups it can use for PP, which for MP, and which for DP, so that one 1D doesn't interfere with another 1D. e.g. in the case of 4 gpus and PP+DP, one may want:

- gpus 0+1 forming one pipeline (PP)
- gpus 2+3 forming another pipeline (PP)
- gpus 0+2 (and 1+3) forming the DP groups

So here there are 2 pipelines: 0-1, and 2-3, and DP sees gpus 0 and 2 as the entry points.
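As an illustration only (not the actual MPU API of DeepSpeed, FairScale or PyTorch), here is a minimal sketch of how such groups could be carved out with vanilla torch.distributed for the 4-gpu PP+DP layout above:

```python
import torch.distributed as dist

def build_2d_groups(world_size=4, pipeline_len=2):
    """Split 4 gpus into 2 pipelines (0-1 and 2-3) and DP groups across them (0-2 and 1-3)."""
    # assumes dist.init_process_group() has already been called on every rank
    pp_ranks = [list(range(i, i + pipeline_len)) for i in range(0, world_size, pipeline_len)]
    # DP connects the same pipeline stage across pipelines, e.g. the entry points 0 and 2
    dp_ranks = [[pipe[stage] for pipe in pp_ranks] for stage in range(pipeline_len)]

    # every rank must call new_group() for every group, even those it doesn't belong to
    pp_groups = [dist.new_group(ranks) for ranks in pp_ranks]
    dp_groups = [dist.new_group(ranks) for ranks in dp_ranks]
    return pp_groups, dp_groups
```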
TLDR
ZeRO-DP / PP inter-operability status
1. DeepSpeed
1D status:
2D native status:
2D inter-operability status:
Important components:
2. FairScale
Just started gathering information on this one - will update once I have it.
1D status:
2D native status:
2D inter-operability status:
Important components:
3. PyTorch
From what I understand, pytorch has been integrating primarily the fairscale version into its core.
1D status:
2D native status:
2D inter-operability status:
Important components:
Ported components:
Issues to track:
Transformers
To make 2D Parallelism work we of course need to support all these stages in `transformers`, so here is the status of what we have working or what is a work in progress. Some components (like bart-mp) work but are unmerged since we are still unsure how to move forward project-wide.

ZeRO-DP
Naive vertical MP (aka PP w/ a single stage) - see the sketch after this list
Pytorch PP
Horizontal MP - unresearched!
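For the "naive vertical MP" item above, a minimal sketch using the experimental `parallelize()` API that some models (t5, gpt2) expose - the particular device_map here is just an illustrative assumption for a 2-gpu box:

```python
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("t5-3b")
# map block indices to gpus - t5-3b has 24 blocks, so split them in half (illustrative only)
device_map = {
    0: list(range(0, 12)),
    1: list(range(12, 24)),
}
model.parallelize(device_map)  # layers now live vertically across gpu 0 and gpu 1
# training/inference then proceeds as usual, but only one gpu is active at a time (naive PP)
```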