[wip] [pipeline parallel] t5 - experiment #2 #9940
Closed
The first attempt at t5/pp using the pytorch-nightly Pipe (#9765) was successful to a degree, but at the moment it can't be combined with any other parallel solutions.
All the examples of Pipeline conversion use trivial models that lend themselves easily to being converted to `Sequential`. `transformers` models, or at least `t5`, don't lend themselves easily to this transformation, due to complex intertwined logic and a huge number of variables passed around.

The main challenge: in order to build a Pipeline one needs to convert the module stack into a `Sequential` list. So in the case of t5, we need to convert this logic:
into
I think we don't need to Sequentialize any further beyond T5Block, but we will have to see down the road.
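To illustrate the transformation (with hypothetical toy modules, not the actual T5 code), the loop-over-blocks shape of `T5Stack` versus the flat `Sequential` shape that a pipeline needs might look roughly like this:

```python
import torch
from torch import nn

# Hypothetical stand-in for a T5Block; the real block also takes and
# returns attention masks, position biases, caches, etc., which is
# exactly what makes the conversion hard.
class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x):
        return torch.relu(self.lin(x))

# Today's shape: the stack loops over its blocks explicitly.
class LoopStack(nn.Module):
    def __init__(self, d, n):
        super().__init__()
        self.blocks = nn.ModuleList(Block(d) for _ in range(n))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# The shape Pipe needs: a flat nn.Sequential over the same blocks.
loop_stack = LoopStack(4, 3)
seq_stack = nn.Sequential(*loop_stack.blocks)

# With shared weights the two produce identical outputs.
x = torch.randn(2, 4)
assert torch.equal(loop_stack(x), seq_stack(x))
```

The rewrite is trivial when each block maps one tensor to one tensor; the difficulty in t5 is that the real blocks don't.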
Problems:
I did successfully implement a t5-pipeline version in #9765 that uses 2 shorter pipes, as it was natural to convert a loop over `T5Block`s to `Sequential`, and it now looks like this: using the pytorch pipe in a very painful way, overcoming problem n2. But it's doubtful this approach will work with any other 1D parallel solution (e.g. combining with Sharded DDP) - it definitely doesn't work with DeepSpeed ZeRO-DP.
But that implementation won't work with DeepSpeed pipeline - it has to be Sequential from the top-level. Not sure about fairscale yet.
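To make the top-level requirement concrete (toy layers, not the real t5 modules): the #9765 approach produced two separate shorter `Sequential`s, whereas DeepSpeed's pipeline engine expects one flat top-level `Sequential` it can partition, roughly:

```python
import torch
from torch import nn

# Two shorter pipes, as in the first attempt: one Sequential per part.
encoder = nn.Sequential(nn.Linear(4, 4), nn.ReLU())
decoder = nn.Sequential(nn.Linear(4, 4), nn.ReLU())

# What a top-level pipeline needs instead: a single flat Sequential
# spanning the whole model, so the engine can split it into stages.
full = nn.Sequential(*encoder, *decoder)

x = torch.ones(2, 4)
assert full(x).shape == (2, 4)
```

Flattening is easy here only because the toy stages already have matching single-tensor interfaces; for t5 that flattening is the whole problem.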
So I'm trying again, this time starting by just trying to Sequentialize the layers while overcoming problem n1.
If you do look at the code, please ignore everything in the diff but `modeling_t5.py`. (I also removed a lot of the model parallel code, as it is getting in the way and it won't be needed if we figure out the pipe - since `pipe(chunks=1) == naive vertical MP`, we get all the complex things that MP currently does for free. But we have to do even more complicated things instead. Naive vertical MP appears to be trivial compared to the changes required to make pipe work.)

You can see the process of conversion in this PR. I Sequentialized the `T5Block`-loop in `T5Stack`; now I need to continue breaking up the structure upstream. At this stage there is no Pipe in the code - the first main difficulty is to Sequentialize the layers.
If you want to see just how I converted the `T5Block`-loop into `Sequential`, it is this commit - might be easier to see: 4c0ea52. The input/output of each stage have to be the same because `Sequential` sends the output of one stage to the input of the next.

If you have some brilliant ideas that I'm perhaps missing on how to easily Sequentialize the t5 layers, I'm all ears.
@patrickvonplaten, @sgugger, @LysandreJik