
[wip] [pipeline parallel] t5 - experiment #2 #9940

Closed · wants to merge 5 commits

Conversation

@stas00 (Contributor) commented Feb 2, 2021

The first attempt at t5/pp using the pytorch-nightly Pipe (#9765) was successful to a degree, but at the moment it can't be combined with any other parallelization solutions.

All the examples of Pipeline conversion use trivial models, or models that lend themselves easily to being converted to Sequential. transformers models, or at least t5, don't lend themselves easily to this transformation, due to the complex intertwined logic and the huge number of variables passed around.

The main challenge: in order to build a Pipeline, one needs to convert the Module stack into an nn.Sequential.

So in the case of t5, we need to convert this logic:

```
T5ForConditionalGeneration->
  logic
  T5Stack->
     logic
     loop(T5Block, T5Block, T5Block, ...) ->
     logic
  logic
  T5Stack->
     logic
     loop(T5Block, T5Block, T5Block, ...) ->
     logic
  logic
```

into

```
Pipe(
  Sequential(
    T5ForConditionalGeneration,
    T5ForConditionalGeneration_p1,
    T5Stack,
    T5Stack_p1,
    T5Block,
    T5Block,
    T5Block,
    ...
    T5Stack_p2,
    T5ForConditionalGeneration_p2,
    T5Stack,
    T5Stack_p1,
    T5Block,
    T5Block,
    T5Block,
    ...
    T5Stack_p2,
    T5ForConditionalGeneration_p3,
  )
)
```

I think we don't need to Sequentialize any further beyond T5Block, but we will have to see down the road.
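For reference, here is a minimal, generic sketch (not t5-specific; the names are illustrative) of what the end result of such a conversion looks like with the 1.8-era pytorch Pipe API:

```python
# A flat nn.Sequential with each stage placed on its own device is what
# Pipe expects; this sketch assumes 2 GPUs are available.
import torch
import torch.nn as nn
from torch.distributed import rpc
from torch.distributed.pipeline.sync import Pipe

# Pipe uses RRefs internally, so RPC must be initialized even for a single process
rpc.init_rpc("worker", rank=0, world_size=1)

stage0 = nn.Linear(512, 512).to("cuda:0")
stage1 = nn.Linear(512, 512).to("cuda:1")

# chunks=4 splits each batch into 4 micro-batches along dim 0
model = Pipe(nn.Sequential(stage0, stage1), chunks=4)

x = torch.randn(8, 512, device="cuda:0")
out = model(x).local_value()  # forward returns an RRef to the output
```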

Problems:

  1. We can't change the structure of the model, because of the pre-trained weights.
  2. The inputs/outputs are very complicated, because the entry into the Pipeline (the first and last stages) can only be a tuple of plain Tensors.
  3. Besides being required to be Tensors, the inputs/outputs have to expose the batch dimension as dimension 0, since Pipe slices all inputs and restores all outputs along that dimension on the way into/out of forward (but only at the very first and last stages of the sequence). See the sketch after this list.
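To make problems 2 and 3 concrete, a minimal, hedged sketch (the variable names and shapes are mine, not the PR's):

```python
# Everything crossing the pipeline boundary must be a Tensor whose dim 0
# is the batch dimension, because Pipe micro-batches by slicing dim 0
# and re-concatenating the outputs.
import torch

batch_size, seq_len, d_model = 4, 128, 512
hidden_states = torch.randn(batch_size, seq_len, d_model)  # ok: dim 0 = batch
attention_mask = torch.ones(batch_size, seq_len)           # ok: dim 0 = batch

# not ok to pass through the pipe: a plain python value
use_cache = True
# workaround style: carry it as a batch-first tensor instead
use_cache_t = torch.full((batch_size, 1), int(use_cache))
```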

I did successfully implement a t5-pipeline version in #9765 that uses 2 shorter pipes, since it was natural to convert the loop over T5Blocks to Sequential. It now looks like this:

```
T5ForConditionalGeneration->
  logic
  T5Stack-> Pipe(Sequential(T5Block, T5Block, T5Block))
  logic
  T5Stack-> Pipe(Sequential(T5Block, T5Block, T5Block))
  logic
```

using the pytorch Pipe in a very painful way, overcoming problem 2. But it's doubtful this approach will work with any other 1D parallelism (e.g. combining with Sharded DDP), and it definitely doesn't work with DeepSpeed ZeRO-DP.

And that implementation won't work with the DeepSpeed pipeline, which requires the model to be Sequential from the top level. Not sure about fairscale yet.

So I'm trying again, this time starting by just trying to Sequentialize the layers while overcoming problem 1.

If you do look at the code, please ignore everything in the diff except modeling_t5.py. (I removed a lot of the model parallel code, since it was getting in the way and it won't be needed if we figure out the pipe: pipe(chunks=1) == naive vertical MP, so we get all the complex things MP currently does for free. But we have to do even more complicated things instead; naive vertical MP appears to be trivial compared to the changes required to make pipe work.)
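To make the pipe(chunks=1) == naive vertical MP equivalence concrete, here is a sketch of naive vertical MP (illustrative, not the removed model parallel code):

```python
# Run stage k on device k and move the whole batch between devices:
# exactly what Pipe does with chunks=1, i.e. no micro-batch overlap,
# so only one GPU is busy at any moment.
import torch.nn as nn

class NaiveVerticalMP(nn.Module):
    def __init__(self, stages, devices):
        super().__init__()
        self.stages = nn.ModuleList(s.to(d) for s, d in zip(stages, devices))
        self.devices = list(devices)

    def forward(self, x):
        for stage, device in zip(self.stages, self.devices):
            x = stage(x.to(device))  # full batch, sequential over devices
        return x
```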

You can see the process of conversion in this PR. So far I have Sequentialized:

  1. the T5Block loop
  2. the 2nd half of T5Stack

Now I need to continue breaking up the structure upstream. At this stage there is no Pipe in the code; the first main difficulty is to Sequentialize the layers.

If you want to see just how I converted the T5Block loop into Sequential, it is this commit, which might be easier to follow: 4c0ea52. The inputs/outputs have to have the same layout, because Sequential feeds the output of one stage as the input of the next.
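A hedged sketch of that idea (the wrapper name and the exact tuple layout are illustrative; see commit 4c0ea52 for the real change):

```python
# Wrap each T5Block so that input and output are the same tuple,
# which is what chaining through nn.Sequential requires.
import torch.nn as nn

class T5BlockPipe(nn.Module):  # hypothetical wrapper name
    def __init__(self, block):
        super().__init__()
        self.block = block

    def forward(self, inputs):
        hidden_states, attention_mask, position_bias = inputs
        outputs = self.block(
            hidden_states,
            attention_mask=attention_mask,
            position_bias=position_bias,
        )
        # re-pack with an identical layout so the next stage can unpack it
        return (outputs[0], attention_mask, position_bias)

# before: a python loop      -> for block in blocks: hidden_states = block(...)[0]
# after:  a Sequential chain -> nn.Sequential(*(T5BlockPipe(b) for b in blocks))
```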

If you have some brilliant ideas that I'm perhaps missing on how to easily Sequentialize the t5 layers, I'm all ears.

@patrickvonplaten, @sgugger, @LysandreJik

@github-actions bot commented Mar 6, 2021

This issue has been automatically marked as stale and been closed because it has not had recent activity. Thank you for your contributions.

If you think this still needs to be addressed please comment on this thread.

github-actions bot closed this Mar 6, 2021
@stas00 (Contributor, Author) commented Mar 6, 2021

go away bad bot

stas00 reopened this Mar 6, 2021
stas00 added the Feature request and WIP labels and removed the Feature request and wontfix labels Mar 6, 2021
@stas00 (Contributor, Author) commented Jun 4, 2021

too long. closing.
