[wip] [Pipe] supporting None and non-Tensors in forward's input/output #50693
Conversation
Thank you very much for tagging the right people, @blefaudeux!
Here is a related discussion specific to DeepSpeed PP: microsoft/DeepSpeed#659
I really like this proposal @stas00! As per my understanding
I'm not sure I understood this correctly - why does the pipe need to return
This requirement doesn't seem to be covered in your proposal?
Thank you for your validation, @pritamdamania87
That's exactly right. It may contain tensors too, so this
Indeed. I think it'll have to fit into that.

Basically, currently I'm dealing with a complex structure of batch-sized tensors inside a tuple inside another tuple, and the tensors aren't even of the same shape (2 different shapes) - that's why they are in tuples and not a tensor in the first place. So since it needs to be sliced for micro-batching, I don't think we have any choice here but to remap that structure to a normal tensor and then reconstruct it on the other side to what it originally was. I will have to convert it to 2 tensors so that I can match the sizes.

Mind you, we are converting what used to be a complex loop over a ModuleList with full control over what was passed to it.
I wonder if we could just have ... But I guess we would need to stick to a single dict variable and not all ... On the other hand, the func signature could even be the totally normal:

except with a restriction that non-keyword args will have to be a ...

I guess the biggest inconsistency then would be - how do you then return the same back? Python doesn't quite have a feature to do:

I guess it will have to be packed into a dict then:

If we do that, then, following up on my answer to your question about why bother returning the ... Thoughts?

This proposal is also backward compatible with the existing API, as it extends the functionality and should work just fine with the original, much more restrictive API.
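For illustration only, a stage under that dict-packing idea might look roughly like the sketch below (hypothetical names and signature, not an actual Pipe API): positional tensors would be micro-batched by the pipe, everything else would travel as keyword arguments, and the outputs would be packed into a single dict.

```python
import torch
from torch import nn

class Stage(nn.Module):
    # hypothetical stage: tensors arrive positionally, the rest as kwargs,
    # and everything goes back out packed into one dict
    def forward(self, hidden_states, attention_mask=None, use_cache=False):
        # real computation would go here; we just pass things through
        return {
            "hidden_states": hidden_states,
            "attention_mask": attention_mask,
            "use_cache": use_cache,
        }
```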
For posterity, to bypass that tensor-only restriction I tried to solve the data aggregation problem using a closure:
and passing it to the pipeline wrapper class, and I was getting everything messed up, until I decided to print out the order in which each stage + micro-batch gets executed and quickly (and obviously) saw that in the pipeline the order of execution is very unpredictable. In this printout the first number is the block id, the second is the micro-batch id - we have 6 blocks and 2 micro-batches.

So scratching that idea; I will try a different approach instead. I think I may sort out how to flatten those tuple-in-tuple structures into tensors and then restore them on the other side of the pipeline to what the application expects. I will post an update if and when I make it work.

Edit: omg, I made it work using a closure and slots, which ensured the correct insertion order. I track each block id and each micro-batch id:

and then I had to reconstruct it back to the correct tuple of tuples, manually merge the micro-batches, and switch all tensors to the same device. I'm yet to figure out how to go about doing it in a simpler way, but this is my first success. I made PP work with t5 and a lot of hardcoded hacks - next is cleaning it up and generalizing it!

I think the other, more complicated solution would be to prepare an empty tensor of the correct dimensions and then pass it to forward and back through each block and stage - again filling the right slots - but each of these is so difficult to think about. Perhaps after doing it a few times I will get the hang of it.
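As a rough sketch of that closure-and-slots idea (hypothetical code, not the author's actual implementation), the trick is to pre-allocate one slot per (block, micro-batch) pair so the pipeline's unpredictable execution order no longer matters:

```python
n_blocks, n_chunks = 6, 2
slots = [[None] * n_chunks for _ in range(n_blocks)]

def make_collector(block_id):
    """Return a callable a block can use to stash its per-micro-batch output."""
    def collect(micro_batch_id, value):
        slots[block_id][micro_batch_id] = value
    return collect

# after the pipe has run, slots[b] holds block b's outputs in micro-batch
# order and can be merged back into full-batch structures
```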
Thanks for all of the detailed comments @stas00! I was thinking if the following contracts made sense:
Putting this all together in an example might help. So let's say our original model is as follows:
We can represent this into a pipe as follows:
The high level idea here is that there is tight coupling between the first stage of the pipeline and Pipe's forward method, and a similar tight coupling between the return type of the last stage and the return type of Pipe's forward. Apart from that, we allow the rest of the stages to be defined arbitrarily by the user (as long as the signatures between subsequent stages are compatible).
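A toy illustration of that coupling (hypothetical modules, not the examples from the thread): the first stage's signature is what the pipe's caller must provide, and the last stage's return value is what the pipe's caller gets back.

```python
import torch
from torch import nn

class First(nn.Module):
    def forward(self, x):        # whatever this accepts is what you pass to the pipe
        return torch.relu(x)

class Middle(nn.Module):
    def forward(self, x):        # only needs to match the previous stage's output
        return x * 2

class Last(nn.Module):
    def forward(self, x):        # whatever this returns is what the pipe returns
        return x.sum(dim=-1)

stages = nn.Sequential(First(), Middle(), Last())
```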
Thank you for writing out that clear proposal and the examples. Overall, yes, to everything you said. wrt the 3rd point, it can be exactly as the originally proposed inputs signature of the first stage
Here is the concrete obstacle I've been trying to overcome wrt return values. I'm converting:
into:
So you can see that in the case I've been working on, the first stage of the pipe and all the other stages have to be the same - that's why here the last stage has to return the same output as the first stage, so it has to be the same as the inputs. Does this help?

At the moment I used a really complex solution of using a closure with a simple python 2D list with slots, which I fill out through the pipe, keeping track of the depth of the stack and the micro-batch ids to know where to insert the aggregate chunk, and then manually reconstructing the data on the exit from the pipe. Terrible, but it works.

And you're absolutely correct about intermediate stages not needing any restrictions - it's only the very first stage that has to behave in a very restricted way. I appreciate you clarifying this. I certainly missed this point and I think it made my life much more complicated. I need to sit and ponder some more and I will post back if I get to see the light.

I'm very grateful for your feedback and explanations, @pritamdamania87.
Thanks for providing the concrete example, helps a lot in understanding the problem! As per my understanding for the example you mentioned above, setting up the pipeline would look like:
Although, I'm not sure how aggregate would be appropriately accumulated in the pipeline. For example, let's assume there are two microbatches with inputs [0, 1] and [2, 3] respectively. Also, let's assume for simplicity that the block simply outputs the input it receives. Now the output and aggregate for microbatch 0 at the last stage of the pipeline is:
For microbatch 1 at the last stage it would be:
Now the pipe can return a combined output of [[0,1], [2,3]] by concatenating on the batch dimension, however we can't just return aggregate as is since the actual value that we want is 6 + 2 = 8. So don't we need to have some way of aggregating non-tensor values as well in this case?
The problem is that the aggregate is not a number but a tuple of tuples of tensors:

This is what goes in, and a similar structure with different contents needs to go out (well, a few of those). You can see that the first tuple is of size 6 - we have 6 stages in the pipeline. So each stage takes one of the 6 and leaves another of these on the way out. Then there is a tuple of size 4 - which are 4 different keys; again it uses all of these and leaves another set on the way out. Each tensor is of a different dimension - that's why I think they were forced to use tuples in the first place. Then you can see the actual tensors of batch size 3. Which is a problem, since the batch size to be sliced on is hidden deep inside. It's trivial when it's in the original loop and the batch size remains unchanged, but in this situation it's nuts.

As I said, I made it work using a closure to aggregate and a lot of composing and recomposing from tensors back to python structures. On the way in I invert this structure and stack it into a huge tensor, so that the batch dimension is first to be sliced on. Then on the other side I recompose it back to a tuple of tuples of tensors, by chunking it twice.

This is not code that can be left in production. I don't think it's efficient either. I think the whole thing needs to be rethought, but this is how most transformers models are written. Of course, they weren't written to lend themselves to an easy pipeline conversion.
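A much-simplified sketch of that stack-then-rechunk trick, assuming (unlike the real model) that all tensors share the same shape:

```python
import torch

# a tuple (one entry per stage) of tuples (one per key) of batch-size-3 tensors
nested = tuple(tuple(torch.randn(3, 8) for _ in range(4)) for _ in range(6))

# stack so the batch dimension comes first and can be sliced into micro-batches
flat = torch.stack([t for stage in nested for t in stage], dim=1)   # (3, 24, 8)

# on the other side of the pipe, chunk twice to rebuild the original structure
rebuilt = tuple(
    tuple(k.squeeze(1) for k in stage.chunk(4, dim=1))
    for stage in flat.chunk(6, dim=1)
)
```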
Thanks for providing a detailed example. I'm wondering if we could recursively enter a Tuple/List and slice the Tensors we find inside them? Today the Pipe API does this with one level of a Tuple, but I think we should be able to support arbitrarily nested Tensors too. Would this resolve the issue here, or is the batch dimension across Tensors in the Tuple of different sizes too?

Since the inputs here seem fairly complicated, I'm wondering if you could share some steps in terms of setting up and running this Transformer model. If I can run it locally and inspect the inputs myself, I'll probably have a much better idea of whether there is something we can do to support such models in the pipeline.

I have a very basic understanding of Transformer models, but I was wondering if the Transformer models could be rewritten to be similar to what we have in our sample benchmarks: https://github.com/pytorch/pytorch/blob/156da22566fc8c49065a6075a4e1352bf4de63d9/benchmarks/distributed/pipeline/pipe.py. Looking at https://github.com/huggingface/transformers/blob/357fb1c5d8b6a16f042f9b504f023d935086e8e5/src/transformers/models/t5/modeling_t5.py#L600-L612, it seems like that API is more functional, where things like hidden_states are passed into the forward function instead of being initialized as part of the Block.
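A helper in that spirit could look like the sketch below (hypothetical, not an existing Pipe API): recurse into tuples/lists, chunk any Tensor found along dim 0, and replicate non-Tensor leaves.

```python
import torch

def recursive_chunk(value, chunks):
    """Return one nested structure per micro-batch."""
    if torch.is_tensor(value):
        return value.chunk(chunks, dim=0)
    if isinstance(value, (tuple, list)):
        pieces = [recursive_chunk(v, chunks) for v in value]
        return [type(value)(p[i] for p in pieces) for i in range(chunks)]
    return [value] * chunks   # None, bools, ... are copied to every micro-batch
```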
@stas00 Another example of modularizing Transformers for pipelining can be found here: https://github.com/pytorch/fairseq/blob/master/fairseq/model_parallel/models/pipeline_parallel_transformer/model.py. I was wondering if this would work for your use case?
I think this would be super magical, yes, that would have saved so much trouble.
Yes, of course. It's very simple.
and now here is a very simple script I've been using to test:
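A minimal script of that shape, assuming the t5-small checkpoint and the stock transformers API (an illustrative stand-in, not the author's exact script):

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
outputs = model.generate(inputs.input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```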
Now the stack goes like this:

I omitted a few small extra layers, but this is the bulk of it. As you can see, the stacks lend themselves really well to a pipeline, so my initial attempt is not to convert the whole model to Pipe, but only the repeated blocks, resulting in 2 pipes. Currently the version I'm working on symbolically looks like:
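Roughly, the two-Pipe layout being described might look like this symbolically (hypothetical naming, for illustration only):

```python
# hypothetical sketch of the two-Pipe layout (illustration only):
#
#   encoder T5Stack:  embed -> Pipe(nn.Sequential(*encoder_t5_blocks)) -> final_layer_norm
#   decoder T5Stack:  embed -> Pipe(nn.Sequential(*decoder_t5_blocks)) -> final_layer_norm -> lm_head
```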
Please let me know if you have any questions or if I can help in any way to quickly understand it. I find that using a good debugger like PyCharm makes it much easier to quickly navigate through the stages and easily visualize the parameters.
Yes, this is what I have been using as the example, as you originally recommended I look into it. Thank you for all the other links, @pritamdamania87 - I will study them in the next few days.
@pritamdamania87, so what needs to happen for your adjusted proposal to become a reality? It would be awesome to have the main functionality working and well-tested before 1.8 is released. Also, there are a few more issues to discuss - should I open a separate issue about those? We need these 2 supported by DeepSpeed:
Thank you!
Hey @stas00, I've been tied up with a few things related to the 1.8 release and I'll get to this once that is resolved. Regarding the 1.8 release, we're releasing pipeline parallelism with the existing API. This is still a "beta" feature, though, so the enhancements proposed here can be included in 1.9.
Yes, it would be good to open separate issues for those.
Here you are, @pritamdamania87. I appreciate the update and will patiently wait till your other tasks are completed.
Pipe supports shared parameters by default as long as the partitions with the shared parameter are mapped to the same device:
What type of models are you using shared parameters in? So far, we've only seen shared parameters being used for the shared embedding table in Machine Translation models.
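Concretely, "mapped to the same device" means something like the following sketch (a hypothetical tied embedding / LM-head example, not taken from the thread):

```python
import torch
from torch import nn

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

embed = nn.Embedding(1000, 64).to(device)
lm_head = nn.Linear(64, 1000, bias=False).to(device)
lm_head.weight = embed.weight            # the shared parameter

# both partitions that touch the shared weight live on the same device
first_partition = nn.Sequential(embed)
last_partition = nn.Sequential(lm_head)
```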
Thank you for your follow-up, @msbaines - since I created a dedicated issue for this feature, let's continue there: #51931 (comment)
Looking at the simple script, it seems like we use only
I think you're thinking of ... and lots and lots happens between it and eventually ..., which in the case of
@stas00 Yes, you are right, I was thinking about

Btw, can you create a gh issue for this and we can continue the discussion there :) It's easier to keep track of gh issues instead of discussing this on a PR :)
Nvm, I was looking at only the first few log lines; it looks like the later T5Blocks have most of the input arguments populated.
Yes, the first time the model is called from

I'd be happy to open a dedicated issue - I'm just a bit lost on which specific topic? I know my PR turned into a small battlefield - in the sense that we discuss multiple things....
I guess creating an issue with your original PR summary would be a good idea :) Basically, the issue I'm interested in tracking is whether there is a nice way to incorporate these T5 blocks into a pipeline.
After spending some time poking around, it looks like the limitation of Tensor or Tuple of Tensors comes from the fact that PyTorch's autograd machinery has this limitation when it comes to checkpointing: https://github.com/pytorch/pytorch/blob/master/torch/csrc/autograd/python_function.cpp#L369. This is basically the error that you were seeing:
The restriction of Tensor/tuple of Tensors comes from the autograd machinery, since it is hard to support arbitrary datatypes there. I think I might have a workaround that could work here, though. This is what the input structure for the T5Block looks like when I ran your example locally:
This is slightly hacky, but I think it would work; the core idea is to represent everything as a flat tuple of Tensors.
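Sketching that core idea (hypothetical helpers, not the actual workaround from the thread): non-Tensor values get encoded into Tensors on the way in and decoded on the way out, so the pipe only ever sees a flat tuple of Tensors.

```python
import torch

def pack(args):
    """Encode a flat argument list (Tensors, bools, None) as a flat tuple of
    Tensors plus a spec needed to rebuild the original values."""
    flat, spec = [], []
    for a in args:
        if a is None:
            spec.append("none")
        elif isinstance(a, bool):
            spec.append("bool")
            flat.append(torch.tensor(a))
        elif torch.is_tensor(a):
            spec.append("tensor")
            flat.append(a)
        else:
            raise TypeError(f"unsupported type: {type(a)}")
    return tuple(flat), spec

def unpack(flat, spec):
    it, out = iter(flat), []
    for kind in spec:
        if kind == "none":
            out.append(None)
        elif kind == "bool":
            out.append(bool(next(it).item()))
        else:
            out.append(next(it))
    return tuple(out)
```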
Thank you for getting to the root of this limitation, @pritamdamania87! It helps a lot to know why that limitation is there in the first place! And thank you for proposing the hacky workaround.

I think I mentioned a few weeks ago that I did make it work with 2 pipelines using a slightly different type of hack: huggingface/transformers#9765. Except the outcome is very inefficient - I can't break 50% gpu util over 2 gpus with any chunk size. So it's kind of pointless at the moment; we might just as well use naive MP and have just one gpu working at a time (which is a terrible waste of resources).

The problem is that any of the hacks we may have to adopt leads to quite complex code. And as I mentioned elsewhere, sagemaker as of recently provides pipeline support for any model, not requiring it to be ...
If you still have some of this code, is it possible to share some steps to run it so I can try to run it locally to see what might be causing low gpu utilization?
It's fully documented in the PR huggingface/transformers#9765. I can't link directly to the headers, but you will find a copy-n-paste Setup followed by 2 Deployment versions - one via a custom script (ready to be run), though you can't really gauge performance with that one, and then via the HF trainer, where you can push lots of data and thus have enough runtime to measure utilization. The full run command is under the "Benchmarks" section - the baseline and then with pipeline enabled. Please let me know if you get stuck anywhere; I tried to make the reproduction an easy copy-n-paste.
Apologies, I neglected to do it sooner; here it is: #53952. Thank you, @pritamdamania87!
I was trying to convert `transformers` t5 to use the new Pipe and I encountered a dozen input args: some are Bools, some are optionally `None`, and yet others are tuples within tuples. You can see an example of the complex inputs it gets: https://github.com/huggingface/transformers/blob/357fb1c5d8b6a16f042f9b504f023d935086e8e5/src/transformers/models/t5/modeling_t5.py#L600-L612

Currently `Pipe` only handles `input`/`output` which is either a single Tensor or a tuple of Tensors, so we can't use it as it stands now.

A user needs to be able to pass as a part of the `input` and `output` tuple:

- `None` - in transformers these are passed to `forward` when the thing needs to be optionally generated by downstream layers, but is a Tensor at other times
- structures for which a recursive `to()` is needed - see below.

Bottom line, a whole bunch of variations of variables might need to be passed to and from the `forward` function in the pipe chain. As long as these structures can be traversed, switched `.to()` the right device, and sliced where this is needed, any type of variable should be supported. Please see below.

I made `microbatch.py` work with `None`s - this PR - but stumbled upon errors in the C++ implementation:

I'd love some help with this and also to figure out the Bools and other structures. I'm very open to a different way of resolving this issue as well.
Proposal

To summarize, I propose to change the user-side `forward` in `Pipe` to support the following requirements:

where the `input_to_slice` tuple:

- can be a `Tensor` or a tuple of `(Tensor|None)`.
- `None` must be supported in `input`, since a given input is `None` only sometimes, and normal input data at other times. So micro-batch slice if it's `not None`, and pass `None` otherwise.
- `to()` - basically, what we have now plus supporting any number of `None`s in the tuple.

and then add an optional `input_to_copy` tuple, which:

- may include `None`.
- gets `to()` applied (where `to()` is implemented) by recursively traversing the structure - a recursive_to sketch follows below; it may need to add support for objects which implement `.to()`.
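A sketch of what such a recursive_to helper could look like (hypothetical, not the function referenced above):

```python
import torch

def recursive_to(value, device):
    if hasattr(value, "to"):                       # Tensors and to()-capable objects
        return value.to(device)
    if isinstance(value, (tuple, list)):
        return type(value)(recursive_to(v, device) for v in value)
    if isinstance(value, dict):
        return {k: recursive_to(v, device) for k, v in value.items()}
    return value                                   # None, bools, ints, ... unchanged
```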
Each stage of the `Pipe` will receive:

- `self`
- `input_to_slice` switched to the right device
- `input_to_copy` switched to the right device

the `Pipe` will return:

- `input_to_slice` (updated to outputs)
- `input_to_copy` (updated to outputs)

The names are just to make the proposal clear - surely they should be named something else should the proposal be accepted.
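Put together, the user-side signature being proposed might look roughly like this (the unpacking into hidden_states/attention_mask/config is purely illustrative):

```python
from torch import nn

class MyStage(nn.Module):
    def forward(self, input_to_slice, input_to_copy):
        hidden_states, attention_mask = input_to_slice   # micro-batched, moved to this device
        config = input_to_copy                           # moved to this device, not sliced
        # ... real computation goes here ...
        return (hidden_states, attention_mask), config
```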
Thank you!

p.s. If I'm not mistaken, pytorch `Pipe` is derived from/modelled after FairScale's, and there is also DeepSpeed's implementation, so let me know if I should work it out with one of the above first and then sync with pytorch?

Tagging @blefaudeux from FairScale and @ShadenSmith from DeepSpeed - in case you have some insight - and since we need all 3 sets of APIs to agree, I believe. If it's someone else at your org that is in charge of that domain, please kindly tag them instead. Thank you!

Tagging @pritamdamania87, who originally suggested I use Pipe here.