[model parallelism] Bart goes parallel #9384
Conversation
That looks great! Model parallelism would be very nice for Bart. We should coordinate here a bit with all the open PRs. I'm also more or less done with the big "split-bart-into-separate-models" PR: #9343. I'd propose the following:
It's very ready, functionality/concept-wise. It's not 100% ready commentary- and debug-trace-wise, but that's unimportant until the rest is sorted out, since there are multiple overlapping PRs happening. Because of the holidays there is a lot of new code that is all inter-dependent and unreviewed, and then there is the huge change of #9343 to merge. So I think it's best to review it as it is, sort things out, and then once everybody is happy with the logic and #9343 is in, I will most likely have to do a new PR anyway. But I need your feedback on whether what I did is correct. Think of it as an ongoing code design and generalization PR. Thanks.
@patrickvonplaten, your plan works for me. Please ping me when #9343 is in and you're happy with the outcome, so that it'll be safe to add MP without creating conflicts. Thank you. But as I commented above, this blocking event doesn't have to interfere with this PR's review - we are just not going to merge it; it should proceed as normal, and I will carry all the agreed-upon changes over to the new PR once the dust around the Bart split settles.
@stas00 @LysandreJik @patrickvonplaten This PR introduces a
Has everyone read this comment? Are we all on board with the plan to generalize model parallelism? We don't have to implement it now, but we need to make sure we've thought through any changes that affect user experience and backward compatibility. Sorry, I'm in the middle of moving so I'm not keeping close track of all the traffic and could easily have missed something. Also, this content is spread across several PRs, so sometimes I get confused.
@alexorona, I'm basically leaving all the old code in place, so gpt2 works as is and t5 works as is; this PR only impacts Bart. In any case it doesn't look like this PR will be merged, since Bart went through a split (#9343) which isn't finalized yet, and I will need to re-do it anyway. But it's no problem, since I know what to do. And see the end of this comment - the whole MP implementation might need to be redesigned altogether. Since there are so many moving parts, it's very difficult to manage things, and it definitely makes things difficult for reviewers. So my intention was to merge each of the new things separately, while keeping the old code working, and then to start integrating things bit by bit. The holidays made things pile up, but since the HF team is back I trust we will form a plan in the next few days. Important notes:
Until then we have the deepspeed integration almost ready and
From what I understand, model parallelism as it's currently implemented is a naive implementation of what it's supposed to do: offer more memory, so that bigger models may be trained using the memory of several devices instead of a single device. It is indeed inefficient, as some devices sit idle while others compute, so there's definitely a way of making it more efficient. @stas00, @alexorona, if you could walk us through what you have learned so that @patrickvonplaten, @sgugger and myself can understand the different options available, that would be great. Some thoughts to structure our approach towards MP:
@LysandreJik No, it's not a naïve implementation of model parallelism. In addition to data parallelism and model parallelism, there is pipeline parallelism, which is the next level of complexity, along with zero redundancy. Model parallelism allows us to train bigger models on GPU. Pipeline parallelism + model parallelism would allow us to train these large models faster, because the GPUs are not idle. I really think the next step is to make sure model parallelism is generalized and to rely on a library -- probably deepspeed -- to implement pipeline parallelism and zero redundancy. deepspeed has something called 3D parallelism, which I believe is a form of pipeline parallelism. @stas00 is that correct? From my understanding, deepspeed has three major enhancements:
Practical feature implications: We can currently train t5-11b -- I believe the largest model in the library -- in a practical and affordable amount of time on the newest cloud instances. There are three benefits to pursuing pipeline parallelism and zero redundancy libraries:
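To ground the terminology being discussed, here is a minimal, illustrative sketch (not this PR's code) of plain vertical model parallelism: the layer stack is split across two GPUs and activations are copied between them, which is why one GPU idles while the other computes.

```python
# Naive vertical model parallelism: put the first half of the layers on one
# device and the second half on another, moving the hidden states across the
# device boundary in forward(). Only one GPU is busy at any given moment.
import torch
from torch import nn

class TwoDeviceMLP(nn.Module):
    def __init__(self, dim=1024, n_layers=8, dev0="cuda:0", dev1="cuda:1"):
        super().__init__()
        self.dev0, self.dev1 = dev0, dev1
        half = n_layers // 2
        self.part0 = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(half)]).to(dev0)
        self.part1 = nn.Sequential(*[nn.Linear(dim, dim) for _ in range(n_layers - half)]).to(dev1)

    def forward(self, x):
        x = self.part0(x.to(self.dev0))
        x = self.part1(x.to(self.dev1))  # copy activations to the second GPU
        return x

if torch.cuda.device_count() >= 2:
    model = TwoDeviceMLP()
    out = model(torch.randn(4, 1024))
    print(out.device)  # cuda:1
```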
Some notes following up on the raised issues:
My plan is to finish the DeepSpeed integration - almost there - and then look into pipelines next. Of course, nobody needs to wait for me; I'd be just as happy for others to experiment and teach me instead ;) I commented on the current design so that the HF team better understands what we have here:
I rebased on #9343 so now it's no longer possible to develop anything on Bart - the check fails because it wants all copy-cats to be the same:
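For readers unfamiliar with that check, a rough illustration of the mechanism (the actual failing output isn't reproduced here; the class shown is just an example of the convention):

```python
# In the derived models (e.g. modeling_mbart.py) many classes carry a marker
# like the one below. utils/check_copies.py then enforces that their body stays
# identical to the Bart original, and `make fix-copies` regenerates them from
# Bart - so any MP change made to Bart has to be propagated to every copy-cat.
from torch import nn

# Copied from transformers.models.bart.modeling_bart.BartEncoderLayer with Bart->MBart
class MBartEncoderLayer(nn.Module):
    ...
```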
How do I move forward with my work then? I suppose the only way to proceed is to drop Bart and use one of the derivatives? So Bart isn't going MP...
That's also why we should pause the BART PR for MP and make sure the general API is solid enough. Any change in BART will impact all related models (that was true before the split, since the other models were subclasses), so the same PR will need to do BART/Pegasus/mBART/Marian etc., and probably the seq2seq template. So better to make sure we're happy with the design on a model independent from the others, like GPT-2 or T5, first :-)
I agree with @sgugger that it would be better to just work on T5 and GPT2 until we have a solid API for now... But in general the idea is to implement the feature in Bart and then run
And big sorry for making this PR so much harder for you now! But that Bart split had to happen sooner or later.
Surprisingly, the rebasing was super-simple, so it wasn't a hurdle at all.
If I switch to one of the original subclasses, say, MBart, and work with it instead - will the copy-checker complain just the same?
I'm afraid so, unless you remove all
Understood, thank you! It sounds like this change will make future development of the Bart family somewhat painful, since the developer will have to constantly sync multiple files with their new development, and it won't help the reviewers either, since now there will be multiple duplicated diffs. It'd be much more useful to run the check/sync periodically or at will, rather than enforcing it on each
Thinking more about the situation - the thing is, this PR works. I put a ton of work into it, and users could start using MP with the Bart family yesterday, e.g. with
I'd just need to enable
But it's your call.
That's one of the things to clean up: this flag is not necessary with the current API; we can detect if a model is parallelized and avoid confusion with the name. I'm not saying we should throw this PR in the trash, just that it should be paused until we have had time to do all the cleanup we want.
Do tell more? Are you planning to launch MP just because a model supports it? It sounds like you are considering dropping the
Or are we talking about different things?
tldr;
the full story: The issue is simple: things are complicated. This PR overlaps with #9323 - both have multiple changes and improvements, and I have already documented and commented on each of the changes in both PRs - well, actually 3 PRs (#9316 too) - so leaving such code spread out over several PRs is a recipe for a huge mess down the road. It all came to be because I was working over the holidays and wasn't getting feedback (no complaints, I'm just explaining how it came to be). As a result I was working on new changes, but with Bart, so that I could see how to generalize better. Not knowing what you'd decide, I tried to leave the existing code without any API changes, hence the separate independent PRs.

The bottom line is this: regardless of whether the current implementation is efficient or not, it works. And any future, more efficient implementation will use the same API on the user side (or perhaps something more complicated) - at the moment it's just one command to turn the feature on. So you actually can send users to this PR branch if they want to use MP with Bart only.

The other approach I could take is to merge parts of this PR into the t5-mp PR #9323, but that would again be a lot of work, and nobody has even looked at any of those PRs... But then we are talking about perhaps finding a more efficient solution, and perhaps deepspeed will render a lot of it pointless anyway... (Alex thinks not.) So why waste reviewers' time... it makes sense not to.

So yes, let's freeze this and I'll go back to working on deepspeed. I have convinced myself it's the right thing to do, and you got to hear my inner talk. Just remember it's totally functional in case someone needs it. Thank you for reading.
As t5 MP is broken in the trainer, I needed to see if it was the same with my Bart MP port - but it works:
So with this PR you can use
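For illustration only - the exact script and flags referred to above aren't preserved here - a hypothetical minimal way to drive a parallelized Bart through the `Trainer` from Python (the toy dataset is a stand-in for a real seq2seq dataset):

```python
# Sketch: fine-tune a model-parallelized Bart with the Trainer. Assumes the
# parallelize() API from this PR; the dataset below is a toy stand-in.
from torch.utils.data import Dataset
from transformers import BartForConditionalGeneration, BartTokenizer, Trainer, TrainingArguments

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.parallelize()  # spread the encoder/decoder layers over the available GPUs

class ToySeq2SeqDataset(Dataset):
    def __init__(self, n=16):
        enc = tokenizer(["model parallelism test"] * n, return_tensors="pt", padding=True)
        self.input_ids = enc["input_ids"]
        self.attention_mask = enc["attention_mask"]
    def __len__(self):
        return self.input_ids.size(0)
    def __getitem__(self, i):
        return {
            "input_ids": self.input_ids[i],
            "attention_mask": self.attention_mask[i],
            "labels": self.input_ids[i],
        }

args = TrainingArguments(output_dir="out", per_device_train_batch_size=2, num_train_epochs=1)

# With the model already parallelized, the trainer-side integration has to make
# sure the model is not moved to a single device or wrapped in DataParallel.
trainer = Trainer(model=model, args=args, train_dataset=ToySeq2SeqDataset())
trainer.train()
```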
As I was trying to see if I can find a way to utilize the idling GPUs, I ran these benchmarks - I haven't found anything useful yet, but the interesting finding is that while we get a huge performance hit with evaluation and beam size > 1, the training time is actually faster than the non-MP version, despite all the data copying. This PR beats master on training time almost by half, 8.6 sec vs 15.8 sec - but of course it has 2 gpus vs 1 gpu! But it beats even the DDP solution (10.6 sec) by 20%! So perhaps there is something good in here; we just need to understand why it is faster than DDP. Unfortunately I have an uneven GPU setup, so it's hard to get very useful benchmarks. Perhaps someone with 2 identical GPUs could re-run these and report back. For posterity, here are the results I'm getting with 1x 8GB and 1x 24GB gpus:
This issue has been automatically marked as stale and closed because it has not had recent activity. Thank you for your contributions. If you think this still needs to be addressed please comment on this thread.
too long. closing.
Hello @stas00, is there any update on BART-based model parallelism? Also, what about model.parallelize() for BlenderBot? Thanks.
This line of work has been abandoned as it's highly inefficient. Please use DeepSpeed, which works with any model: https://huggingface.co/docs/transformers/main/main_classes/deepspeed
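For anyone landing here, a minimal sketch of the DeepSpeed route with today's `Trainer` (the config values are illustrative; see the linked docs for complete configs and the recommended `deepspeed` launcher usage):

```python
# Sketch: hand a DeepSpeed config (ZeRO stage 2 here, as an example) to the
# Trainer via TrainingArguments; the values marked "auto" are filled in by the
# integration from the Trainer's own arguments.
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed=ds_config,  # a path to a ds_config.json file also works
)
# ...then build a Trainer with these args as usual and call trainer.train()
```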
This PR implements model parallelism (MP) in Bart.
This is the latest incarnation of the generalization of MP in `transformers`, based on @alexorona's original work. I have done some of it already in #9323 and this PR builds upon the other one. It's slightly complex what to merge when, but this PR is independent and can be merged on its own.

For reviewers I propose to read things in this order:
If all is in agreement, I propose:

- `deparallelize` - the latter is not really needed in practice, so will handle those when the dust around the design settles.
- `--model_parallel` flags
- `model_parallel_utils.py`
- `transformers` has implemented.

Actually, first we need to merge smaller bits:
So this PR:

- Implements MP in Bart based on discussions in all of the threads/PRs listed above. Only `BartForConditionalGeneration` at the moment, while we are sorting out the API. But the bulk of the work is done, since `BartModel` has it all in place.
- Switches to the concept of `main_device` rather than `(first|last)_device`, so the first device of the encoder becomes the main_device and almost everything happens there (`embeddings`, `lm_head`, etc.), and the other devices are used exclusively for encoder and decoder purposes.
- Switches to a more explicit `device_map` that can support non-symmetrical models (different number of layers in encoder and decoder). It can also handle different types of maps. See the demo at the end of this post for details.
- Further improves the magical `to()` functions that can operate on any type of variable except opaque objects. They can be used to put the inputs on the correct devices, either automatically via a `forward` decorator or explicitly inside `forward`. We could use either or both. (A minimal sketch of such a helper follows this list.)
- Adds a bunch of debug functions that make it easy to trace the device IDs of variables, params and whole layers.
- Further improves the device map validation function.
- Improves tests.
- Needs to remove `apex.normalization.FusedLayerNorm`, as it's buggy under MP (corrupts data), per "replacing apex.normalization.FusedLayerNorm with torch.nn.LayerNorm" #9377; a PR dedicated to the removal is "replace apex.normalization.FusedLayerNorm with torch.nn.LayerNorm" #9386.
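A minimal sketch of the kind of recursive `to()` helper meant in the list above - not the PR's actual implementation, just an illustration of the idea (walk common containers, move every tensor, leave opaque objects alone):

```python
# Move any tensors nested inside lists/tuples/dicts to `device`, returning the
# same container structure; anything that is not a tensor or a known container
# is returned untouched.
import torch

def recursive_to(obj, device):
    if torch.is_tensor(obj):
        return obj.to(device)
    if isinstance(obj, dict):
        return {k: recursive_to(v, device) for k, v in obj.items()}
    if isinstance(obj, tuple):
        return tuple(recursive_to(v, device) for v in obj)
    if isinstance(obj, list):
        return [recursive_to(v, device) for v in obj]
    return obj  # opaque objects are left alone

# Example: make sure a whole batch of forward() inputs lands on the main_device.
batch = {"input_ids": torch.ones(2, 8, dtype=torch.long),
         "labels": [torch.zeros(2, 8, dtype=torch.long)]}
batch = recursive_to(batch, "cpu")  # e.g. "cuda:0" on a GPU machine
```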
Here is a quick demo (you will need 2 gpus to run it):
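The original demo snippet is not reproduced here; the following is a minimal sketch of what such a demo could look like, assuming the `parallelize()`/`device_map` API described in this post (the exact map layout is illustrative and, as noted below, the format was still considered unstable):

```python
# Sketch only: parallelize Bart over 2 GPUs with an explicit, possibly
# non-symmetrical device_map, then run generation with inputs on the
# main_device (the first device of the encoder).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Hypothetical layout: all encoder layers on gpu 0, all decoder layers on gpu 1.
device_map = {
    "encoder": {0: list(range(model.config.encoder_layers))},
    "decoder": {1: list(range(model.config.decoder_layers))},
}
model.parallelize(device_map)  # or model.parallelize() to autogenerate the map

batch = tokenizer("Model parallelism test", return_tensors="pt").to("cuda:0")
out = model.generate(**batch, num_beams=2, max_length=10)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```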
You can see from the demo that when calling `model.parallelize` you can skip the `device_map` arg altogether and the model will generate the right one. Or you can provide one that:

the model transparently handles all the remappings
Note, the user still needs to put the data on the `main_device`, so perhaps that will eventually become not hardcoded via:

As we have been discussing elsewhere, the device map format is not stable yet. So I propose we document it as unstable for now, but users can rely on the autogenerated device map, which requires no input from the user (i.e. just calling `model.parallelize()`) - if the format changes, it'll happen transparently for the user.
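For reference, a rough sketch of how such an autogenerated, balanced map could be computed (a hypothetical helper, not this PR's code; the `{device_id: [layer indices]}` shape is an assumption):

```python
# Split n_layers as evenly as possible across the given devices and return a
# {device_id: [layer indices]} map - a hypothetical balanced layout.
def make_balanced_device_map(n_layers, device_ids):
    per_device, remainder = divmod(n_layers, len(device_ids))
    device_map, start = {}, 0
    for i, dev in enumerate(device_ids):
        count = per_device + (1 if i < remainder else 0)
        device_map[dev] = list(range(start, start + count))
        start += count
    return device_map

print(make_balanced_device_map(12, [0, 1]))
# {0: [0, 1, 2, 3, 4, 5], 1: [6, 7, 8, 9, 10, 11]}
```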
Also note that with Trainer-based scripts, like `finetune_trainer.py`, the user has no way to supply such a device map at the moment, so in effect the model generates the map on the fly as in the paragraph above.

Fixes: #8344
@LysandreJik, @patrickvonplaten, @sgugger, @alexorona