-
Notifications
You must be signed in to change notification settings - Fork 27.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
accelerate deepspeed and gradient accumulation integrate #23236
Conversation
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Looks great, once tests pass fully :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG2M, one clarification which indeed is true
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this. Is the diff longer than expected because of other PRs to be merged before?
Might be cool to Have Stas have a look (not pinging him here too early) once this is ready to merge and tests are confirmed to all pass.
Due to updating from main, it is not showing the diff wrt previous branches. Weird.
Yes, definitely. All tests are passing already. Checked the slow tests offline. |
…elerate-deepspeed-integrate
@sgugger, now the diff is only specific to DeepSpeed changes + gradient accumulation changes + saving/loading changes wrt previous PR. |
Hello @stas00, please review this PR which aims to shift the accelerate handling in Trainer to Accelerate. Thank you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pacman100, I think the main concern is that this PR appears to be breaking BC in at least a few places - please correct me if I'm wrong. I think those cases are super minor and probably won't cause too many breakages for the users if any. I'll leave up to you to decide.
As I don't know all the nuances of Accelerate's Deepspeed integration I can't do a detailed review, but if all the SLOW tests pass it should be good.
request: as Accelerate is taking over please remove me from the Issues/PR templates as deepspeed integration maintainer while you're at it, since I won't be able to support users any longer.
Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this! Looks great
…#23236) * mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * `refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixex * fix saving * shift torch dynamo handling to accelerate * shift deepspeed integration and save & load utils to accelerate * fix accelerate launcher support * oops * fix 🐛 * save ckpt fix * Trigger CI * nasty 🐛 😅 * as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate * make tests happy * quality ✨ * loss tracked needs to account for grad_acc * fixing the deepspeed tests * quality ✨ * 😅😅😅 * tests 😡 * quality ✨ * Trigger CI * resolve comments and fix the issue with the previous merge from branch * Trigger CI * accelerate took over deepspeed integration --------- Co-authored-by: Stas Bekman <stas@stason.org>
…#23236) * mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * `refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixex * fix saving * shift torch dynamo handling to accelerate * shift deepspeed integration and save & load utils to accelerate * fix accelerate launcher support * oops * fix 🐛 * save ckpt fix * Trigger CI * nasty 🐛 😅 * as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate * make tests happy * quality ✨ * loss tracked needs to account for grad_acc * fixing the deepspeed tests * quality ✨ * 😅😅😅 * tests 😡 * quality ✨ * Trigger CI * resolve comments and fix the issue with the previous merge from branch * Trigger CI * accelerate took over deepspeed integration --------- Co-authored-by: Stas Bekman <stas@stason.org>
…#23236) * mixed precision support via accelerate * fix issues * fix for the sharded ddp case * fix flax and tf failing tests * `refactor the place to create `Accelerator` object * move ddp prep to accelerate * fix 😅 * resolving comments * move fsdp handling to accelerate * fixex * fix saving * shift torch dynamo handling to accelerate * shift deepspeed integration and save & load utils to accelerate * fix accelerate launcher support * oops * fix 🐛 * save ckpt fix * Trigger CI * nasty 🐛 😅 * as deepspeed needs grad_acc fixes, transfer grad_acc to accelerate * make tests happy * quality ✨ * loss tracked needs to account for grad_acc * fixing the deepspeed tests * quality ✨ * 😅😅😅 * tests 😡 * quality ✨ * Trigger CI * resolve comments and fix the issue with the previous merge from branch * Trigger CI * accelerate took over deepspeed integration --------- Co-authored-by: Stas Bekman <stas@stason.org>
What does this PR do?
accelerate launch
with trainer for DeepSpeed, e.g.:Usual run using
torchrun
and trainer args is unimpacted: