Update trainer for easier handling of accumulate, compile fixes, and proper reporting #34511
To explain what is going on here, @Rocketknight1: during DDP we use `model.no_sync()` so that gradients are only communicated across all GPUs on the next step taken outside that context. This speeds up training by skipping synchronization that is not needed while doing gradient accumulation. `accelerator.no_sync()` is the lower-level API behind `accumulate()`, and it makes that operation backend-independent (so on a single GPU it is just a `nullcontext`).
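The pattern described above can be sketched in plain Python. This is only an illustration, not the Trainer or Accelerate implementation: `FakeDDPModel`, `sync_context`, and `run_micro_batches` are hypothetical names, and the fake model merely records whether gradient synchronization would have happened on each backward pass.

```python
from contextlib import contextmanager, nullcontext


class FakeDDPModel:
    """Hypothetical stand-in for a DDP-wrapped model (not the real torch class)."""

    def __init__(self):
        self.require_sync = True  # whether backward would all-reduce gradients
        self.sync_log = []        # one entry per backward call

    @contextmanager
    def no_sync(self):
        # Temporarily disable gradient synchronization, then restore it.
        previous = self.require_sync
        self.require_sync = False
        try:
            yield
        finally:
            self.require_sync = previous

    def backward(self):
        # Record whether this backward pass would have communicated.
        self.sync_log.append(self.require_sync)


def sync_context(model, is_sync_step):
    """Return a context that skips synchronization on non-sync steps.

    Falls back to nullcontext when the model has no `no_sync` method
    (the single-device case) or when this step must synchronize.
    """
    if is_sync_step or not hasattr(model, "no_sync"):
        return nullcontext()
    return model.no_sync()


def run_micro_batches(model, num_micro_batches, accumulation_steps):
    """Run backward on each micro-batch, syncing only every `accumulation_steps`."""
    for i in range(num_micro_batches):
        is_sync_step = (i + 1) % accumulation_steps == 0
        with sync_context(model, is_sync_step):
            model.backward()
    return model.sync_log
```

With four micro-batches and two accumulation steps, only the second and fourth backward passes fall outside `no_sync()`, so only those would communicate gradients across devices.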
I'm confused that the loss is no longer multiplied by the gradient accumulation steps here, because the loss has been multiplied by the data parallel size in https://github.com/huggingface/transformers/pull/34511/files#diff-ed55888e6665791fe92cc8fc0c499da54f4ace6738551cd9a2591881cda076deR3702-R3703
I see, it should be solved in #35207