[hybrid enhance] add flag to control the avg position for grad merge under pipeline mode #36384
Merged
wangxicoding merged 13 commits into PaddlePaddle:develop from FeixLiu:pp_grad_merge_more_option on Oct 14, 2021
+263 −2
Conversation
Thanks for your contribution!
Review threads on python/paddle/distributed/fleet/meta_optimizers/sharding_optimizer.py (outdated, resolved)
wangxicoding approved these changes on Oct 13, 2021
LGTM
# the gradient to meet the logic for gradient merge under pure dp.
tmp_first_opt_idx = None
for idx, op in enumerate(main_block.ops):
    if is_optimizer_op(op) and op.type != 'c_sync_comm_stream':
It is also fine to insert this before c_sync_comm_stream.
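For context, a self-contained sketch of the pattern in the quoted snippet: scan the block's op list for the first optimizer op (skipping the c_sync_comm_stream sync op) so the scaling op can be inserted at that position. The helper name and its is_optimizer_op parameter here are illustrative, not Paddle's API.

```python
# Illustrative sketch only, not the PR's code: return the index of the first
# optimizer op that is not the communication-stream sync op, or None if no
# such op exists. The scaling op would then be inserted at that index.
def find_first_opt_idx(ops, is_optimizer_op):
    for idx, op in enumerate(ops):
        if is_optimizer_op(op) and op.type != 'c_sync_comm_stream':
            return idx
    return None
```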
FeixLiu added a commit to FeixLiu/Paddle that referenced this pull request on Nov 25, 2021: …under pipeline mode (PaddlePaddle#36384)
PR types
New features
PR changes
Others
Describe
Support the pure-DP style "average after merging" behavior for gradient merge in pipeline (pp) mode.
Add a new flag inside gradient_scale_configs:
scale_gradient
, whose default value is False, to control where the averaging for gradient merge is performed. If
gradient_scale_configs['scale_gradient'] = False
, the pipeline scales loss@Grad (divides it by the number of micro steps) at each micro step and then does the gradient merge, i.e. the gradient-merge average is applied before the sum. Otherwise, if
gradient_scale_configs['scale_gradient'] = True
, the pipeline does the gradient merge first; then, during the optimize stage, the merged gradient is divided by the number of micro steps, i.e. the gradient-merge average is applied after the sum. How to use:
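For illustration, a minimal sketch of how the flag might be set through fleet's DistributedStrategy. Apart from gradient_scale_configs['scale_gradient'] itself, the surrounding pipeline configuration (accumulate_steps, micro_batch_size, optimizer wiring) is an assumed typical setup, not taken from this PR.

```python
import paddle
import paddle.distributed.fleet as fleet

# Hypothetical sketch; only the 'scale_gradient' flag comes from this PR.
# Pipeline parallelism with fleet meta optimizers runs in static-graph mode.
paddle.enable_static()

strategy = fleet.DistributedStrategy()
strategy.pipeline = True
strategy.pipeline_configs = {"accumulate_steps": 4, "micro_batch_size": 2}
# True  -> merge (sum) the micro-step gradients first, divide by the number
#          of micro steps in the optimize stage (average after sum).
# False -> default: scale loss@Grad at every micro step, then merge
#          (average before sum).
strategy.gradient_scale_configs = {"scale_gradient": True}

fleet.init(is_collective=True, strategy=strategy)
# optimizer = fleet.distributed_optimizer(paddle.optimizer.SGD(...), strategy=strategy)
# optimizer.minimize(loss)
```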
Loss comparison under fp16_allreduce:
