[pure dp performance] Optimize fused allreduce in raw program #34509

Merged 5 commits on Aug 11, 2021

@FeixLiu commented Jul 30, 2021

PR types

Performance optimization

PR changes

Others

Describe

  1. Increase the overlap between the calc stream and the comm stream after fusing the allreduce vars.
  2. Move coalesce_tensor ahead of gradient generation, so that copy_data is no longer needed.
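
For readers unfamiliar with the two ideas, here is a minimal pure-Python sketch, not the Paddle implementation. It assumes a fused buffer that is allocated before the gradients exist, with each gradient being a view into it, and a single worker thread standing in for the comm stream. The names `coalesce`, `fake_allreduce`, and `backward` are hypothetical, and Python threads only illustrate the ordering, not real GPU stream overlap.

```python
from array import array
from concurrent.futures import ThreadPoolExecutor

WORLD_SIZE = 8  # mirrors the dp=8 test setup; illustrative only


def coalesce(sizes):
    """Allocate one flat buffer up front and return it with a view per gradient."""
    fused = array("f", [0.0] * sum(sizes))
    whole = memoryview(fused)
    views, offset = [], 0
    for n in sizes:
        views.append(whole[offset:offset + n])
        offset += n
    return fused, views


def fake_allreduce(fused):
    """Stand-in for a sum-allreduce in which every rank holds identical values."""
    for i in range(len(fused)):
        fused[i] *= WORLD_SIZE
    return fused


def backward(buckets):
    """buckets: list of buckets, each a list of per-gradient value lists."""
    pending = []
    with ThreadPoolExecutor(max_workers=1) as comm_stream:
        for bucket in buckets:
            fused, views = coalesce([len(g) for g in bucket])
            # "Calc stream": gradients are written straight into the views,
            # so nothing has to be copied into the fused buffer afterwards.
            for view, values in zip(views, bucket):
                for i, v in enumerate(values):
                    view[i] = v
            # Hand the finished bucket to the "comm stream" and keep computing.
            pending.append(comm_stream.submit(fake_allreduce, fused))
        return [f.result().tolist() for f in pending]
```

Calling `backward([[[1.0, 2.0], [3.0]], [[0.5]]])` reduces two fused buckets; the second bucket is filled while the first is already queued for communication.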

Test

GPT model with dp=8

|      | Without Fuse    | With Fuse       | Gain  |
|------|-----------------|-----------------|-------|
| FP32 | 68889 tokens/s  | 70822 tokens/s  | +2.3% |
| AMP  | 123968 tokens/s | 133408 tokens/s | +7.6% |
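
As a quick sanity check on the Gain column, the snippet below recomputes it from the throughput figures, assuming gain is defined as (with fuse - without fuse) / without fuse. The helper name `gain_percent` is hypothetical. Under that assumption the AMP row matches the reported +7.6%, while the FP32 figures work out to about +2.8% rather than the listed +2.3%.

```python
def gain_percent(baseline, optimized):
    """Relative throughput gain over the baseline, in percent."""
    return (optimized - baseline) / baseline * 100.0


# Throughput in tokens/s, taken from the test results above.
results = {"FP32": (68889, 70822), "AMP": (123968, 133408)}

for name, (without_fuse, with_fuse) in results.items():
    print(f"{name}: {gain_percent(without_fuse, with_fuse):+.1f}%")
```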

FP32 Loss curve
[Figure: loss curve screenshot]
Max Diff: 0.038

FP16 Loss curve
[Figure: loss curve screenshot]
Max Diff: 0.073

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@FeixLiu force-pushed the allreduce_optim branch 2 times, most recently from fce35ac to fc3458f, on August 5, 2021 01:40
@FeixLiu force-pushed the allreduce_optim branch 2 times, most recently from c82093f to 11928ea, on August 10, 2021 06:49

@wangxicoding left a comment

LGTM

@wangxicoding changed the title from "Allreduce optim" to "Optimize fused allreduce in raw program" on Aug 11, 2021
@wangxicoding merged commit 4d2994c into PaddlePaddle:develop on Aug 11, 2021
@FeixLiu deleted the allreduce_optim branch on August 11, 2021 06:29
@FeixLiu changed the title from "Optimize fused allreduce in raw program" to "[pure dp performance] Optimize fused allreduce in raw program" on Oct 11, 2021