[pure dp performance] Optimize fused allreduce in raw program #34509

FeixLiu · 2021-07-30T08:05:51Z

PR types

Performance optimization

PR changes

Others

Describe

Increase the overlap between the calc stream and the comm stream after fusing the allreduce vars.
Move coalesce_tensor ahead of gradient generation, so that the gradients are produced directly in the fused buffer and copy_data is no longer needed.
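
The idea of coalescing before the gradients exist can be illustrated with a small plain-Python analogy. This is only a sketch, not the Paddle implementation: `make_fused_views` and `fake_allreduce_sum` are hypothetical helper names, and the standard-library `array`/`memoryview` types stand in for device tensors. It assumes the point of the change is that gradients written through views of a pre-allocated contiguous buffer need no later copy into that buffer.

```python
from array import array


def make_fused_views(sizes):
    """Allocate one contiguous buffer and hand out one writable view per gradient."""
    fused = array("f", [0.0] * sum(sizes))
    whole = memoryview(fused)
    views, offset = [], 0
    for n in sizes:
        views.append(whole[offset:offset + n])  # shares memory with `fused`
        offset += n
    return fused, views


def fake_allreduce_sum(rank_buffers):
    """Stand-in for allreduce (hypothetical): element-wise sum across ranks."""
    return [sum(vals) for vals in zip(*rank_buffers)]


# Two hypothetical ranks, each holding two gradients of sizes 2 and 3.
rank_buffers = []
for scale in (1.0, 10.0):
    fused, grads = make_fused_views([2, 3])
    # "Backward pass": each gradient is written straight into the fused buffer,
    # so no separate copy step is needed before communication.
    grads[0][:] = array("f", [1.0 * scale, 2.0 * scale])
    grads[1][:] = array("f", [3.0 * scale, 4.0 * scale, 5.0 * scale])
    rank_buffers.append(fused)

# One reduction over the whole fused buffer instead of one per gradient.
reduced = fake_allreduce_sum(rank_buffers)
```

In this analogy, allocating the buffer first plays the role of running coalesce_tensor before grad generation, and the absence of any copy into `fused` corresponds to copy_data not being needed.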

Test

GPT model with dp=8

| Precision | Without fuse | With fuse | Gain |
|---|---|---|---|
| FP32 | 68889 tokens/s | 70822 tokens/s | +2.3% |
| AMP | 123968 tokens/s | 133408 tokens/s | +7.6% |

FP32 Loss curve

Max Diff: 0.038

FP16 Loss curve

Max Diff: 0.073

paddle-bot-old · 2021-07-30T08:05:57Z

Thanks for your contribution!
Please wait for the CI results first. See Paddle CI Manual for details.

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py

wangxicoding

LGTM

FeixLiu force-pushed the allreduce_optim branch from 15f25c8 to bc2dad7 on July 30, 2021 08:06

FeixLiu force-pushed the allreduce_optim branch from bc2dad7 to 9da8eb4 on July 30, 2021 09:48

FeixLiu force-pushed the allreduce_optim branch from 9da8eb4 to 2f2ddb1 on August 2, 2021 07:21

FeixLiu force-pushed the allreduce_optim branch from 2f2ddb1 to 0c63f62 on August 2, 2021 07:50

wangxicoding requested changes Aug 2, 2021

FeixLiu force-pushed the allreduce_optim branch 2 times, most recently from c54345a to ed8f7eb on August 3, 2021 05:19

FeixLiu force-pushed the allreduce_optim branch from ed8f7eb to c10ad6b on August 3, 2021 05:27

FeixLiu force-pushed the allreduce_optim branch from c10ad6b to 59d652d on August 3, 2021 05:52

FeixLiu force-pushed the allreduce_optim branch 6 times, most recently from bd8cf8f to 6c8b8ab on August 3, 2021 10:23

FeixLiu force-pushed the allreduce_optim branch from 6c8b8ab to 26158cc on August 4, 2021 01:23

wangxicoding reviewed Aug 4, 2021

FeixLiu force-pushed the allreduce_optim branch from 8c04ed8 to 8ecd256 on August 4, 2021 06:31

FeixLiu force-pushed the allreduce_optim branch from 8ecd256 to 83a3098 on August 4, 2021 06:46

FeixLiu force-pushed the allreduce_optim branch from 83a3098 to f4fa1a3 on August 4, 2021 08:17

wangxicoding requested changes Aug 4, 2021

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

wangxicoding reviewed Aug 4, 2021

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

wangxicoding reviewed Aug 4, 2021

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

FeixLiu force-pushed the allreduce_optim branch 2 times, most recently from fce35ac to fc3458f on August 5, 2021 01:40

wangxicoding reviewed Aug 5, 2021

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

FeixLiu added 4 commits August 10, 2021 10:09

optim the allreduce_fuse, test=allcase

6a440af

set copy data to false, test=allcase

e5e22b8

update the allreduce fusion, test=allcase

bbfc314

resolve the new logic problem, test=allcase

1b12a43

FeixLiu force-pushed the allreduce_optim branch 2 times, most recently from c82093f to 11928ea on August 10, 2021 06:49

wangxicoding reviewed Aug 11, 2021

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

python/paddle/distributed/fleet/meta_optimizers/raw_program_optimizer.py (outdated)

bugs fix for before_idx

abbae2e

FeixLiu force-pushed the allreduce_optim branch from 11928ea to abbae2e on August 11, 2021 02:57

wangxicoding approved these changes Aug 11, 2021

wangxicoding changed the title ~~Allreduce optim~~ Optimize fused allreduce in raw program Aug 11, 2021

wangxicoding merged commit 4d2994c into PaddlePaddle:develop Aug 11, 2021

FeixLiu deleted the allreduce_optim branch August 11, 2021 06:29

FeixLiu changed the title ~~Optimize fused allreduce in raw program~~ [pure dp performance] Optimize fused allreduce in raw program Oct 11, 2021
