
[NPU] support global accumulator for adam #32780

Merged 10 commits into PaddlePaddle:develop on May 13, 2021

Conversation

zhiqiu (Contributor) commented May 7, 2021

PR types

Performance optimization

PR changes

OPs

Describe

[NPU] support global accumulator for adam

As described in #32605, on the ernie-3.0 model we can see several time bubbles between two consecutive adam kernels.
[profiler timeline screenshot]

There are mainly two problems that cost much time:

  1. updating beta1/beta2 pow;

  2. converting beta1/beta2/epsilon to NPU tensors.

PR #32605 solves problem 2, and this PR tries to solve problem 1.

Why

The original implementation of AdamOptimizer creates a beta1_pow and a beta2_pow for each parameter and updates them in each parameter's adam op.
It should be pointed out that the values of beta1_pow and beta2_pow are actually the same for every parameter.

This works fine in the adam CUDA kernel, since beta1_pow and beta2_pow can be updated quickly. In the NPU kernel, however, the update requires calling two mul ops per parameter, which costs much time. The sketch below illustrates the redundancy.
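
A minimal sketch of the redundancy, with hypothetical names (not the actual Paddle kernel code):

```python
# Hypothetical illustration of per-parameter vs. global beta_pow updates.
params = ["w0", "w1", "w2"]
beta1, beta2 = 0.9, 0.999

# Before: one (beta1_pow, beta2_pow) pair per parameter, updated inside
# each parameter's adam op -- on NPU this means two extra mul ops per
# parameter per step, even though every pair holds exactly the same value.
state = {p: {"beta1_pow": 1.0, "beta2_pow": 1.0} for p in params}
for p in params:
    state[p]["beta1_pow"] *= beta1
    state[p]["beta2_pow"] *= beta2

# After (use_global_beta_pow=True): a single global pair, updated once
# per step after all adam ops have run.
global_beta1_pow, global_beta2_pow = 1.0, 1.0
global_beta1_pow *= beta1
global_beta2_pow *= beta2
```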

How

So, we introduce a global beta_pow, which means only one beta1_pow and one beta2_pow are created for all the parameters of the whole model.
Specifically,

  • Add a bool attribute use_global_beta_pow to the adam op. If true, the outputs (Beta1PowOut, Beta2PowOut) will not be used in the adam op, and beta_pow will be updated after all adam ops in the model.
  • Add a bool parameter use_global_beta_pow to paddle.fluid.optimizer.Adam. If true, Adam will use a global beta_pow for the whole model instead of creating a beta_pow for each parameter (see the usage sketch after this list).
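
A minimal usage sketch, assuming the post-PR static-graph API; the toy network here is illustrative, and only the use_global_beta_pow argument comes from this PR:

```python
import paddle
import paddle.fluid as fluid

paddle.enable_static()

main_prog, startup_prog = fluid.Program(), fluid.Program()
with fluid.program_guard(main_prog, startup_prog):
    x = fluid.data(name='x', shape=[None, 13], dtype='float32')
    y = fluid.data(name='y', shape=[None, 1], dtype='float32')
    pred = fluid.layers.fc(input=x, size=1)
    loss = fluid.layers.mean(fluid.layers.square_error_cost(pred, y))

    # One global beta1_pow/beta2_pow pair for the whole model,
    # instead of one pair per parameter.
    opt = fluid.optimizer.Adam(learning_rate=1e-3,
                               use_global_beta_pow=True)
    opt.minimize(loss)
```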

As can be seen in the timeline, there is no mul between two consecutive ApplyAdam kernels.
[profiler timeline screenshot]

Performance

  • before: [throughput screenshot]

  • after: [throughput screenshot]

22211 -> 24481 tokens/s, i.e. (24481 - 22211) / 22211 ≈ +10%

paddle-bot-old bot commented May 7, 2021

Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

liym27 (Contributor) previously approved these changes May 10, 2021


LGTM

wanghuancoder (Contributor) left a comment


LGTM

@zhiqiu zhiqiu requested a review from phlrain May 13, 2021 01:58
@zhiqiu zhiqiu merged commit dace3fd into PaddlePaddle:develop May 13, 2021
zhaoyinglia pushed a commit to zhaoyinglia/Paddle that referenced this pull request on Sep 2, 2021:
* add use_global_beta_pow

* add use_global_beta_pow

* update npu kernel

* update python api

* refine code

* add ut for use_global_beta_pow

* fix npu kernel

* add ut for api

* add ut for exception

* add ut for save/load