
fix inplace bug when the first grad_var(loss_grad) is inplace var #37420

Merged: 4 commits merged into PaddlePaddle:develop on Nov 23, 2021

Conversation

@pangyoki (Contributor) commented Nov 22, 2021

PR types

Bug fixes

PR changes

Others

Describe

Problem

After an inplace operation is applied to a non-leaf node, running backward immediately afterwards produces wrong gradients for that node and every node before it. If, however, another node z is appended after that node and backward is run on z, the results are correct.

Examples (using paddle.Tensor.tanh_):

  • Without the inplace operation:
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh()  # not use inplace strategy
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[0.14130163 0.14130163]
# [0.14130163 0.14130163]]
  • Run backward immediately after the inplace operation: the result is wrong
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh_()  # use inplace strategy
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[2.1413016 2.1413016]
# [2.1413016 2.1413016]]
  • After the inplace operation, first append a non-inplace node, then run backward: the result is correct
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh_()  # use inplace strategy
l = l.sum()   # add non-inplace op node
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[0.14130163 0.14130163]
# [0.14130163 0.14130163]]

Root cause analysis

  • Observed behavior:
    Running backward immediately after an inplace operation means that the first grad_var of the backward pass (normally loss_grad) is an inplace var.
    Judging from the printed results and the logs, the error in this case is caused by an unnecessary gradient accumulation performed on that inplace var.

  • Why the unnecessary accumulation happens:
    In BasicEngine::Init(), when loss_grad is initialized and the first backward grad_node (init_node) is obtained, the grad_node of loss_grad is cleared right after init_node is fetched. Having no grad_node, loss_grad is then identified as a leaf node.
    Since loss_grad is an inplace var that appears multiple times in the network, and a leaf node that appears multiple times triggers gradient accumulation, the spurious accumulation occurs.

  • Conclusion:
    BasicEngine::Init() clears the grad_node of loss_grad, so a non-leaf node is mistaken for a leaf node, which causes the spurious gradient accumulation. The numeric sketch below checks this against the printed values.
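
As a quick sanity check (plain NumPy arithmetic, not Paddle code): the wrong value printed above, 2.1413016, is exactly the correct gradient plus the initial loss_grad of 1 propagated once more through y = x * 2.

import numpy as np

y = 2.0                          # value of x * 2 when x = 1
tanh_grad = 1 - np.tanh(y) ** 2  # tanh'(y), about 0.0706508
correct = tanh_grad * 2          # chain rule through y = x * 2
wrong = (tanh_grad + 1.0) * 2    # initial loss_grad of 1 accumulated once too often
print(correct)  # 0.14130163...
print(wrong)    # 2.1413016...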

Fix

In BasicEngine::Init(), do not clear the grad_node of loss_grad, so that loss_grad remains a non-leaf node during the backward pass.

This change made the test_custom_grad_input unit test fail, i.e., it conflicted with PR #34582.
The reason is that test_custom_grad_input needs gradients to be accumulated onto the user-supplied initial grad var (which may be an intermediate node). Once loss_grad changes from a leaf node to a non-leaf node, the gradient accumulator implementation changes. This PR fixes that as well by switching the supplied initial grad var to gradient aggregation (a usage sketch follows this list):

  • The leaf-node accumulator becomes the non-leaf accumulators_with_grad_node_
  • Gradient accumulation becomes gradient aggregation, with both ref_cnt and cur_cnt of the supplied initial grad var set to 1
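
For reference, a minimal sketch of the custom-grad-input scenario that the fix has to keep working. The paddle.autograd.backward call reflects the Paddle 2.x API as I understand it; treat it as illustrative:

import paddle

x = paddle.ones((2, 2))
x.stop_gradient = False
y = x * 2

# Start backward from a user-supplied initial grad for y; y is an
# intermediate node, so the engine must aggregate gradients onto it.
init_grad = paddle.full((2, 2), 0.5)
paddle.autograd.backward([y], [init_grad])

print(x.grad.numpy())  # each entry: 0.5 * 2 = 1.0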


// Create a gradient accumulator for the supplied initial grad var if it
// does not already have one.
if (!accumulator) {
  if (FLAGS_sort_sum_gradient) {
    accumulator.reset(new SortedGradientAccumulator(init_grad_var));
  } else {
    accumulator.reset(new EagerGradientAccumulator(init_grad_var));
  }
}
// Register the initial grad var itself as one gradient source, so later
// gradients are aggregated with its preset value.
accumulator->IncreaseRefCnt();
A Contributor commented on this snippet:

What do these two lines do?

@pangyoki (Contributor, Author) replied:

See PR #34582. In BasicEngine::Init(), if the supplied grad var (whose initial grad value has already been set) is an intermediate node, the backward pass needs to accumulate gradients onto that intermediate node.
Because this PR turns the supplied grad var from a leaf node into a non-leaf node, the original accumulation behavior changed, so its ref cnt and cur cnt have to be set explicitly to perform gradient aggregation.
You can think of the supplied grad var as already being the output of some op (a phantom op that does not exist). If this grad var later appears again as the output of an op in the network, its gradients can then be aggregated. A toy illustration follows.
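
A toy model of that ref_cnt/cur_cnt bookkeeping (plain Python, not Paddle's actual implementation): a var produced by N ops has ref_cnt == N, and its gradient is complete once cur_cnt == ref_cnt. Setting both counters to 1 for the supplied grad var treats it as the already-delivered output of one phantom op.

class ToyAccumulator:
    def __init__(self):
        self.ref_cnt = 0  # number of ops that produce this var
        self.cur_cnt = 0  # number of gradients received so far
        self.grad = 0.0

    def increase_ref_cnt(self):
        self.ref_cnt += 1

    def sum_grad(self, g):
        self.grad += g
        self.cur_cnt += 1

    def finished(self):
        return self.cur_cnt == self.ref_cnt

# The supplied initial grad var: output of one phantom op whose gradient
# (the user-provided initial value) counts as already delivered.
acc = ToyAccumulator()
acc.increase_ref_cnt()  # ref_cnt = 1
acc.sum_grad(0.5)       # cur_cnt = 1

# If the var later appears as the output of a real op, the new gradient is
# aggregated with the preset value instead of being double-counted.
acc.increase_ref_cnt()  # ref_cnt = 2
acc.sum_grad(1.0)       # cur_cnt = 2
print(acc.finished(), acc.grad)  # True 1.5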

@MingMingShangTian (Contributor) left a comment:

LGTM

@zhiqiu (Contributor) left a comment:

LGTM

@pangyoki merged commit ee1e164 into PaddlePaddle:develop on Nov 23, 2021

pangyoki added a commit to pangyoki/Paddle that referenced this pull request on Nov 23, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (PaddlePaddle#37420)

* fix inplace bug
* fix custom grad input error
* add unittest
* fix inplace bug

lanxianghit pushed a commit that referenced this pull request on Nov 25, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (#37420) (#37488)

fix inplace bug, cherry-pick of PR #37420

Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request on Dec 10, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (PaddlePaddle#37420)

* fix inplace bug
* fix custom grad input error
* add unittest
* fix inplace bug