
fix inplace bug when the first grad_var(loss_grad) is inplace var #37420

Merged: 4 commits merged into PaddlePaddle:develop on Nov 23, 2021

Conversation

@pangyoki (Contributor) commented Nov 22, 2021

PR types

Bug fixes

PR changes

Others

Describe

Problem

After an inplace operation is applied to a non-leaf node, running backward immediately afterwards produces wrong gradients for that node and every node before it. If, however, another node z is appended after that node and backward is run on z, the results are correct.

Examples (using paddle.Tensor.tanh_):

  • Without the inplace operation:
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh()  # not use inplace strategy
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[0.14130163 0.14130163]
# [0.14130163 0.14130163]]
  • Run backward immediately after the inplace operation: the result is wrong
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh_()  # use inplace strategy
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[2.1413016 2.1413016]
# [2.1413016 2.1413016]]
  • After the inplace operation, first append a non-inplace node, then run backward: the result is correct
import paddle

x = paddle.ones((2,2))
x.stop_gradient = False

y = x * 2
l = y.tanh_()  # use inplace strategy
l = l.sum()   # add non-inplace op node
l.backward()

print("paddle x.grad: ", x.grad.numpy())
# paddle x.grad:  [[0.14130163 0.14130163]
# [0.14130163 0.14130163]]

Root cause analysis

  • Observed behavior:
    Running backward immediately after an inplace operation means that the first grad_var of the backward pass (normally loss_grad) is an inplace var.
    Judging from the printed results and the logs, the error in this case is caused by an unnecessary gradient accumulation performed on that inplace var.

  • Why the unnecessary accumulation happens:
    In BasicEngine::Init(), when loss_grad is initialized and the first backward grad_node (init_node) is obtained, the grad_node of loss_grad is cleared right after init_node is fetched. Having no grad_node, loss_grad is then identified as a leaf node.
    Since loss_grad is an inplace var that appears multiple times in the network, and a leaf node that appears multiple times triggers gradient accumulation, the spurious accumulation occurs.

  • Conclusion:
    BasicEngine::Init() clears the grad_node of loss_grad, so a non-leaf node is mistaken for a leaf node, which causes the spurious gradient accumulation. The numeric sketch below checks this against the printed values.
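
As a quick sanity check (plain NumPy arithmetic, not Paddle code): the wrong value printed above, 2.1413016, is exactly the correct gradient plus the initial loss_grad of 1 propagated once more through y = x * 2.

import numpy as np

y = 2.0                          # value of x * 2 when x = 1
tanh_grad = 1 - np.tanh(y) ** 2  # tanh'(y), about 0.0706508
correct = tanh_grad * 2          # chain rule through y = x * 2
wrong = (tanh_grad + 1.0) * 2    # initial loss_grad of 1 accumulated once too often
print(correct)  # 0.14130163...
print(wrong)    # 2.1413016...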

Fix

In BasicEngine::Init(), do not clear the grad_node of loss_grad, so that loss_grad remains a non-leaf node during the backward pass.

This change made the test_custom_grad_input unit test fail, i.e., it conflicted with PR #34582.
The reason is that test_custom_grad_input needs gradients to be accumulated onto the user-supplied initial grad var (which may be an intermediate node). Once loss_grad changes from a leaf node to a non-leaf node, the gradient accumulator implementation changes. This PR fixes that as well by switching the supplied initial grad var to gradient aggregation (a usage sketch follows this list):

  • The leaf-node accumulator becomes the non-leaf accumulators_with_grad_node_
  • Gradient accumulation becomes gradient aggregation, with both ref_cnt and cur_cnt of the supplied initial grad var set to 1
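
For reference, a minimal sketch of the custom-grad-input scenario that the fix has to keep working. The paddle.autograd.backward call reflects the Paddle 2.x API as I understand it; treat it as illustrative:

import paddle

x = paddle.ones((2, 2))
x.stop_gradient = False
y = x * 2

# Start backward from a user-supplied initial grad for y; y is an
# intermediate node, so the engine must aggregate gradients onto it.
init_grad = paddle.full((2, 2), 0.5)
paddle.autograd.backward([y], [init_grad])

print(x.grad.numpy())  # each entry: 0.5 * 2 = 1.0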


// Create a gradient accumulator for the supplied initial grad var if it
// does not already have one.
if (!accumulator) {
  if (FLAGS_sort_sum_gradient) {
    accumulator.reset(new SortedGradientAccumulator(init_grad_var));
  } else {
    accumulator.reset(new EagerGradientAccumulator(init_grad_var));
  }
}
// Register the initial grad var itself as one gradient source, so later
// gradients are aggregated with its preset value.
accumulator->IncreaseRefCnt();
A Contributor commented on this snippet:

What do these two lines do?

@pangyoki (Contributor, Author) replied:

See PR #34582. In BasicEngine::Init(), if the supplied grad var (whose initial grad value has already been set) is an intermediate node, the backward pass needs to accumulate gradients onto that intermediate node.
Because this PR turns the supplied grad var from a leaf node into a non-leaf node, the original accumulation behavior changed, so its ref cnt and cur cnt have to be set explicitly to perform gradient aggregation.
You can think of the supplied grad var as already being the output of some op (a phantom op that does not exist). If this grad var later appears again as the output of an op in the network, its gradients can then be aggregated. A toy illustration follows.
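
A toy model of that ref_cnt/cur_cnt bookkeeping (plain Python, not Paddle's actual implementation): a var produced by N ops has ref_cnt == N, and its gradient is complete once cur_cnt == ref_cnt. Setting both counters to 1 for the supplied grad var treats it as the already-delivered output of one phantom op.

class ToyAccumulator:
    def __init__(self):
        self.ref_cnt = 0  # number of ops that produce this var
        self.cur_cnt = 0  # number of gradients received so far
        self.grad = 0.0

    def increase_ref_cnt(self):
        self.ref_cnt += 1

    def sum_grad(self, g):
        self.grad += g
        self.cur_cnt += 1

    def finished(self):
        return self.cur_cnt == self.ref_cnt

# The supplied initial grad var: output of one phantom op whose gradient
# (the user-provided initial value) counts as already delivered.
acc = ToyAccumulator()
acc.increase_ref_cnt()  # ref_cnt = 1
acc.sum_grad(0.5)       # cur_cnt = 1

# If the var later appears as the output of a real op, the new gradient is
# aggregated with the preset value instead of being double-counted.
acc.increase_ref_cnt()  # ref_cnt = 2
acc.sum_grad(1.0)       # cur_cnt = 2
print(acc.finished(), acc.grad)  # True 1.5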

@MingMingShangTian (Contributor) left a comment:

LGTM

@zhiqiu (Contributor) left a comment:

LGTM

@pangyoki merged commit ee1e164 into PaddlePaddle:develop on Nov 23, 2021

pangyoki added a commit to pangyoki/Paddle that referenced this pull request on Nov 23, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (PaddlePaddle#37420)

* fix inplace bug
* fix custom grad input error
* add unittest
* fix inplace bug

lanxianghit pushed a commit that referenced this pull request on Nov 25, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (#37420) (#37488)

fix inplace bug, cherry-pick of PR #37420

Zjq9409 pushed a commit to Zjq9409/Paddle that referenced this pull request on Dec 10, 2021:
fix inplace bug when the first grad_var(loss_grad) is inplace var (PaddlePaddle#37420)

* fix inplace bug
* fix custom grad input error
* add unittest
* fix inplace bug