fix inplace bug when the first grad_var(loss_grad) is inplace var #37420
Conversation
Thanks for your contribution!
    if (!accumulator) {
      if (FLAGS_sort_sum_gradient) {
        accumulator.reset(new SortedGradientAccumulator(init_grad_var));
      } else {
        accumulator.reset(new EagerGradientAccumulator(init_grad_var));
      }
    }
    accumulator->IncreaseRefCnt();
What is the purpose of these two lines?
See PR #34582. In BasicEngine::Init(), if the passed-in grad var (whose initial grad value has already been set) is an intermediate node, the backward pass needs to accumulate gradients into that intermediate node.
Because this PR turns the passed-in grad var from a leaf node into a non-leaf node, the original gradient accumulation behavior changed, so its ref cnt and cur cnt must be set so that gradient aggregation can be performed.
You can think of it this way: the passed-in grad var is already the output of some op (an op that does not actually exist); if this grad var later appears again as the output of an op in the network, gradient aggregation can then be applied to it.
LGTM
LGTM
…ad) is inplace var (PaddlePaddle#37420) * fix inplace bug * fix custom grad input error * add unittest * fix inplace bug
PR types
Bug fixes
PR changes
Others
Describe
Problem
After an inplace operation is applied to a non-leaf node and backward is run immediately afterwards, the gradients of that node and of all earlier nodes are wrong; but if one more node z is appended after that node and backward is run on z, the results are correct.
Example (using paddle.Tensor.tanh_):
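A minimal sketch of the two cases, assuming Paddle dygraph mode; the tensors, values, and the extra node z are illustrative and not taken from the original PR:

    import paddle

    # Case 1: backward runs directly on the inplace-modified non-leaf node.
    x = paddle.to_tensor([1.0, 2.0, 3.0], stop_gradient=False)
    y = x * 2              # y is a non-leaf (intermediate) node
    y.tanh_()              # inplace op on the non-leaf node
    y.backward()           # before this fix, x.grad could be wrong here
    print(x.grad)

    # Case 2: one more node z is appended after the inplace op, and backward
    # starts from z instead of the inplace var; this path was already correct.
    x2 = paddle.to_tensor([1.0, 2.0, 3.0], stop_gradient=False)
    y2 = x2 * 2
    y2.tanh_()
    z = y2 * 1
    z.backward()
    print(x2.grad)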
Root cause analysis
Symptom analysis:
Running backward immediately after an inplace operation means that the first grad_var of the backward pass (normally loss_grad) is an inplace var.
From the printed results and the logs, the error in this case is caused by an unnecessary gradient accumulation being performed on that inplace var.
The unnecessary gradient accumulation happens because:
In BasicEngine::Init(), when loss_grad is initialized and the first backward grad_node (init_node) is obtained, the grad_node of loss_grad is cleared after init_node has been fetched. loss_grad then has no grad_node and is identified as a leaf node. Because loss_grad is an inplace var, it appears multiple times in the network, and a leaf node that appears multiple times triggers gradient accumulation.
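For reference, the leaf vs. non-leaf distinction this analysis relies on is also visible from the user-level API; a small sketch, assuming Paddle dygraph mode and the paddle.Tensor.is_leaf property:

    import paddle

    a = paddle.to_tensor([1.0], stop_gradient=False)
    b = a * 2
    print(a.is_leaf)   # True  -- created by the user, has no grad_node
    print(b.is_leaf)   # False -- produced by an op, carries a grad_node
    # Clearing loss_grad's grad_node in BasicEngine::Init() effectively turned
    # a "b-like" tensor into an "a-like" one, which is what triggered the
    # unwanted gradient accumulation.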
Conclusion:
In BasicEngine::Init(), the grad_node of loss_grad was cleared, so a non-leaf node was mistaken for a leaf node, which triggered the gradient accumulation.
Fix
In BasicEngine::Init(), do not clear the grad_node of loss_grad, so that loss_grad remains a non-leaf node during the backward pass. On its own, this change breaks the test_custom_grad_input unit test, i.e. it conflicts with PR #34582: test_custom_grad_input needs gradient accumulation on the user-supplied initial grad var (which may be an intermediate node), and once loss_grad changes from a leaf node to a non-leaf node, the implementation of its gradient accumulator changes. This PR fixes that as well, by switching the gradient accumulator of the passed-in initial grad var to the gradient aggregation approach.
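As a rough illustration of the scenario test_custom_grad_input exercises, a sketch assuming paddle.autograd.backward accepts a user-supplied initial grad via its grad_tensors argument; tensors and values are illustrative:

    import paddle

    x = paddle.to_tensor([1.0, 2.0], stop_gradient=False)
    y = x * 3

    # The initial grad var may itself be an intermediate node (the output of an
    # op), which is exactly the case where the engine must aggregate gradients
    # into it (see PR #34582).
    base = paddle.to_tensor([0.5, 0.5], stop_gradient=False)
    init_grad = base * 2

    paddle.autograd.backward([y], grad_tensors=[init_grad])
    print(x.grad)   # expected 3 * (base * 2) = [3.0, 3.0]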