fix gpt2 train loss NaN problem by adding a line __syncthreads in BlockR… #33658
Conversation
Thanks for your contribution!
GPT training loss becomes NaN since PR #33420.
d_scale_partial = BlockReduceSum<U>(d_scale_partial);
d_bias_partial = BlockReduceSum<U>(d_bias_partial);
__shared__ U shared_scale[32];
__shared__ U shared_bias[32];
Would using kMaxBlockDim/lwarpSize instead of 32 save more shared memory?
Discussed offline: the overall shared memory budget is large enough, so 32 can be kept.
LGTM
LGTM
LGTM
PR types
Bug fixes
PR changes
OPs
Describe
Background:
During GPT-2 training, the loss became unstable, failed to converge, and eventually turned into NaN.
Investigation findings:
1) Training was normal on P40 but abnormal on V100.
2) Training was normal when one extra line of log printing was added, and abnormal without it.
3) Training was normal with the original linear accumulation, and abnormal when using BlockReduceSum.
The problem was finally resolved by removing the static shared memory `shared` and adding a line of __syncthreads.
__syncthreads was also added around the other two BlockReduceSum calls to improve reliability.
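For illustration, below is a minimal CUDA sketch of the pattern described above, assuming a warp-shuffle based BlockReduceSum. It is not the exact Paddle kernel: the helper WarpReduceSum, the kernel name SumScaleAndBiasGrad, and the parameter names are assumptions made for this example, and it assumes blockDim.x is a multiple of 32. The two key points match the diff: BlockReduceSum writes its per-warp partials into an externally declared shared buffer rather than a static __shared__ array inside the function, and a __syncthreads() separates back-to-back reductions so a warp that has already finished the first reduction cannot overwrite shared memory that slower threads are still reading.

template <typename U>
__device__ __forceinline__ U WarpReduceSum(U val) {
  // Tree reduction within a warp using shuffle-down.
  for (int offset = 16; offset > 0; offset >>= 1) {
    val += __shfl_down_sync(0xffffffffu, val, offset);
  }
  return val;
}

template <typename U>
__device__ U BlockReduceSum(U val, U *shared) {
  // shared must hold one slot per warp (at most 32 slots for 1024 threads).
  const int lane = threadIdx.x & 31;
  const int wid = threadIdx.x >> 5;

  val = WarpReduceSum(val);           // reduce within each warp
  if (lane == 0) shared[wid] = val;   // publish per-warp partial sums
  __syncthreads();

  // Warp 0 reduces the per-warp partials.
  const int num_warps = (blockDim.x + 31) >> 5;
  val = (threadIdx.x < num_warps) ? shared[lane] : static_cast<U>(0);
  if (wid == 0) val = WarpReduceSum(val);
  return val;  // valid on thread 0 of the block
}

template <typename U>
__global__ void SumScaleAndBiasGrad(const U *d_scale_in, const U *d_bias_in,
                                    U *d_scale_out, U *d_bias_out, int n) {
  // External shared buffers, as in the quoted diff, instead of a static
  // __shared__ array hidden inside BlockReduceSum.
  __shared__ U shared_scale[32];
  __shared__ U shared_bias[32];

  const int idx = blockIdx.x * blockDim.x + threadIdx.x;
  U d_scale_partial = idx < n ? d_scale_in[idx] : static_cast<U>(0);
  U d_bias_partial = idx < n ? d_bias_in[idx] : static_cast<U>(0);

  d_scale_partial = BlockReduceSum<U>(d_scale_partial, shared_scale);
  // The added barrier: if the two reductions reused one buffer, a warp that
  // already returned from the first call could start the second and write
  // into slots that warp 0 is still reading, corrupting the sum (the NaN source).
  __syncthreads();
  d_bias_partial = BlockReduceSum<U>(d_bias_partial, shared_bias);

  if (threadIdx.x == 0) {
    d_scale_out[blockIdx.x] = d_scale_partial;
    d_bias_out[blockIdx.x] = d_bias_partial;
  }
}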