fast layer norm has non-deterministic problem #56100
Comments
I will take a look tomorrow afternoon.
compute-sanitizer reports a data race issue.
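For reference, such a race can be confirmed with NVIDIA's racecheck tool, e.g. `compute-sanitizer --tool racecheck python repro_layer_norm.py`. Below is a minimal sketch of such a repro script; it assumes the fast path is reached through `paddle.nn.functional.layer_norm` on GPU, which may not match the actual fused entry point introduced in #55639.

```python
# repro_layer_norm.py -- minimal script to run under
#   compute-sanitizer --tool racecheck python repro_layer_norm.py
# Assumption: paddle.nn.functional.layer_norm dispatches to the fast
# layer-norm kernel on GPU; the real fused entry point may differ.
import paddle
import paddle.nn.functional as F

paddle.set_device("gpu")
paddle.seed(1234)

x = paddle.randn([8, 1024, 2048])   # shape chosen arbitrarily
w = paddle.ones([2048])
b = paddle.zeros([2048])

y = F.layer_norm(x, [2048], weight=w, bias=b)
print(float(y.sum()))               # force kernel execution and sync
```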
This issue doesn't happen on V100 but does on A100, which is why CI didn't catch it (CI only has V100s).
This issue also happens in my environment (V100 & CUDA 11.2). CI uses V100 & CUDA 12, but I haven't tested it in the CI environment.
I suggest we disable fast_layer_norm temporarily and re-enable it after a fix.
@onecatcn, @zhaoyinglia ^^^
Describe the Bug
Hi @jeng1220 , we found GPT3-1.3B's loss is non-deterministic after #55639.
To reproduce, run this code in PaddleNLP twice and compare the loss values.
There is also a unit test that can reproduce the wrong result, but the failure frequency is low.
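Since race-induced mismatches surface only occasionally, a loop-based bitwise comparison is one way to raise the hit rate. The sketch below is an assumption-laden stand-in for the actual unit test: it again goes through `paddle.nn.functional.layer_norm` and counts runs whose output differs bitwise from a reference run.

```python
# Hedged determinism check: repeat the op on a fixed input and count
# bitwise mismatches against a reference output. The entry point and
# shapes are assumptions, not the actual failing unit test.
import numpy as np
import paddle
import paddle.nn.functional as F

paddle.set_device("gpu")
paddle.seed(1234)

x = paddle.randn([8, 1024, 2048])
w = paddle.ones([2048])
b = paddle.zeros([2048])

ref = F.layer_norm(x, [2048], weight=w, bias=b).numpy()
mismatches = 0
for _ in range(1000):                    # races flip bits only rarely
    out = F.layer_norm(x, [2048], weight=w, bias=b).numpy()
    if not np.array_equal(ref, out):     # exact, bitwise comparison
        mismatches += 1
print(f"{mismatches} / 1000 runs differ from the reference output")
```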
Additional Supplementary Information
No response