
GPU results are non-deterministic #11057

Closed
daming-lu opened this issue May 30, 2018 · 9 comments


daming-lu commented May 30, 2018

When we stabilize all the randomness and run the same training twice on GPU (the same GPU device), the results differ by a tiny amount. This does NOT happen on CPU.

See the attachments. The demo code is in this PR

vimdiff w2v_nocuda_t1.txt w2v_nocuda_t2.txt  // no difference

vimdiff w2v_t1_cuda.txt w2v_t2_cuda.txt  // has some tiny difference

w2v_nocuda_t1.txt
w2v_nocuda_t2.txt
w2v_t1_cuda.txt
w2v_t2_cuda.txt
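For reference, a minimal sketch of what "stabilizing all the randomness" can mean in practice. The `fix_seeds` helper and the seed value are illustrative, not from the original report; a real training script would also need to seed the framework's own generators:

```python
import random
import numpy as np

def fix_seeds(seed=42):
    """Seed every host-side RNG the run touches (illustrative helper)."""
    random.seed(seed)
    np.random.seed(seed)

fix_seeds()
first = np.random.rand(3)
fix_seeds()
second = np.random.rand(3)
# With identical seeds, host-side draws match exactly.
print(np.array_equal(first, second))  # True
```

Even with every seed fixed like this, the GPU runs still diverge, which is what points the finger at the kernels rather than the RNG setup.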


sidgoyal78 commented May 30, 2018

I think there is probably some issue with the "embedding" layer (and/or the "lookup_table_op"). Even this sentiment analysis code (https://github.com/sidgoyal78/Paddle/blob/a801e7bcb2f4f1e131f6a640ecd84a03d21588ff/test_sa_conv.py) produces inconsistent results when run with the same seed on GPU.

But if the same code is run on CPU, then it yields consistent results.

@wangkuiyi wangkuiyi added the Bug label May 30, 2018

daming-lu commented May 30, 2018

#10405 (related issue)


chengduoZH commented May 31, 2018

@daming-lu @sidgoyal78
We have noticed this phenomenon, and we have found that the results of some operations on GPU are non-deterministic, such as cross_entropy and some cuDNN operations.
Other frameworks have the same issue, e.g. TensorFlow (tensorflow/tensorflow#2732) and PyTorch (soumith/cudnn.torch#270). I have seen the same question on the NVIDIA forums too.

dzhwinter (Contributor) commented:

This bug has been located. If a kernel uses CudaAtomicAdd, the accumulation order varies between runs, and because floating-point addition is not associative, we get non-deterministic results.
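The effect described above can be reproduced on the CPU in pure Python: floating-point addition is not associative, so summing the same values in a different order (which is exactly what concurrent atomicAdd calls do on the GPU) can change the result. The values below are illustrative:

```python
# Floating-point addition is not associative:
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False

# The same multiset of values summed in two different orders, as racing
# atomicAdd calls might accumulate them, yields two different results:
print(sum([1e16, 1.0, -1e16]))  # 0.0  (the 1.0 is absorbed by 1e16)
print(sum([1e16, -1e16, 1.0]))  # 1.0
```

On GPU the order in which threads win the atomic update is scheduling-dependent, so each run can effectively sum in a different order.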

dzhwinter (Contributor) commented:

That is also confirmed by our experiments in the benchmark-ops work: #10646

dzhwinter (Contributor) commented:

Here is Siddharth's reproduction PR: #11133

@emailweixu emailweixu changed the title GPU results are not consistent GPU results are non-deterministic Jun 7, 2018
daming-lu (Contributor, Author) commented:

@dzhwinter @chengduoZH : Thanks for the updates! One question: do we have a plan to fix it? As you know, Baidu is a major contributor to MLPerf and we want to get performance metrics for our own PaddlePaddle framework 😀

shanyi15 (Collaborator) commented:

Hello, this issue has not been updated in the past month, so we will close it today for the sake of other users' experience. If you still need to follow up after it is closed, please feel free to reopen it and we will get back to you within 24 hours. We apologize for any inconvenience caused by the closure, and thank you for your support of PaddlePaddle!

@lucywsq lucywsq reopened this Jan 4, 2019
paddle-bot-old commented:

Since there has been no reply for more than a year, we are closing this issue/PR.
If the problem is not solved or a follow-up question arises, please reopen it at any time and we will continue to follow up.
