GPU vs CPU: NAN #3

Open
mostafa-saad opened this issue Jul 23, 2015 · 4 comments

Comments

@mostafa-saad

Hi,

Sometimes, when I have a network and its data, running on CPU works well, but running on GPU gives NaN in the softmax outputs and bad accuracy. This happens whether the network starts from previously computed weights or from random initialization.

Note that the same network may later work normally. When this problem happens, it happens in all of my networks that use the LSTM layer; if I remove that layer, Caffe works fine on the GPU.

Is it possible that the library has a bug in the GPU code? Any help would be appreciated.

@nakosung

"NaN Explosion" can be triggered from single NaN (it is contagious). Recurrent network may accelerate instability, so reducing some hyper parameters(including learning rate, momentum and initial weight distribution) by order of 10^-3 might stabilize the feedback.

@mostafa-saad
Author

But if I am loading a pretrained model, and it sometimes works on GPU and occasionally doesn't, is your comment still valid? Meanwhile, it always works on CPU.

@nakosung

Have you fixed the random seed? AFAIK the results are not guaranteed to be the same across architectures (CPU/GPU).

http://stackoverflow.com/questions/13937328/division-of-floating-point-numbers-on-gpu-different-from-that-on-cpu
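For what it's worth, Caffe's solver has a random_seed field that makes repeated runs on the same device reproducible; a minimal hedged sketch (the field is standard Caffe, the value is arbitrary):

# solver.prototxt -- fix the RNG seed so repeated runs on the same device match
random_seed: 1701
solver_mode: GPU   # note: CPU and GPU results can still differ from each other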

@junhyukoh
Owner

  1. Have you tested the LSTM layer? You can do it as follows:
make test
./build/test/test_lstm_layer.testbin

It should pass all the tests. (This takes a long time.)

  2. If it doesn't pass:
    2-1) Make sure that you are using the up-to-date version without any modifications.
    2-2) Try "make clean" and "make".

  3. Try a smaller learning rate and see if the loss decreases as it does on the CPU.
    If that works, your problem is probably the numerical issue @nakosung mentioned. You can set the clipping_threshold parameter to prevent it (see the sketch after this list).
    If it doesn't work and you have already checked 1) and 2), there might be an unexpected issue in my code.
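For reference, clipping_threshold is set inside the LSTM layer's parameters in the net prototxt. A minimal hedged sketch follows; the layer name, blob names, and numeric values are placeholders, and the exact layer type string and filler settings should be checked against the examples shipped with this repo.

layer {
  name: "lstm1"               # placeholder layer name
  type: "Lstm"                # check the exact type string used in this fork
  bottom: "data"
  bottom: "clip"              # sequence-continuation indicator blob
  top: "lstm1"
  lstm_param {
    num_output: 128           # placeholder hidden size
    clipping_threshold: 0.1   # clip the cell gradient magnitude to damp NaN blow-ups
    weight_filler { type: "uniform" min: -0.01 max: 0.01 }
    bias_filler { type: "constant" }
  }
}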

I have been using this implementation extensively in the video domain.
I haven't found any problems in training/testing LSTM models on GPUs (GTX 6XX, TITAN, K40c).
Nevertheless, there might be an issue in my implementation,
so it would be very helpful if you could give more information about this issue.
