GPU vs CPU: NAN #3

Open
mostafa-saad opened this issue Jul 23, 2015 · 4 comments

Comments

@mostafa-saad

Hi,

Sometimes, when I have a network and its data, running on CPU works well, but running on GPU gives NaN in the softmax outputs and bad accuracy. This happens whether the network starts from previously computed weights or from random initialization.

Note that the same network may later work normally. When this problem happens, it happens in all of my networks that use the LSTM layer; if I remove that layer, Caffe works fine on the GPU.

Is it possible that the library has a bug in the GPU code? Any help would be appreciated.

@nakosung

"NaN Explosion" can be triggered from single NaN (it is contagious). Recurrent network may accelerate instability, so reducing some hyper parameters(including learning rate, momentum and initial weight distribution) by order of 10^-3 might stabilize the feedback.

@mostafa-saad
Author

But if I am loading a pretrained model, and it sometimes works on GPU and occasionally doesn't, is your comment still valid? Meanwhile, it always works on CPU.

@nakosung

Have you fixed the random seed? AFAIK the results are not guaranteed to be the same across architectures (CPU/GPU).

http://stackoverflow.com/questions/13937328/division-of-floating-point-numbers-on-gpu-different-from-that-on-cpu
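For what it's worth, Caffe's solver has a random_seed field that makes repeated runs on the same device reproducible; a minimal hedged sketch (the field is standard Caffe, the value is arbitrary):

# solver.prototxt -- fix the RNG seed so repeated runs on the same device match
random_seed: 1701
solver_mode: GPU   # note: CPU and GPU results can still differ from each other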

@junhyukoh
Owner

  1. Have you tested the LSTM layer? You can do it as follows:
make test
./build/test/test_lstm_layer.testbin

It should pass all the tests. (This takes a long time.)

  2. If it doesn't pass:
    2-1) Make sure that you are using the up-to-date version without any modifications.
    2-2) Try "make clean" and "make".

  3. Try a smaller learning rate and see if the loss decreases as it does on the CPU.
    If that works, your problem is probably the numerical issue @nakosung mentioned. You can set the clipping_threshold parameter to prevent it (see the sketch after this list).
    If it doesn't work and you have already checked 1) and 2), there might be an unexpected issue in my code.
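For reference, clipping_threshold is set inside the LSTM layer's parameters in the net prototxt. A minimal hedged sketch follows; the layer name, blob names, and numeric values are placeholders, and the exact layer type string and filler settings should be checked against the examples shipped with this repo.

layer {
  name: "lstm1"               # placeholder layer name
  type: "Lstm"                # check the exact type string used in this fork
  bottom: "data"
  bottom: "clip"              # sequence-continuation indicator blob
  top: "lstm1"
  lstm_param {
    num_output: 128           # placeholder hidden size
    clipping_threshold: 0.1   # clip the cell gradient magnitude to damp NaN blow-ups
    weight_filler { type: "uniform" min: -0.01 max: 0.01 }
    bias_filler { type: "constant" }
  }
}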

I have been using this implementation extensively in the video domain.
I haven't found any problems in training/testing LSTM models on GPUs (GTX 6XX, TITAN, K40c).
Nevertheless, there might be an issue in my implementation,
so it would be very helpful if you could give more information about this issue.
