
Crash due to Boost RNG and persisting random number generator state #335

Closed
sguada opened this issue Apr 17, 2014 · 3 comments

@sguada
Contributor

sguada commented Apr 17, 2014

I'm getting random core dumps during filler initialization after updating dev, building with ATLAS BLAS.

I think this issue is related to #297, but I'm not sure, since I only started experiencing it after updating dev. I know it is related to the fillers and random number generation, but I can't tell where, since I just get a core dump with no further information.

@shelhamer is there a way to capture exceptions and print more informative messages?

Initially I was able to get around it by switching to MKL, but after ed38827 I get the same error even with the MKL flag.

@jeffdonahue
Contributor

I've also seen this, but (of the included caffe examples) only when attempting to train the imagenet model -- lenet and cifar_full always seem to work, while imagenet fails after trying to initialize conv1 maybe 25-50% of the time and works perfectly the rest of the time. (I just ran it 10 times, and it failed in 3 of the trials.) It's very strange: both cifar_full and imagenet use the GaussianFiller, yet only imagenet ever seems to fail. And it makes no sense to me that this could happen only some of the time...

is there a way to capture exceptions and print more informative messages?

As far as I know, you pretty much have to use GDB (or some other debugger) to debug segfaults. GDB shows that the segfault is indeed occurring at a line where I save the state of the RNG -- #5 0x000000000047964d in caffe::Caffe::set_generator (other_rng=0x7fffffffc700) at src/caffe/common.cpp:67
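
For reference, the failed assertion points at dereferencing an empty boost::shared_ptr. Here is a minimal standalone sketch (my own reconstruction, not actual Caffe code) that trips the same `px != 0` assertion:

#include <boost/shared_ptr.hpp>
#include <boost/random/mersenne_twister.hpp>

struct RNG {
  boost::mt19937 engine;
};

int main() {
  boost::shared_ptr<RNG> rng;  // default-constructed: owns nothing, px == 0
  // Dereferencing before anything is assigned aborts (in non-NDEBUG builds)
  // with the same message as above: Assertion `px != 0' failed.
  rng->engine.seed(1701);
  return 0;
}

So it looks like set_generator is reached while the RNG shared_ptr has never been initialized.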

Here is the full output and backtrace from GDB:

[~/caffe/examples/imagenet]$ GLOG_logtostderr=true gdb ../../build/tools/train_net.bin
GNU gdb (GDB) 7.7
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ../../build/tools/train_net.bin...done.
(gdb) run imagenet_solver.prototxt
Starting program: /home/jdonahue/dev/caffe-clean/build/tools/train_net.bin imagenet_solver.prototxt
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
I0417 16:42:07.785027 15139 train_net.cpp:26] Starting Optimization
I0417 16:42:07.785121 15139 solver.cpp:41] Creating training net.
I0417 16:42:07.785768 15139 net.cpp:64] Memory required for Data0
I0417 16:42:07.785820 15139 net.cpp:75] Creating Layer data
I0417 16:42:07.785837 15139 net.cpp:111] data -> data
I0417 16:42:07.785858 15139 net.cpp:111] data -> label
I0417 16:42:07.785904 15139 data_layer.cpp:149] Opening leveldb ilsvrc12_train_leveldb
I0417 16:42:07.876922 15139 data_layer.cpp:189] output data size: 256,3,227,227
I0417 16:42:07.876988 15139 data_layer.cpp:208] Loading mean file from../../data/ilsvrc12/imagenet_mean.binaryproto
I0417 16:42:07.939506 15139 data_layer.cpp:229] Initializing prefetch
[New Thread 0x7fffe4199700 (LWP 15143)]
I0417 16:42:07.941756 15139 data_layer.cpp:232] Prefetch initialized.
I0417 16:42:07.941812 15139 net.cpp:126] Top shape: 256 3 227 227 (39574272)
I0417 16:42:07.941854 15139 net.cpp:126] Top shape: 256 1 1 1 (256)
I0417 16:42:07.941862 15139 net.cpp:134] Memory  required for Data 158298112
I0417 16:42:07.941874 15139 net.cpp:157] data does not need backward computation.
I0417 16:42:07.941897 15139 net.cpp:75] Creating Layer conv1
I0417 16:42:07.941907 15139 net.cpp:85] conv1 <- data
I0417 16:42:07.941926 15139 net.cpp:111] conv1 -> conv1
[New Thread 0x7fffe2db1700 (LWP 15144)]
train_net.bin: /usr/local/include/boost/smart_ptr/shared_ptr.hpp:653: typename boost::detail::sp_member_access<T>::type boost::shared_ptr<T>::operator->() const [with
T = caffe::Caffe::RNG, typename boost::detail::sp_member_access<T>::type = caffe::Caffe::RNG*]: Assertion `px != 0' failed.

Program received signal SIGABRT, Aborted.
0x00007fffef7a4425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffef7a4425 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffef7a7b8b in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fffef79d0ee in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fffef79d192 in __assert_fail () from /lib/x86_64-linux-gnu/libc.so.6
#4  0x000000000047a7f7 in boost::shared_ptr<caffe::Caffe::RNG>::operator-> (this=0x7fffdc0008d0) at /usr/local/include/boost/smart_ptr/shared_ptr.hpp:653
#5  0x000000000047964d in caffe::Caffe::set_generator (other_rng=0x7fffffffc700) at src/caffe/common.cpp:67
#6  0x0000000000483bb2 in caffe::caffe_set_rng (other=...) at ./include/caffe/util/rng.hpp:18
#7  0x0000000000486b4f in caffe::caffe_rng_gaussian<float> (n=34848, a=0, sigma=0.00999999978, r=0xa38e10) at src/caffe/util/math_functions.cpp:351
#8  0x00000000004cc9c2 in caffe::GaussianFiller<float>::Fill (this=0x950130, blob=0x950090) at ./include/caffe/filler.hpp:71
#9  0x00000000004d30ca in caffe::ConvolutionLayer<float>::SetUp (this=0x94fcd0, bottom=..., top=0x8fcc88) at src/caffe/layers/conv_layer.cpp:59
#10 0x000000000046498f in caffe::Net<float>::Init (this=0x8fe410, in_param=...) at src/caffe/net.cpp:124
#11 0x00000000004635b1 in caffe::Net<float>::Net (this=0x8fe410, param_file=...) at src/caffe/net.cpp:31
#12 0x0000000000455305 in caffe::Solver<float>::Init (this=0x7fffffffda70, param=...) at src/caffe/solver.cpp:42
#13 0x0000000000455096 in caffe::Solver<float>::Solver (this=0x7fffffffda70, param=...) at src/caffe/solver.cpp:23
#14 0x0000000000415622 in caffe::SGDSolver<float>::SGDSolver (this=0x7fffffffda70, param=...) at ./include/caffe/solver.hpp:57
#15 0x00000000004151c6 in main (argc=2, argv=0x7fffffffdce8) at tools/train_net.cpp:27

Please do comment (or PR if you like!) if anyone thinks they know what might be happening here, or knows the proper way to save the state of the Boost RNG and pass it around to generate numbers from different distributions (e.g. Gaussian, uniform, Bernoulli).
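
For anyone picking this up: one way to persist the engine state is sketched below, under the assumption that a single boost::mt19937 backs all the distributions (illustrative only, not necessarily the fix that should land). The key points are that mt19937 serializes through its stream operators, and that distributions keep their own cached state, which should be reset after a restore:

#include <sstream>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/uniform_real.hpp>
#include <boost/random/bernoulli_distribution.hpp>
#include <boost/random/variate_generator.hpp>

int main() {
  boost::mt19937 engine(1701);

  // One engine, taken by reference, can back several distributions;
  // every draw from any of them advances the same shared state.
  boost::variate_generator<boost::mt19937&, boost::normal_distribution<float> >
      gaussian(engine, boost::normal_distribution<float>(0.0f, 0.01f));
  boost::variate_generator<boost::mt19937&, boost::uniform_real<float> >
      uniform(engine, boost::uniform_real<float>(0.0f, 1.0f));
  boost::variate_generator<boost::mt19937&, boost::bernoulli_distribution<> >
      bernoulli(engine, boost::bernoulli_distribution<>(0.5));

  float g = gaussian();
  float u = uniform();
  bool  b = bernoulli();
  (void)g; (void)u; (void)b;

  // The engine state is serializable through stream operators, so it can
  // be saved and restored by value instead of holding a shared_ptr to the
  // live generator.
  std::stringstream state;
  state << engine;                            // save
  boost::mt19937::result_type r1 = engine();
  state >> engine;                            // restore
  boost::mt19937::result_type r2 = engine();  // r2 == r1

  // Caveat: normal_distribution may cache a value internally, so reset the
  // distributions after restoring the engine to guarantee exact replay.
  gaussian.distribution().reset();

  return r1 == r2 ? 0 : 1;
}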

shelhamer changed the title from "Random core dumps during fillers initialization of conv1" to "Crash due to Boost RNG and persisting random number generator state" on Apr 18, 2014
shelhamer added the bug label on Apr 18, 2014
@shelhamer
Member

To confirm, this is related to #297 and not to compiling with or without MKL. No matter the BLAS choice in the ed38827 Makefile.config, this is an issue, as the Boost RNG code provides the CPU RNG support throughout Caffe.

(Except for where rand() is called, which should all be replaced at some point...)
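
Replacing a rand() call site with the Boost engine could look like the minimal sketch below (the helper name is made up for illustration; it is not Caffe's actual API):

#include <cstdlib>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/uniform_int.hpp>
#include <boost/random/variate_generator.hpp>

// Hypothetical helper: `rand() % n` is irreproducible across runs and
// slightly biased, while drawing through uniform_int on a seeded engine
// is deterministic given the seed and unbiased.
int rand_int(boost::mt19937& engine, int n) {
  boost::variate_generator<boost::mt19937&, boost::uniform_int<int> >
      gen(engine, boost::uniform_int<int>(0, n - 1));
  return gen();
}

int main() {
  boost::mt19937 engine(1701);
  int before = std::rand() % 10;     // old call site
  int after  = rand_int(engine, 10); // replacement
  return (before >= 0 && after >= 0) ? 0 : 1;
}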

@shelhamer
Member

Fixed in #348. RNG aesthetically improved in #336.
