
Training Caffe network with my own data: loss does not decrease #4611

Closed
szm-R opened this issue Aug 21, 2016 · 1 comment
szm-R commented Aug 21, 2016

Hi everyone,
I have been working with Caffe for a while, and until about a week ago I had only used the available pre-trained networks (namely bvlc_reference_caffenet and bvlc_googlenet). I have also fine-tuned these two networks on my own data (which consists of four classes with about 1200 images per class), and that works fine.

About a week ago I decided to train a whole new network on my own data. Since the pre-trained ones were designed and optimized for the ImageNet dataset with 1000 classes, I figured a simpler network should also be able to solve my problem. So I wrote a train_val.prototxt based on GoogLeNet but much simpler (with only one inception block and one fully connected layer at the end; here is the link to my GoogLeNet-based train_val). This net has been training for a few days and seems to be working: the loss is decreasing, and the iteration-4000 snapshot gives reasonable accuracy on my dataset.
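For context, a single inception-style block in Caffe prototxt looks roughly like the sketch below. This is only a minimal illustration, not my actual file: the layer names, the input blob "pool2", and the num_output values are placeholders, and the ReLU layers are omitted for brevity.

```
# three parallel branches over the same input blob
layer {
  name: "incep_1x1" type: "Convolution" bottom: "pool2" top: "incep_1x1"
  convolution_param { num_output: 64 kernel_size: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_3x3_reduce" type: "Convolution" bottom: "pool2" top: "incep_3x3_reduce"
  convolution_param { num_output: 48 kernel_size: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_3x3" type: "Convolution" bottom: "incep_3x3_reduce" top: "incep_3x3"
  convolution_param { num_output: 96 kernel_size: 3 pad: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_pool" type: "Pooling" bottom: "pool2" top: "incep_pool"
  pooling_param { pool: MAX kernel_size: 3 stride: 1 pad: 1 }
}
layer {
  name: "incep_pool_proj" type: "Convolution" bottom: "incep_pool" top: "incep_pool_proj"
  convolution_param { num_output: 32 kernel_size: 1 weight_filler { type: "xavier" } }
}
# concatenate the branches along the channel axis
layer {
  name: "incep_concat" type: "Concat"
  bottom: "incep_1x1" bottom: "incep_3x3" bottom: "incep_pool_proj"
  top: "incep_concat"
}
```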

I also tried a simpler network based on CaffeNet (with only one fully connected layer at the end and everything else the same as the train_val.prototxt provided in the bvlc_reference_caffenet directory; here are the links to my train_val and my solver). This network doesn't seem to be learning at all: the loss is not decreasing (it started at about 14 and then got stuck around 1.38). I also tested the iteration-20000 snapshot, but it only detects one class, with the same probability on all test images (all test instances are classified as that one class).
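One detail that might be telling: with four classes, a softmax loss stuck around 1.38 is suspiciously close to the loss of a network that predicts a uniform 25% for every class, since

-ln(1/4) = ln(4) ≈ 1.386

which would be consistent with the snapshot assigning every test image to the same class with the same probability.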

I'm training both networks on the same data, which I have converted to LMDB using the following command:
GLOG_logtostderr=1 ./build/tools/convert_imageset --resize_height=256 --resize_width=256 --shuffle Set2/Images/ Set2/TrainingSet.txt train_lmdb
I believe this also shuffles the data (because of --shuffle), so that shouldn't be the problem.
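For reference, convert_imageset expects the listing file (Set2/TrainingSet.txt here) to contain one image per line: a path relative to the root folder (Set2/Images/) followed by an integer class label starting from 0. The folder and file names below are made up just to show the format:

```
class0/img_0001.jpg 0
class0/img_0002.jpg 0
class1/img_0001.jpg 1
class2/img_0001.jpg 2
class3/img_0001.jpg 3
```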

I also tested "xavier" initialization (the same as used in GoogLeNet), but the only difference was that the single detected class changed. Here I'm using the same solver as in the previous attempt but with a lower base_lr (0.001, since 0.01 resulted in nan loss).
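Concretely, switching to "xavier" just means changing the weight_filler of the learnable layers. For the final four-class classifier, that part of a prototxt looks roughly like this (layer and blob names are placeholders, not copied from my actual file):

```
layer {
  name: "fc_final" type: "InnerProduct" bottom: "pool5" top: "fc_final"
  inner_product_param {
    num_output: 4                      # one output per class
    weight_filler { type: "xavier" }   # instead of the default gaussian filler
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "loss" type: "SoftmaxWithLoss" bottom: "fc_final" bottom: "label" top: "loss"
}
```

The only solver change on top of that is base_lr: 0.001.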

Note that I'm training on the CPU (I don't have an appropriate GPU for training and only use my laptop's GPU for testing), so the problem can't be an NVIDIA driver issue or something like that. I have also looked through #401, and #3243 suggests a new initialization which I haven't worked with yet.

Should I perhaps just wait longer, or is there something else I'm not doing right?
Thanks in advance for your help.

szm-R closed this as completed Aug 27, 2016
priyapaul commented

@szm2015 Could you please tell me how you fixed this issue?
