
Training Caffe network with my own data: loss does not decrease #4611

Closed
szm-R opened this issue Aug 21, 2016 · 1 comment
szm-R commented Aug 21, 2016

Hi everyone,
I have been working with Caffe for a while, and until about a week ago I had only used the available pre-trained networks (namely bvlc_reference_caffenet and bvlc_googlenet). I have also fine-tuned these two networks on my own data (which consists of four classes with about 1200 images per class), and that works fine.

About a week ago I decided to train a whole new network on my own data. Since the pre-trained ones were designed and optimized for the ImageNet dataset with 1000 classes, I figured a simpler network should also be able to solve my problem. So I wrote a train_val.prototxt based on GoogLeNet but much simpler (with only one inception block and one fully connected layer at the end; here is the link to my GoogLeNet-based train_val). This net has been training for a few days and seems to be working: the loss is decreasing, and the iteration-4000 snapshot gives reasonable accuracy on my dataset.
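For context, a single inception-style block in Caffe prototxt looks roughly like the sketch below. This is only a minimal illustration, not my actual file: the layer names, the input blob "pool2", and the num_output values are placeholders, and the ReLU layers are omitted for brevity.

```
# three parallel branches over the same input blob
layer {
  name: "incep_1x1" type: "Convolution" bottom: "pool2" top: "incep_1x1"
  convolution_param { num_output: 64 kernel_size: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_3x3_reduce" type: "Convolution" bottom: "pool2" top: "incep_3x3_reduce"
  convolution_param { num_output: 48 kernel_size: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_3x3" type: "Convolution" bottom: "incep_3x3_reduce" top: "incep_3x3"
  convolution_param { num_output: 96 kernel_size: 3 pad: 1 weight_filler { type: "xavier" } }
}
layer {
  name: "incep_pool" type: "Pooling" bottom: "pool2" top: "incep_pool"
  pooling_param { pool: MAX kernel_size: 3 stride: 1 pad: 1 }
}
layer {
  name: "incep_pool_proj" type: "Convolution" bottom: "incep_pool" top: "incep_pool_proj"
  convolution_param { num_output: 32 kernel_size: 1 weight_filler { type: "xavier" } }
}
# concatenate the branches along the channel axis
layer {
  name: "incep_concat" type: "Concat"
  bottom: "incep_1x1" bottom: "incep_3x3" bottom: "incep_pool_proj"
  top: "incep_concat"
}
```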

I also tried a simpler network based on CaffeNet (with only one fully connected layer at the end and everything else the same as the train_val.prototxt provided in the bvlc_reference_caffenet directory; here are the links to my train_val and my solver). This network doesn't seem to be learning at all: the loss is not decreasing (it started at about 14 and then got stuck around 1.38). I also tested the iteration-20000 snapshot, but it only detects one class, with the same probability on all test images (all test instances are classified as that one class).
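One detail that might be telling: with four classes, a softmax loss stuck around 1.38 is suspiciously close to the loss of a network that predicts a uniform 25% for every class, since

-ln(1/4) = ln(4) ≈ 1.386

which would be consistent with the snapshot assigning every test image to the same class with the same probability.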

I'm training both networks on the same data, which I have converted to LMDB using the following command:
GLOG_logtostderr=1 ./build/tools/convert_imageset --resize_height=256 --resize_width=256 --shuffle Set2/Images/ Set2/TrainingSet.txt train_lmdb
I believe this also shuffles the data (because of --shuffle), so that shouldn't be the problem.
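For reference, convert_imageset expects the listing file (Set2/TrainingSet.txt here) to contain one image per line: a path relative to the root folder (Set2/Images/) followed by an integer class label starting from 0. The folder and file names below are made up just to show the format:

```
class0/img_0001.jpg 0
class0/img_0002.jpg 0
class1/img_0001.jpg 1
class2/img_0001.jpg 2
class3/img_0001.jpg 3
```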

I also tested "xavier" initialization (the same as used in GoogLeNet), but the only difference was that the single detected class changed. Here I'm using the same solver as in the previous attempt but with a lower base_lr (0.001, since 0.01 resulted in nan loss).
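Concretely, switching to "xavier" just means changing the weight_filler of the learnable layers. For the final four-class classifier, that part of a prototxt looks roughly like this (layer and blob names are placeholders, not copied from my actual file):

```
layer {
  name: "fc_final" type: "InnerProduct" bottom: "pool5" top: "fc_final"
  inner_product_param {
    num_output: 4                      # one output per class
    weight_filler { type: "xavier" }   # instead of the default gaussian filler
    bias_filler { type: "constant" value: 0 }
  }
}
layer {
  name: "loss" type: "SoftmaxWithLoss" bottom: "fc_final" bottom: "label" top: "loss"
}
```

The only solver change on top of that is base_lr: 0.001.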

Note that I'm training on the CPU (I don't have an appropriate GPU for training and only use my laptop's GPU for testing), so the problem can't be an NVIDIA driver issue or something like that. I have also looked through #401, and #3243 suggests a new initialization which I haven't worked with yet.

Should I perhaps just wait longer, or is there something else I'm not doing right?
Thanks in advance for your help.

szm-R closed this as completed Aug 27, 2016
priyapaul commented

@szm2015 Could you please tell me how you fixed this issue?
