The accuracy of evaluation cannot increase when training imagenet #59

Closed
huangjunshi opened this issue Jan 27, 2014 · 26 comments

@huangjunshi

I have strictly followed Yangqing's instructions on his webpage (http://caffe.berkeleyvision.org/imagenet.html) to train ImageNet (including using the .prototxt files he provides), and I also shuffled the training data, yet the accuracy on the evaluation data stays at 0.001 even now, after 77,920 iterations. Here is the current output:

I0127 08:53:57.624028 37204 solver.cpp:210] Iteration 78000, lr = 0.01
I0127 08:53:57.633610 37204 solver.cpp:68] Iteration 78000, loss = 6.9063
I0127 08:53:57.633633 37204 solver.cpp:90] Testing net
I0127 08:56:01.357560 37204 solver.cpp:117] Test score # 0: 0.001
I0127 08:56:01.357609 37204 solver.cpp:117] Test score # 1: 6.90977
I0127 08:56:33.533275 37204 solver.cpp:210] Iteration 78020, lr = 0.01
I0127 08:56:33.542655 37204 solver.cpp:68] Iteration 78020, loss = 6.90727
I0127 08:57:05.939363 37204 solver.cpp:210] Iteration 78040, lr = 0.01
I0127 08:57:05.948905 37204 solver.cpp:68] Iteration 78040, loss = 6.9073

To track down the cause, I ran the code on MNIST and got essentially the correct result. I have also sampled some images from both the training and evaluation datasets; the labels and images both look fine.

Has anyone run into this before? Any help on how to solve this problem would be appreciated.

My environment is Ubuntu 13.10 with a GTX Titan and CUDA 5.5.

@sguada
Contributor

sguada commented Jan 27, 2014

A test score of 0.001 means the network is guessing at random, i.e. it is not learning anything. Shuffling the training data is important; otherwise each batch may contain only images of the same class.

In general, the test score should rise above 0.01 within roughly the first 5,000 iterations. If the loss doesn't decrease and the test score doesn't increase, that is a sign that your network is not learning.

You should check your prototxt files and your leveldb.
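
A minimal sketch of the kind of one-time shuffle that helps here (assuming the image list lives in a plain "train.txt"; the file names are just an illustration, not the convert_imageset tool itself):

#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

int main() {
  // Read the "<path> <label>" lines of the (hypothetical) training list.
  std::ifstream in("train.txt");
  std::vector<std::string> lines;
  std::string line;
  while (std::getline(in, line)) {
    lines.push_back(line);
  }
  // Shuffle with a fixed seed so the result is reproducible.
  std::mt19937 rng(1234);
  std::shuffle(lines.begin(), lines.end(), rng);
  // Write the shuffled list back out; build the leveldb from this file.
  std::ofstream out("train_shuffled.txt");
  for (size_t i = 0; i < lines.size(); ++i) {
    out << lines[i] << "\n";
  }
  return 0;
}

Feeding the shuffled list into the leveldb conversion step is enough to break up the per-class runs described above; no change to the training loop is needed.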

@kloudkl
Contributor

kloudkl commented Jan 27, 2014

Train on the CIFAR dataset to investigate the effect of shuffling versus not shuffling.

@huangjunshi
Author

Thanks, @sguada and @kloudkl!

Sorry that I did not make the question clear. Yangqing's instructions do not include shuffling. What I mean is that I followed everything in his instructions and, in addition, shuffled the data when constructing the leveldb.

I have already checked the proto files against Alex's paper and am almost certain the configuration is correct. Now I am checking the code for constructing the leveldb, which I think is the only difference between the "mnist demo" and the "imagenet demo".

BTW, could you tell me the size of the ImageNet training data stored in leveldb? In my case it is about 236.2 GB, which seems strange, as the original images total only about 60 GB.

@shelhamer
Member

The convert_imageset utility source discusses shuffling. Perhaps a note should be added to the recipe.

@kloudkl
Contributor

kloudkl commented Jan 27, 2014

@mianba120, your problem may not have been caused by the order of the data. The ImageNet dataset is really huge and not well suited to debugging if you are using all of the images. Successful training can be seen in the comments of #33.
Out of concern about shuffling the data before each training epoch, I looked into the code and found that in caffe/solver.cpp

template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file)  {
  ...
  while (iter_++ < param_.max_iter()) {
  ...
  } // while (iter_++ < param_.max_iter())
} // void Solver<Dtype>::Solve(const char* resume_file)

the iterations of different epochs are not separated. In caffe/proto/caffe.proto, there is no notion of an epoch:

message SolverParameter {
  optional int32 max_iter = 7; // the maximum number of iterations

Therefore, setting max_iter entails computing expected_epochs * iterations_per_epoch (for ImageNet with a batch size of 256, one epoch is ceil(1281167 / 256) = 5005 iterations), which is a little inconvenient and indeed produced an error in the original imagenet.prototxt (again, see the comments of #33). If max_iter % iterations_per_epoch != 0, I am afraid that the last partial epoch, consisting of max_iter % iterations_per_epoch iterations, would bias the training toward part of the dataset.

Although the typo has been fixed in commit b31b316, it suggests that a better design would be to set max_epoch and compute iterations_per_epoch = ceil(data_size / data_size_per_minibatch). Then in caffe/solver.cpp we would have the chance to shuffle the data before each epoch, making the gradients more random and accelerating the optimization process:

template <typename Dtype>
void Solver<Dtype>::Solve(const char* resume_file)  {
  ...
  while (epoch_++ < param_.max_epoch()) {
    PreEpoch(...); // Shuffle data and some other stuff
    for (size_t i = 0; i < iterations_per_epoch; ++i) {
      iter_++;
      ...
    } // for (size_t i = 0; i < iterations_per_epoch; ++i)
    ...
  } // while (epoch_++ < param_.max_epoch())
} // void Solver<Dtype>::Solve(const char* resume_file)

After such a change, it would no longer be necessary to remember to shuffle in every example recipe or in any other application.

@huangjunshi
Author

@kloudkl, thanks so much for your help. I will check the discussion in #33.

As you pointed out, if max_iter % iterations_per_epoch != 0, the last partial epoch of max_iter % iterations_per_epoch iterations could introduce bias into the training.

I think some code should also be refined in examples/convert_imageset.cpp, from line 82:

int main(int argc, char** argv) {
  ... ...
    if (++count % 1000 == 0) {
      db->Write(leveldb::WriteOptions(), batch);
      LOG(ERROR) << "Processed " << count << " files.";
      delete batch;
      batch = new leveldb::WriteBatch();
    }
  }
  delete db;
  return 0;
}

Here, I think it drops training images 1,281,001 to 1,281,167, as I cannot find any place where the last, partial batch containing those images is written.

Regarding your second point, I think it may be better to separate the "epoch loop" and the "batch loop"; however, PreEpoch() may be time-consuming, so I suppose the shuffling should be done offline.

Also, could you tell me the file size of the ImageNet training data stored in leveldb? I need to make sure my training data is correct. :)

@Yangqing
Member

convert_imageset.cpp does indeed contain an error that should be fixed; thanks for the heads-up! Adding one additional db->Write() call should do the trick.
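
A minimal sketch of that fix, reusing the count/batch/db variables from the snippet quoted above (an illustration, not the exact patch that later landed): flush whatever is left in the final partial write batch before closing the database.

// After the loop over all images: the entries added since the last
// multiple of 1000 are still sitting in `batch`, so write them out.
if (count % 1000 != 0) {
  db->Write(leveldb::WriteOptions(), batch);
  LOG(ERROR) << "Processed " << count << " files.";
}
delete batch;
delete db;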

The ImageNet images stored in the leveldb are 256x256, uncompressed, and stored as raw pixels (roughly 1,281,167 x 256 x 256 x 3 bytes, i.e. about 235 GiB), which is why you are seeing a db size larger than that of the original JPEG images.

IMHO, I am not very keen on per-epoch random shuffling, mainly for the following reasons:

(1) Epochs are simply a notion we use to track the progress of training, and are not really enforced; there is no explicit constraint that we train exactly 90 epochs, not one image more or less. Thus it may not be that useful to enforce epoch boundaries.

(2) Per-epoch random shuffling of a leveldb will really hurt speed, since randomly shuffling a leveldb on a single machine involves random hard-disk access, which is very slow (unless one uses an SSD).

(3) Doing a one-time shuffle and then going through all the data sequentially seems to give reasonable speed, and I haven't observed a faster convergence rate with random re-shuffling on smaller benchmark datasets.

Yangqing

@palmforest

I may have exactly the same problem @mianba120 described. I followed the new version of the ImageNet training recipe, with the Caffe package updated on Feb 10, and I also shuffled the data with convert_imageset.
However, the test score remains at 0.001 and the loss keeps increasing slowly.

The current log:
I0212 19:31:26.893280 13373 solver.cpp:207] Iteration 12940, lr = 0.01
I0212 19:31:26.903733 13373 solver.cpp:65] Iteration 12940, loss = 6.91592
I0212 19:32:18.686200 13373 solver.cpp:207] Iteration 12960, lr = 0.01
I0212 19:32:18.696670 13373 solver.cpp:65] Iteration 12960, loss = 6.90673
I0212 19:33:03.411830 13373 solver.cpp:207] Iteration 12980, lr = 0.01
I0212 19:33:03.422310 13373 solver.cpp:65] Iteration 12980, loss = 6.91012
I0212 19:33:47.450816 13373 solver.cpp:207] Iteration 13000, lr = 0.01
I0212 19:33:47.461206 13373 solver.cpp:65] Iteration 13000, loss = 6.90916
I0212 19:33:47.461225 13373 solver.cpp:87] Testing net
I0212 19:36:39.351974 13373 solver.cpp:114] Test score #0: 0.001
I0212 19:36:39.352042 13373 solver.cpp:114] Test score #1: 6.90934
I0212 19:37:30.193382 13373 solver.cpp:207] Iteration 13020, lr = 0.01
I0212 19:37:30.203824 13373 solver.cpp:65] Iteration 13020, loss = 6.90191
I0212 19:38:14.958088 13373 solver.cpp:207] Iteration 13040, lr = 0.01
I0212 19:38:14.968525 13373 solver.cpp:65] Iteration 13040, loss = 6.90839
I0212 19:39:08.426115 13373 solver.cpp:207] Iteration 13060, lr = 0.01
I0212 19:39:08.436560 13373 solver.cpp:65] Iteration 13060, loss = 6.9094
I0212 19:39:50.263488 13373 solver.cpp:207] Iteration 13080, lr = 0.01
I0212 19:39:50.273931 13373 solver.cpp:65] Iteration 13080, loss = 6.90351
I0212 19:40:35.237869 13373 solver.cpp:207] Iteration 13100, lr = 0.01
I0212 19:40:35.248314 13373 solver.cpp:65] Iteration 13100, loss = 6.90753

I am using Ubuntu 12.04 with a K20. The MNIST demo works well with my Caffe setup.

I am wondering what the problem is. For the training and validation images, I simply resized them to 256x256. Has anyone succeeded in training ImageNet with images resized like this? Or should I do the same as "The images are reshaped so that the shorter side has length 256, and the centre 256x256 part is cropped for training", as mentioned at http://decaf.berkeleyvision.org/about ?

@huangjunshi
Author

Hi @palmforest, this is mianba120 (I have changed my username). I fixed this problem by accident, so I cannot give you a definitive answer, but here is what I observed in my successful ImageNet run:

  1. It is OK to directly resize the images to 256x256 without cropping. Shuffling is not that important; our group has tested this with convnet (Alex's implementation).
  2. The original Caffe code (master branch) is also OK. The loss starts dropping to about 6.89 at around 2,000 iterations, and the test accuracy is about 0.002 at 2,000 iterations.
  3. In the failing runs, the mean update of the weights (the fully connected layers may be smaller) is about 10^-7 to 10^-8, whereas 10^-5 to 10^-6 should be normal.

Overall, I think you could try the following:

  1. Use the original Caffe code.
  2. My environment: g++ 4.6, CUDA 5.5, and MKL downloaded directly from Intel's official website. The driver for my GTX Titan is NVIDIA-Linux-x86_64-319.82.run; I suspect the driver version may matter, as my failing case ran on NVIDIA-Linux-x86_64-331.20.run.
  3. Lastly, though I don't know why it finally worked, you could try running the code several times and stopping a run if the loss does not drop below 6.89 by 3,000 iterations. Sometimes you may just be unlucky and get a bad random initialization.

Anyway, if you want, I can send you my ImageNet training log. You could also try my branch... (not an advertisement...)

Good luck!

@palmforest

Hi @huangjunshi, many thanks for sharing your experience. I am using the original Caffe code and have re-run the training 3 times, but the test score remains 0.001 after 5,000 iterations. I am still trying my luck... :)

Could you please share your ImageNet training log? My email address is yuchen.ee@gmail.com

I am also curious about bad/good random initialization. In theory, random initialization should not lead to a problem like this; has anyone else met this problem with Caffe?

@niuchuang

Hi @palmforest, I have met the same problem and I think you must have solved it by now. I would really appreciate it if you could share what you did!

@niuchuang

Hi @huangjunshi, I have met the same problem and I think you must have solved it by now. I would really appreciate it if you could share what you did!

@huangjunshi
Author

Hi @niuchuang, basically the problem just disappeared after several trials without many modifications, and it has not happened again in the last year. The only thing I can remember is changing the initial bias value to 0.7 (or even 0.5) for every layer where it was originally 1. Usually the loss should drop below 6.90 after about 2,000 - 3,000 iterations (batch size 256). Another observation that should be helpful: the mean gradient of the loss w.r.t. the weights/bias in the FC8 layer should be around 10^-5 - 10^-6 (you may have to write this check yourself). If it is less than 10^-6, e.g. 10^-7, training usually reaches a bad solution or even fails to converge.
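
For reference, a minimal sketch of such a check, assuming you have access to the gradient buffer of the fc8 weight blob (in Caffe's C++ API that would be something like blob->cpu_diff() with blob->count() elements); treat this as an illustration rather than a drop-in patch:

#include <cmath>
#include <cstddef>

// Mean absolute value over a gradient buffer, e.g. the fc8 weight diff.
float MeanAbsGradient(const float* diff, size_t count) {
  double sum = 0.0;
  for (size_t i = 0; i < count; ++i) {
    sum += std::fabs(diff[i]);
  }
  return count > 0 ? static_cast<float>(sum / count) : 0.0f;
}

Logging this value after each backward pass should show roughly 10^-5 - 10^-6 for fc8 in a healthy run, and 10^-7 or below when the run is stuck, matching the ranges above.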

@jnhwkim

jnhwkim commented Apr 29, 2015

In my case, the first iteration at which the loss dropped below 6.9 was 4,500, but it drops rapidly after that.
I'm getting a loss of about 3.7 and an accuracy of around 25% at iteration 15,500. If you have some time, be patient and let it run.

@stevenluzheng

Hi Kim,

I met the same problem you describe. Did you finally solve it?

@jnhwkim

jnhwkim commented May 1, 2015

@stevenluzheng Let it run for a few hours. I didn't do anything, but after 2 days it is now getting over 50% accuracy.

@stevenluzheng

Thanks Kim.

Actually, I have run 25K iterations since yesterday and the accuracy still stays at 0.39. I am using my own dataset to train Caffe, and it seems the training fails for some unknown reason.

BTW, do you use your own dataset to train Caffe? If so, how many pictures do you use in the training set and validation set, respectively?

@jnhwkim

jnhwkim commented May 1, 2015

@stevenluzheng I used the ilsvrc12 dataset, which has 1,281,167 training images. I heard that it takes about 6 days to reach a sufficient accuracy.

@stevenluzheng

Oh... I see, you are using the ImageNet dataset to train your Caffe model, which is a dataset of a completely different magnitude. I guess a huge dataset can train the network sufficiently, while a small dataset might not let a deep network converge; I only use 300 pictures for training and 40 pictures for validation...

Have you ever trained and validated Caffe on your own dataset before?

@acpn

acpn commented May 28, 2015

Hi guys, I have the same problem, but in my case loss = 8.5177. I'm trying to use the LFW dataset to train my net; I wrote my own .prototxt file following the architecture in Guosheng Hu's paper. Does anyone have any ideas?

@stevenluzheng

acpn:
Try FaceScrub; LFW is only used for evaluation challenges, not for training. BTW, please use CASIA-WebFace, which yields a workable model; most of us use it.

@acpn

acpn commented May 30, 2015

Hi, thanks stevenluzheng, but I'm trying to reproduce the results of this paper: http://arxiv.org/abs/1504.02351.

@aTnT

aTnT commented May 20, 2016

I had a similar problem to the one described here, but in my case the root cause was that I was using labels (in the filenames and the labels .txt file) starting at 1 instead of 0.
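
A quick sanity check along those lines, assuming a train.txt/val.txt style list with one "<path> <label>" pair per line (a sketch; the file name is hypothetical):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <limits>
#include <string>

int main() {
  std::ifstream in("train.txt");  // hypothetical list file, "<path> <label>" per line
  std::string path;
  int label = 0;
  int min_label = std::numeric_limits<int>::max();
  int max_label = std::numeric_limits<int>::min();
  while (in >> path >> label) {
    min_label = std::min(min_label, label);
    max_label = std::max(max_label, label);
  }
  std::cout << "labels range from " << min_label << " to " << max_label << std::endl;
  // Caffe's softmax loss expects labels starting at 0, so min_label should be 0
  // and max_label should equal num_classes - 1.
  return 0;
}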

@fucevin

fucevin commented Feb 4, 2017

The BN implementation from cuDNN has no accuracy problem, at least for cuDNN 5; I trained ResNet-50 to 75% top-1 accuracy.

@neftaliw

What @huangjunshi said basically did it for me, except that I changed every bias=1 to bias=0.1. But how are we supposed to know this from the very incomplete documentation on Caffe's website? The tutorials are aimed at people who know nothing about Caffe and are just getting into deep learning, yet they leave a lot of things out, and this bias issue is the project's own mistake. Did they even test the tutorials before publishing them?

@Dror370

Dror370 commented Mar 26, 2018

Hi all,
I suggest you check the SoftmaxWithLoss layer and make sure you define it correctly.
If you define it with phase: TRAIN, it does not work properly;
the layer definition should not include any phase ...
