
Added bvlc_googlenet prototxt and weights #1598

Merged: 1 commit into BVLC:dev on Dec 21, 2014

Conversation

@sguada (Contributor) commented Dec 19, 2014

This PR adds GoogLeNet to the set of models provided by BVLC; it includes the prototxt files needed for training and deployment.

This model is a replication of the model described in the GoogLeNet publication. We would like to thank Christian Szegedy for all his help in the replication of the GoogLeNet model.

Differences:

  • not training with the relighting data-augmentation;
  • not training with the scale or aspect-ratio data-augmentation;
  • uses "xavier" to initialize the weights instead of "gaussian";
  • quick_solver.prototxt uses a different learning rate decay policy than the original solver.prototxt, which allows much faster training (60 epochs vs. 250 epochs); a sketch of this solver follows below.
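
For reference, here is a minimal sketch of the quick solver with the "poly" decay policy. The values below are indicative, reconstructed from the description above rather than copied verbatim, so check the quick_solver.prototxt bundled with this PR for the exact settings:

# quick_solver.prototxt (sketch; values indicative, not copied from the bundled file)
net: "models/bvlc_googlenet/train_val.prototxt"
base_lr: 0.01
lr_policy: "poly"     # lr(iter) = base_lr * (1 - iter/max_iter)^power
power: 0.5
max_iter: 2400000     # 60 epochs at batch_size 32 over ~1.28M ImageNet images
momentum: 0.9
weight_decay: 0.0002
snapshot: 40000
solver_mode: GPU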

The bundled model is the iteration 2,400,000 snapshot (60 epochs) trained using quick_solver.prototxt.

This bundled model obtains a top-1 accuracy of 68.7% (31.3% error) and a top-5 accuracy of 88.9% (11.1% error) on the validation set, using just the center crop.
(Averaging the predictions over 10 crops, (4 corners + 1 center) * 2 mirrors, should give slightly higher accuracy.)

Timings for bvlc_googlenet with cuDNN using batch_size:128 on a K40c:

  • Average Forward pass: 562.841 ms.
  • Average Backward pass: 1123.84 ms.
  • Average Forward-Backward: 1688.8 ms.

P.S.: For timing details, see #1317.

@sguada (Contributor, Author) commented Dec 19, 2014

@shelhamer @longjon @jeffdonahue do you know why Travis keeps failing because it cannot download CUDA?

@emasa commented Dec 20, 2014

The publication link points to AlexNet research. Should it be http://arxiv.org/abs/1409.4842?

@sguada (Contributor, Author) commented Dec 20, 2014

@emasa thanks for catching the wrong link

@shelhamer (Member)

@sguada please push the GoogLeNet paper link and merge. Thanks.

(The Travis failure is just an intermittent bandwidth issue that doesn't matter. Feel free to ignore it.)

sguada added a commit that referenced this pull request Dec 21, 2014
Added bvlc_googlenet prototxt and weights
sguada merged commit 59ecb2a into BVLC:dev on Dec 21, 2014
@anshan-XR-ROB

@sguada I ran your implementation on the newest caffe-dev and got an error. Using the AlexNet prototxt from the caffe models directory also triggers this bug. How can it be fixed?

layers {
bottom: "inception_4e/output"
top: "pool4/3x3_s2"
name: "pool4/3x3_s2"
type:
I1221 14:51:40.897672 23978 layer_factory.hpp:78] Creating layer data
F1221 14:51:40.897716 23978 layer_factory.hpp:81] Check failed: registry.count(t
*** Check failure stack trace: ***
@ 0x7ffacbfeea5d (unknown)
@ 0x7ffacbff2c57 (unknown)
@ 0x7ffacbff0ad9 (unknown)
@ 0x7ffacbff0ddd (unknown)
@ 0x4686d8 caffe::GetLayer<>()
@ 0x474e87 caffe::Net<>::Init()
@ 0x47721e caffe::Net<>::Net()
@ 0x4598d7 caffe::Solver<>::InitTrainNet()
@ 0x45ad27 caffe::Solver<>::Init()
@ 0x45b225 caffe::Solver<>::Solver()
@ 0x41b558 caffe::GetSolver<>()
@ 0x417570 train()
@ 0x417186 main
@ 0x347bc1ecdd (unknown)
@ 0x4167c9 (unknown)
Aborted

@sguada (Contributor, Author) commented Dec 21, 2014

@AnshanTJU, to double check I recompiled and tried again and got no errors. So try make clean before recompiling, and run make runtest to make sure your code passes all the tests.

@shelhamer (Member)

make clean should fix this -- the registry count issue is usually seen when switching from pre-registry to layer registry code, as is the case for the current master to dev migration.


@anshan-XR-ROB

@sguada @shelhamer The latest caffe-dev code cannot pass the "make runtest" tests. The log is attached below. The master branch code passes all the tests, but it doesn't support the "poly" learning rate policy.

[ RUN ] NetTest/2.TestBottomNeedBackward
[ OK ] NetTest/2.TestBottomNeedBackward (2 ms)
[ RUN ] NetTest/2.TestReshape
F1221 20:18:28.124177 2767 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1)
*** Check failure stack trace: ***
@ 0x7f96f4b73a5d (unknown)
@ 0x7f96f4b77c57 (unknown)
@ 0x7f96f4b75ad9 (unknown)
@ 0x7f96f4b75ddd (unknown)
@ 0x803dd8 caffe::GetLayer<>()
@ 0x810247 caffe::Net<>::Init()
@ 0x8125de caffe::Net<>::Net()
@ 0x59e116 caffe::NetTest<>::InitNetFromProtoString()
@ 0x59dcd2 caffe::NetTest<>::InitReshapableNet()
@ 0x5ab090 caffe::NetTest_TestReshape_Test<>::TestBody()
@ 0x7b0c6d testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x7a4711 testing::Test::Run()
@ 0x7a47fa testing::TestInfo::Run()
@ 0x7a4a27 testing::TestCase::Run()
@ 0x7a99ff testing::internal::UnitTestImpl::RunAllTests()
@ 0x7a3c20 testing::UnitTest::Run()
@ 0x49a7ff main
@ 0x347bc1ecdd (unknown)
@ 0x49a559 (unknown)
make: *** [runtest] Aborted

@ducha-aiki (Contributor)

Great, thanks!

sguada mentioned this pull request Dec 21, 2014
@yulingzhou

@sguada I trained GoogLeNet with quick_solver.prototxt. After 730,000 iterations the top-1 accuracies are just 25.57, 33.59, 39.39. Is there a problem? What were your results during training?

@sguada (Contributor, Author) commented Dec 23, 2014

@yulingzhou it seems a bit low but it's ok; mine was around 41 top-1 accuracy at that point. With quick_solver.prototxt most of the accuracy is gained near the end.
Getting to 50 top-1 took around 1.5 million iterations, and 60 top-1 around 2 million.

If you want a reasonably good model faster, say 60 top-1, you can lower max_iter to 600000 in the solver (a one-line sketch follows).
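
Because the "poly" policy ties the learning-rate decay to max_iter, lowering max_iter compresses the whole schedule rather than just truncating training. A minimal sketch of the change, assuming the quick solver sketched above:

# quick_solver.prototxt: lr(iter) = base_lr * (1 - iter/max_iter)^power
max_iter: 600000   # was 2400000; the lr now decays to its final value by 600k iterations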

@anshan-XR-ROB

@sguada @shelhamer The Dec. 20 version of caffe-dev had unknown bugs which caused the registry count issue. I downloaded the latest caffe-dev an hour ago and it runs well. Thanks to @sguada for the great work.

@yulingzhou

@sguada It's now at 1,840,000 iterations and the accuracies are just 35.73, 45.3, 51.64, which is much lower than yours. The current lr is 0.0048. I trained GoogLeNet with the code from the bvlc_googlenet branch. What might the problem be?

@sguada (Contributor, Author) commented Dec 29, 2014

@yulingzhou I think it is going ok. As I said, until you get close to max_iter the accuracy should grow slowly but steadily; when you get near the end you should expect a rapid increase in accuracy.
I recommend plotting the log file using tools/extra/parse_log.sh and your favorite plotting program; I usually use gnuplot.

@seanbell commented Jan 4, 2015

@sguada Was this trained by first resizing/warping all training images to 256x256 (and then taking random 224x224 crops)? The dataset preparation details aren't mentioned above or on the ModelZoo page. Also, thanks for releasing the model!

@sguada (Contributor, Author) commented Jan 4, 2015

@seanbell Yes, I used the same pre-processed data as for the caffe_reference model. Using more elaborate data pre-processing, such as different scales and aspect ratios, should lead to better results.

@RazvanRanca

Thanks for this!
Not a big deal, but just noticed the softmaxes are named as:

top: "loss1/loss1"
name: "loss1/loss"

top: "loss2/loss1"
name: "loss1/loss"

top: "loss3/loss3"
name: "loss3/loss3"

Typo?

@RobotiAi

Thank you very much for sharing this!
However, I encountered a problem when trying your solution: in train_val.prototxt there is a field named mean_value, but caffe cannot recognise it during training because it is not defined in the proto file. Could you please tell me how to solve this problem? Thanks!
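
For context, the field in question is the per-channel image mean inside transform_param, which only recent caffe-dev versions know how to parse. A rough sketch of the relevant part of the data layer, in the old "layers" syntax used here (the mean values shown are the usual ImageNet BGR means and are assumed, not copied from the file):

layers {
  name: "data"
  type: DATA
  top: "data"
  top: "label"
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104   # repeated field, one value per BGR channel
    mean_value: 117
    mean_value: 123
  }
}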

@anshan-XR-ROB

You should use the latest version of caffe-dev. @RobotiAi

@RobotiAi

@AnshanTJU Thank you so much!

@npit commented Mar 11, 2015

@sguada
I have been running your GoogLeNet implementation. If I am not mistaken, the loss3 softmax is the final classification layer, is that correct?
Going by that, the network seems not to be learning anything:

Iteration 140000, Testing net (#0)
I0311 15:26:20.679927 4472 solver.cpp:315] Test net output #0: loss1/loss1 = 4.10975 (* 0.3 = 1.23292 loss)
I0311 15:26:20.679980 4472 solver.cpp:315] Test net output #1: loss1/top-1 = 0.18788
I0311 15:26:20.679991 4472 solver.cpp:315] Test net output #2: loss1/top-5 = 0.40446
I0311 15:26:20.680001 4472 solver.cpp:315] Test net output #3: loss2/loss1 = 6.90982 (* 0.3 = 2.07294 loss)
I0311 15:26:20.680008 4472 solver.cpp:315] Test net output #4: loss2/top-1 = 0.001
I0311 15:26:20.680016 4472 solver.cpp:315] Test net output #5: loss2/top-5 = 0.005
I0311 15:26:20.680027 4472 solver.cpp:315] Test net output #6: loss3/loss3 = 6.91083 (* 1 = 6.91083 loss)
I0311 15:26:20.680033 4472 solver.cpp:315] Test net output #7: loss3/top-1 = 0.001
I0311 15:26:20.680042 4472 solver.cpp:315] Test net output #8: loss3/top-5 = 0.005
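
For reference, the "(* 0.3 = ...)" factors in this log come from the loss_weight assigned to the two auxiliary classifiers in train_val.prototxt, while loss3 is the main classifier with weight 1. A rough sketch of one auxiliary loss layer (the top/name strings follow the prototxt quoted earlier in this thread; the bottom names are assumed):

layers {
  bottom: "loss1/classifier"
  bottom: "label"
  top: "loss1/loss1"
  name: "loss1/loss"
  type: SOFTMAX_LOSS
  loss_weight: 0.3   # auxiliary classifier, down-weighted relative to loss3 (weight 1)
}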

@npit commented Apr 22, 2015

I finished training GoogLeNet exactly like @sguada.
As in the post above, something is off with the loss2 and loss3 layers; it's as if only loss1 is being optimized.
Here are the last log entries:

I0329 02:24:37.016726 21069 solver.cpp:248] Iteration 2400000, loss = 9.71667
I0329 02:24:37.016762 21069 solver.cpp:266] Iteration 2400000, Testing net (#0)
I0329 02:29:10.859261 21069 solver.cpp:315] Test net output #0: loss1/loss1 = 1.84373 (* 0.3 = 0.553119 loss)
I0329 02:29:10.859345 21069 solver.cpp:315] Test net output #1: loss1/top-1 = 0.56186
I0329 02:29:10.859355 21069 solver.cpp:315] Test net output #2: loss1/top-5 = 0.805421
I0329 02:29:10.859365 21069 solver.cpp:315] Test net output #3: loss2/loss1 = 6.90962 (* 0.3 = 2.07289 loss)
I0329 02:29:10.859375 21069 solver.cpp:315] Test net output #4: loss2/top-1 = 0.001
I0329 02:29:10.859382 21069 solver.cpp:315] Test net output #5: loss2/top-5 = 0.005
I0329 02:29:10.859391 21069 solver.cpp:315] Test net output #6: loss3/loss3 = 6.90966 (* 1 = 6.90966 loss)
I0329 02:29:10.859400 21069 solver.cpp:315] Test net output #7: loss3/top-1 = 0.001
I0329 02:29:10.859407 21069 solver.cpp:315] Test net output #8: loss3/top-5 = 0.005

And the accuracy vs. iterations graph:
[attached image: test accuracy vs. iterations, parsed from the training log]

To further illustrate the weirdness, I used @sguada's provided .caffemodel file for feature extraction with the C++ tool, and everything went fine.
Using my own trained model, however, the loss2 and loss3 layers output identical junk-like features for every image (the loss1/classifier layer works fine and produces features similar to sguada's model).
Extracting features from various layers to pinpoint where and why this happens, I found that it starts in the 4th inception module.

Specifically, the outputs of the components below become identical for every image:
inception_4b/pool_proj
inception_4b/1x1
inception_4b/3x3_reduce
inception_4b/5x5_reduce

The outputs of these layers are passed on to the next inception module, and along to the loss2 and loss3 classifiers.
By contrast, none of these layers process the signal that is fed to the loss1 classifier, which is why it still produces correct features.

I am attaching an image of the GoogLeNet structure showing the inception module and the layers where this occurs (2.6 MB image):
[attached image: full GoogLeNet structure]

Any idea why this is happening? I'm guessing I should have stopped at the first occurrence of this behaviour.
Again, I used the prototxts and the data as given.

Thanks in advance.

@sguada (Contributor, Author) commented Apr 22, 2015

@npit I'm not sure what went wrong with your training, but the loss2 and loss3 values definitely indicate that the upper layers are not learning anything. A loss around 6.9 (roughly ln 1000, i.e. a uniform guess over the 1000 ImageNet classes) means that the network is guessing randomly. Probably it got a bad initialization and couldn't recover.
If you used the prototxt without modifications and pre-processed the data properly (including resizing and shuffling), you should give it another try. Change max_iter to 100000; if the 3 losses don't start decreasing after 10k-15k iterations, something is wrong.

@npit commented Apr 23, 2015

I will, thanks for your response.
Also, I am guessing you had to edit the log parser to get the loss3 accuracy data from the log?
It seems I plotted the loss1 output, hence the increase in my graph.

@dgolden1 (Contributor)

@npit see #2350 for an updated log parser

@npit commented May 13, 2015

@drdan14 Thanks! Is there by any chance a bash version of the parser?

@dgolden1 (Contributor)

@npit yes, it's sitting right next to the python version: https://github.com/BVLC/caffe/blob/master/tools/extra/parse_log.sh

Please move this sort of question to the Google Group: https://groups.google.com/forum/#!forum/caffe-users

@npit commented May 13, 2015

@drdan14 I meant your updated log parser, not the standard one.

@dgolden1 (Contributor)

Nope, WYSIWYG. But you can run the python version from the command line (type ./parse_log.py -h for help), so I don't know why you'd want the bash version if you prefer the python version's functionality.

@npit commented May 13, 2015

Alright, thanks.

@wangdelp commented Jul 2, 2015

@sguada Hi Sguada, why is there a std field in the "xavier" filler? Isn't the magnitude determined by the number of fan-in and fan-out units? Thank you.

@sguada (Contributor, Author) commented Jul 2, 2015

It is not used by the "xavier" filler; it's left there just in case someone wants to use the "gaussian" filler.
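
For illustration, a filler block like the one below (values hypothetical). Caffe's "xavier" filler derives its scale from the layer's fan-in (by default sampling uniformly in roughly [-sqrt(3/fan_in), +sqrt(3/fan_in)]), so the std field is simply not read:

weight_filler {
  type: "xavier"   # scale comes from fan-in; std below is ignored
  std: 0.01        # only used if type is switched to "gaussian"
}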


@wangdelp commented Jul 3, 2015

@sguada Got it. Thank you.

@npit commented Aug 29, 2015

@sguada May I ask how you chose the poly learning rate policy and the 0.5 power parameter?
Thanks.

@sguada (Contributor, Author) commented Aug 31, 2015

I tried different options and that one seemed to be more consistent and performed better.


@npit commented Sep 1, 2015

Thanks for the swift reply.
Wasn't there a graph where you showed the accuracy progress for the different learning rate policies?
Or was it for different batch sizes?
