
Caffe Timings for GoogleNet, VGG, AlexNet with cuDNN #1317

Closed
sguada opened this issue Oct 17, 2014 · 29 comments

Comments

@sguada
Contributor

sguada commented Oct 17, 2014

As part of my ongoing training of GoogleNet in Caffe (the winning entry of ImageNet-2014), I did some timings, and these are my findings:

[Comparison with Caffe_reference]

  • GoogleNet with cuDNN is (2.8x forward, 3.6x backward) slower than caffe_reference with cuDNN.
  • VGGNet_16Layers without cuDNN is (11.5x forward, 18.7x backward) slower than caffe_reference with cuDNN.
  • VGGNet_19Layers without cuDNN is (13.8x forward, 22x backward) slower than caffe_reference with cuDNN.

These experiments were run on one K40c using batch_size: 128, on a server with 8 GPUs running other tasks.

[Comparison with cuDNN vs without cuDNN]

  • caffe_reference with cuDNN is (1.4x forward, 1.28x backward) faster than without cuDNN
    • Average Forward pass: 200.792 ms.
    • Average Backward pass: 310.973 ms.
    • Average Forward-Backward: 511.953 ms.
  • caffe_reference without cuDNN
    • Average Forward pass: 281.24 ms.
    • Average Backward pass: 398.719 ms.
    • Average Forward-Backward: 680.189 ms.
  • GoogleNet with cuDNN is (1.6x forward, 1.4x backward) faster than without cuDNN
    • Average Forward pass: 562.841 ms.
    • Average Backward pass: 1123.84 ms.
    • Average Forward-Backward: 1688.8 ms.
  • GoogleNet without cuDNN
    • Average Forward pass: 922.007 ms.
    • Average Backward pass: 1533.55 ms.
    • Average Forward-Backward: 2455.89 ms.

For the VGG networks I had to use batch_size: 64 to fit them in memory, so I multiplied the times by 2 to make them comparable to the batch_size: 128 runs.

  • VGG_16Layers with cuDNN is (1.2x forward, 1.12x backward) slower than without cuDNN
    • Average Forward pass: 2772 ms.
    • Average Backward pass: 6546.86 ms.
    • Average Forward-Backward: 9324.94 ms.
  • VGG_16Layers without cuDNN
    • Average Forward pass: 2298.68 ms.
    • Average Backward pass: 5825.2 ms.
    • Average Forward-Backward: 8124.48 ms.
  • VGG_19Layers with cuDNN is (1.22x forward, 1.36x backward) slower than without cuDNN
    • Average Forward pass: 3387.08 ms.
    • Average Backward pass: 7928.3 ms.
    • Average Forward-Backward: 11316.92 ms.
  • VGG_19Layers without cuDNN
    • Average Forward pass: 2769.9 ms.
    • Average Backward pass: 6850.64 ms.
    • Average Forward-Backward: 9623.26 ms.
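The headline ratios in the summary at the top follow directly from these per-network averages. A small sketch, using only numbers copied from the lists above, checks the arithmetic (the slight difference from the quoted 11.5x for VGG_16 is rounding):

```python
# Average forward/backward times in ms, copied from the lists above.
ref_cudnn       = {"fwd": 200.792, "bwd": 310.973}   # caffe_reference with cuDNN
googlenet_cudnn = {"fwd": 562.841, "bwd": 1123.84}   # GoogleNet with cuDNN
vgg16_nocudnn   = {"fwd": 2298.68, "bwd": 5825.2}    # VGG_16Layers without cuDNN
vgg19_nocudnn   = {"fwd": 2769.9,  "bwd": 6850.64}   # VGG_19Layers without cuDNN

def slowdown(net, ref=ref_cudnn):
    """Per-pass slowdown of `net` relative to caffe_reference with cuDNN."""
    return {k: round(net[k] / ref[k], 1) for k in ref}

print(slowdown(googlenet_cudnn))  # {'fwd': 2.8, 'bwd': 3.6}
print(slowdown(vgg16_nocudnn))    # {'fwd': 11.4, 'bwd': 18.7}
print(slowdown(vgg19_nocudnn))    # {'fwd': 13.8, 'bwd': 22.0}
```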
@sguada sguada changed the title Caffe Timings for VGG, GoogleNet, AlexNet with cuDNN Caffe Timings for GoogleNet, VGG, AlexNet with cuDNN Oct 17, 2014
@thatguymike
Contributor

How do we reconcile the numbers at the top with the numbers at the bottom? e.g. 22-36% slower for VGG_19 on a single GPU, but above you say 13.8x forward, 22x backward. In your individual timings you show GoogleNet with cuDNN faster, but at the top slower.

@sguada
Contributor Author

sguada commented Oct 17, 2014

At the top, I'm comparing the timings of the GoogleNet and VGG models against the Caffe_reference model.

That means that GoogleNet with cuDNN is 2.8 times (3.6 times) slower in the forward (backward) pass than Caffe_reference with cuDNN.

But I also added the comparison of GoogleNet with cuDNN vs GoogleNet without cuDNN: with cuDNN it is 1.6 times (1.4 times) faster in the forward (backward) pass than without cuDNN.

This analysis means that cuDNN helps GoogleNet and Caffe_reference but hurts the VGG models.

And Caffe_reference is the fastest (although not the best), GoogleNet is pretty fast (and state of the art), and VGG is pretty slow (but also state of the art).

@thatguymike
Contributor

Now I understand; the wording was a little confusing. ;-) VGG is an expensive network to train along several dimensions, but it has neat attributes. Interesting that GoogleNet isn't "that bad" in training time (and memory footprint).

@sguada
Contributor Author

sguada commented Oct 17, 2014

Yeah, I didn't mean that the VGG models are bad networks; they are great, and we have seen great results using them. They just require too many parameters and a lot of memory, and they are slow to train and test, but the results are good 👍

@amiralush

@sguada, thanks for the comparison. Can you please share the train/test network definitions?
Also, if you could plot the loss/accuracy vs. number of iterations it would be helpful. Have you noticed a different convergence rate or other differences when switching between Caffe and cuDNN?

@mkudelski

@sguada, good job with the comparison! I can confirm that I obtained similar timings when playing with the GoogLeNet architecture (comparing it to the caffe_reference model).

I am curious about the GoogLeNet training procedure: assuming you use a batch size of 128, how long does it take to see any progress in learning (how many iterations)? And if you do see progress, can I also ask about 1) the learning rate used, and 2) the weight initialization? I would be glad to exchange some experience on that...

@amiralush

@mkudelski I've implemented GoogLeNet as well and am getting the same training times as reported by @sguada. The implementation is straightforward, as described in the paper. The weight initialization is "xavier" and that's about it; it works out of the box! You can see progress right away, after a couple of hundred iterations; if you don't, then something is wrong.
I've attached my training log for the first 20K iterations; this is for an imagenet-scale dataset, not ImageNet itself.
[attached image: caffe_test_iters_graph]
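For reference, a convolution layer with the "xavier" weight filler mentioned above would look roughly like this in the 2014-era Caffe prototxt format; the layer name and sizes here are hypothetical, not taken from the actual GoogLeNet definition:

```protobuf
# Hypothetical conv layer with "xavier" initialization (illustrative sizes)
layers {
  name: "conv1"
  type: CONVOLUTION
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 64
    kernel_size: 7
    stride: 2
    weight_filler { type: "xavier" }
    bias_filler { type: "constant" value: 0 }
  }
}
```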

@mkudelski

@amiralush Thanks for the info! BTW, do you also train on a Tesla K40, with a batch size of 128? Or do you use a smaller batch?

@amiralush

Yes, batch size 128, Tesla K40.

P.S. If you're short on memory you can trim some of the inception modules and losses. Convergence is pretty robust in my experiments.

@mkudelski

@amiralush One last question related to the plot: what was the learning-rate value for this particular learning curve (I assume the rate was constant during the first 20,000 iterations)? Thanks again :-)

@sguada
Contributor Author

sguada commented Oct 21, 2014

@amiralush thanks for confirming my timings. It seems that earlier you uploaded a different graph containing plots of other networks. Would you like to talk about them? Are you plotting the train or test loss?

@mkudelski For training GoogleNet I used batch_size: 32, as reported in the paper; I used batch_size: 128 for timing to make the comparison easy.

@okn2020

okn2020 commented Oct 23, 2014

Hi guys, could you post the prototxts you used for GoogleNet (or add them to the examples)? Thank you!

@amiralush

@okn2020 I've made a PR #1367

@okn2020

okn2020 commented Oct 27, 2014

@amiralush Thank you, will try this out!!

@shengen

shengen commented Nov 10, 2014

@sguada Hi sguada, could you please describe how to initialize the first four layers of Net D in the VGG paper? I wonder how we can initialize Net D from Net A, since the number of parameters in each layer of those two nets differs. Thanks a lot.

@futurely

#1169 (comment)

@mavenlin
Contributor

The CONCAT layer costs extra memory.
Putting each convolution result directly into the concatenated memory in a strided manner is fully doable with cuDNN.
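The idea above can be illustrated in NumPy terms (this is a conceptual sketch, not cuDNN code): instead of computing each branch into its own buffer and then copying everything into a CONCAT output, preallocate the concatenated blob and let each branch write into its channel slice, which is just a strided view.

```python
import numpy as np

batch, h, w = 2, 8, 8
branch_channels = [16, 32, 8]   # hypothetical inception-branch channel counts

# Preallocate the concatenated output blob once.
out = np.empty((batch, sum(branch_channels), h, w), dtype=np.float32)

offset = 0
for c in branch_channels:
    view = out[:, offset:offset + c]            # strided view, no extra memory
    view[...] = np.random.rand(batch, c, h, w)  # stand-in for a conv result
    offset += c

# All branches landed in one contiguous blob; no separate concat copy needed.
assert out.shape == (2, 56, 8, 8)
```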

@yulingzhou

@sguada Hi, would you share your prototxt and model for GoogleNet? And did you train the net in the usual Caffe manner? If not, could you share your training method with us?

@sguada
Contributor Author

sguada commented Dec 19, 2014

Take a look at #1598 for my replica of GoogleNet, including the prototxt, solver and model.

@andresromero

Hi Sergio!

I am testing your GoogleNet implementation with a personal dataset and I am running out of memory (I am using a GeForce GTX 760 card with 2048 MB).

I have already tried reducing the batch size (I even tested with batch_size: 1) but it still runs out of memory. I was wondering which Nvidia card you used for your tests, or how I can change my configuration files to run GoogleNet on my card (AlexNet runs flawlessly with batch_size: 96).

I would appreciate your help, thanks!

@ducha-aiki
Contributor

@andresromero
Try turning off testing while training (comment out test_iter and test_interval in the solver).
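Concretely, in the solver prototxt this means commenting out the test-phase parameters, so no second (test) network is instantiated in GPU memory; the values below are placeholders, not the actual GoogLeNet solver settings:

```protobuf
net: "train_val.prototxt"
# test_iter: 1000        # commented out: no test net is allocated
# test_interval: 1000
base_lr: 0.01
max_iter: 450000
snapshot: 10000
snapshot_prefix: "googlenet"
```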

@andresromero

Thanks @ducha-aiki it worked!

@yuLiu24

yuLiu24 commented Feb 27, 2015

Hi @andresromero
I met the same problem as you. For VGGNet I use a Titan (6 GB) card with batch size = 1,
but it still runs out of memory. How can I solve it? (Commenting out test_iter and test_interval in the solver does not work.)

@jmendozais

Hi, I am training the VGG16 model on a K20 (4 GB) card, but it only works for batch sizes <= 10. How can I train the model with larger batch sizes?

@ducha-aiki
Contributor

Turn off testing -> ~2 times less memory consumption.

@yuhan210

Does anyone have any ideas why using cuDNN makes things slower for some networks (e.g., VGG)?

@sjlee0407

Hi, I want to know why the forward pass is faster than the backward pass. If you know why, please tell me. Thank you!

@gurkirt

gurkirt commented Oct 14, 2016

@sjlee0407 check here

What I have observed in my experiments is that the backward pass is faster than the forward pass without cuDNN, and the other way around if you compiled with cuDNN.

I guess it depends on the implementation.

Hope it helps.
Cheers,
Gurkirt
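One common explanation for the asymmetry: for a layer computing Y = W·X, the backward pass must produce both the weight gradient dW = dY·Xᵀ and the input gradient dX = Wᵀ·dY, i.e. roughly two matrix products where the forward pass needs one. A rough sketch for a fully-connected layer (sizes are hypothetical):

```python
import numpy as np

n, d_in, d_out = 128, 4096, 1000            # hypothetical batch and layer sizes
X = np.random.rand(d_in, n).astype(np.float32)
W = np.random.rand(d_out, d_in).astype(np.float32)

Y = W @ X             # forward: one GEMM
dY = np.ones_like(Y)  # pretend upstream gradient
dW = dY @ X.T         # backward part 1: weight gradient
dX = W.T @ dY         # backward part 2: input gradient

# Multiply-accumulate counts: backward does about twice the forward work.
fwd_macs = d_out * d_in * n
bwd_macs = 2 * d_out * d_in * n
assert bwd_macs == 2 * fwd_macs
```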

@raequin

raequin commented Nov 11, 2016

Question: I'd like to train and test your GoogLeNet replica for my application, where I have 512x512 grayscale images that can have one of four possible classifications. Can you point me to what I would need to modify in the prototxt for this situation? As you can see, I am new to this.
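Not an authoritative answer, but the usual change for a different class count is the num_output of the final classifier layer (grayscale input is handled by the data source providing 1-channel images). The layer and blob names below follow common BVLC GoogLeNet naming and may differ in the actual replica's prototxt:

```protobuf
# Final classifier: 4 output classes instead of ImageNet's 1000
layers {
  name: "loss3/classifier"
  type: INNER_PRODUCT
  bottom: "pool5/7x7_s1"
  top: "loss3/classifier"
  inner_product_param {
    num_output: 4
    weight_filler { type: "xavier" }
  }
}
```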
