Benchmark TensorFlow #66

soumith · 2015-11-11T08:25:43Z

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the Imagenet Winners.
When I saw issues with the numbers, memory etc., I emailed @Yangqing to confirm what I'm seeing, and that it is expected.

With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):

- In-place ReLU seems non-existent in practice.
  - Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
- Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:

- Googlenet with batchsize 128 goes Out of Memory. The largest batch-size I could fit is 16 (tried 16, 32, 64, 128).
- VGG with batchsize 64 goes Out of Memory (Edit: VGG memory issue was solved by using the BFC allocator updated by GOOG). ~~The largest batch-size I could fit is 32 (tried 32, 64).~~
- I've also computed Torch7+CuDNN-R2 baselines for these batch-sizes.

AlexNet (One Weird Trick paper) - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| TensorFlow | 326 | 96 | 230 |

Overfeat [fast] - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow | 1084 | 316 | 768 |

OxfordNet [Model-A] - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow | 1840 | 545 | 1295 |

GoogleNet V1 - Input 16x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
|---|---|---|---|
| CuDNN-R2 (Torch) | 564 | 174 | 390 |
| TensorFlow | 590 | 54 | 536 |

Note that at batch size of 16, googlenet with CuDNN-R2 + Torch likely runs into dispatching overhead, so it's an exotic comparison, but not practically very interesting or encouraging.

There you go.

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

soumith · 2015-11-11T08:32:05Z

The benchmark scripts and raw outputs are located here: https://github.com/soumith/convnet-benchmarks/tree/master/tensorflow

scott-gray · 2015-11-11T09:21:29Z

The lack of in place operations is rather surprising. Once you have the full DAG it should be rather easy to apply a liveness algorithm to it to optimize tensor allocations. For an example see this: http://www.diku.dk/hjemmesider/ansatte/torbenm/ICD/Register.pdf (just replace register with tensor).
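
The liveness idea can be sketched roughly as follows. This is a toy illustration only, not TensorFlow's allocator: the op list, the `inplace_ok` flag, and the `plan_buffers` helper are all hypothetical, every tensor is assumed to be the same size, and the sketch covers a forward pass only (training would need to keep activations alive for the backward pass). A tensor is live from the op that produces it until its last consumer, and its buffer can be handed to a later tensor once it is dead, much as a register allocator reuses registers.

```python
def plan_buffers(ops):
    """Assign a buffer id to each tensor, reusing buffers of dead tensors.

    `ops` is a list of (output_name, input_names, inplace_ok) tuples in
    topological order. `inplace_ok` marks ops (e.g. elementwise ReLU) whose
    output may safely overwrite an input that dies at that op.
    Returns (tensor name -> buffer id, number of buffers needed).
    """
    # Liveness: record the last step at which each tensor is read.
    last_use = {}
    for step, (_, inputs, _) in enumerate(ops):
        for name in inputs:
            last_use[name] = step

    free = []        # buffer ids available for reuse
    assignment = {}  # tensor name -> buffer id
    next_id = 0
    for step, (out, inputs, inplace_ok) in enumerate(ops):
        # Buffers of inputs that are read for the last time by this op.
        # Graph inputs (never produced by an op) have no planned buffer.
        dying = [assignment[n] for n in dict.fromkeys(inputs)
                 if last_use[n] == step and n in assignment]
        if inplace_ok:
            free.extend(dying)  # the output may overwrite a dying input
        if free:
            assignment[out] = free.pop()
        else:
            assignment[out] = next_id
            next_id += 1
        if not inplace_ok:
            free.extend(dying)  # reusable only after this op has finished
    return assignment, next_id


# Hypothetical four-op chain: conv -> relu -> conv -> relu.
ops = [
    ("conv1", ["data"], False),
    ("relu1", ["conv1"], True),
    ("conv2", ["relu1"], False),
    ("relu2", ["conv2"], True),
]
assignment, num_buffers = plan_buffers(ops)
print(assignment, num_buffers)
```

In this made-up chain each ReLU ends up sharing the buffer of the convolution that feeds it, so four tensors need two buffers.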

I'm kind of curious if there's any support for automatically compounding operations together or of leveraging kernels that have some compounding built in (like the alpha/beta params of gemm). I'm pretty close to maximizing the amount of compounding that's possible in my benchmark networks. And because I write all my own kernels I can further compound things that aren't possible with closed source libraries like cuDNN. For example, I'm now able to compute the mean along the PQN dimension inside the conv and gemm kernels at no cost. This cuts down the bandwidth required by batch norm in fprop by a third.
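
For readers unfamiliar with the gemm alpha/beta parameters mentioned here: BLAS-style gemm computes C ← alpha·A·B + beta·C, so a scale and an accumulate are folded into the matrix multiply instead of running as separate passes over memory. The pure-Python reference below only illustrates those semantics; `gemm` is a hypothetical helper written for this sketch, not a library call, and it says nothing about how a real GPU kernel is written.

```python
def gemm(alpha, a, b, beta, c):
    """Reference semantics of BLAS-style gemm: C <- alpha * (A @ B) + beta * C.

    `a` is m x k, `b` is k x n, `c` is m x n (lists of lists); `c` is
    updated in place and returned.
    """
    m, k, n = len(a), len(b), len(b[0])
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # The scale and the accumulate are fused into the single write
            # of C, so no temporary holding A @ B is ever materialized.
            c[i][j] = alpha * acc + beta * c[i][j]
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
print(gemm(2.0, a, b, 1.0, c))
```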

Though on the whole I think TensorFlow seems like a great platform to build on. I'd say there's a good chance my kernels will make their way there sooner rather than later. You can find new benchmarks of my latest winograd kernels in the updated paper here: http://arxiv.org/abs/1509.09308

What I'll be working on next is basically going to be taking a lot of what I learned implementing winograd and refreshing all of my conv/pooling/gemm kernels to support very small minibatches at near full utilization. This should have a big impact on the level at which you can scale these networks and the speed at which they converge. Here's a great paper exploring this: http://arxiv.org/abs/1509.04210

yuzcccc · 2015-11-11T12:55:32Z

Hi, I strongly recommend adding mxnet https://github.com/dmlc/mxnet to the comparison, which in my opinion may be the fastest DL library :)

mavenlin · 2015-11-11T13:48:19Z

+1 for benchmarking mxnet, the fastest now.

strongbanker · 2015-11-11T14:30:35Z

+1 for benchmarking mxnet

fvisin · 2015-11-11T15:23:53Z

I would also love to see a comparison with Theano http://deeplearning.net/software/theano/ as it is another widely adopted deep learning library.

nkoumchatzky · 2015-11-11T15:28:18Z

Thanks for benchmarking!

aaronwro · 2015-11-11T15:59:37Z

+1 would love to see tensorflow benchmarked against mxnet, Theano, Autograd for Torch, and Caffe.

vincentvanhoucke · 2015-11-11T16:01:05Z

Thanks @soumith! Yes, our only launch criterion for convnets was 'GoogLeNet within distance from CuDNN[R2]', and we've punted on a lot of performance work, including upgrading CuDNN, until after the initial release. Expect a lot of movement on that front in the coming weeks.

soumith · 2015-11-11T16:26:02Z

@aaronwro @fvisin it's already benchmarked against Torch, Theano, Caffe. Look at the readme on the main page ( https://github.com/soumith/convnet-benchmarks/blob/master/README.md ).
I definitely need to pull my socks up and benchmark MXNet and Chainer.

@vincentvanhoucke thanks for your response. I assumed that you'll fix it over the next weeks / months :)

vincentvanhoucke · 2015-11-11T16:29:43Z

@scott-gray let us know if you need help with compounding / graph rewriting. The graph representation is meant to make these kinds of operations possible, and the common subexpression elimination that TF currently uses is also meant as a demonstration of that. I suspect we might need to do a bit more to provide good APIs to make it easier to bake in compound kernels.

soumith · 2015-11-11T16:33:17Z

there seems to be some misinterpretation by random people in social media that because I work for Facebook, I'm attacking TensorFlow. That seems super weird, because I love the vision of TensorFlow, and there's no competition (one can write a XXX frontend for a TensorFlow backend).

My benchmarks have always been independently run and completely neutral. I've been running them forever now; it's sad that people misinterpret the slightest of things.
cc: @vincentvanhoucke

clementfarabet · 2015-11-11T16:35:18Z

Will defend Soumith on this one: he has indeed been running these benchmarks for quite some time, and with complete neutrality.


fvisin · 2015-11-11T16:35:56Z

@soumith Excellent, thank you!!

vincentvanhoucke · 2015-11-11T16:37:02Z

@soumith no good deed goes unpunished ;) Please don't let this deter you from providing this valuable service to the community!

Yangqing · 2015-11-11T16:37:27Z

@soumith , I am sorry that some people interpreted things that way. I've always appreciated your benchmark, which creates a great atmosphere for us to look at bottlenecks and push forward the field as a whole community. We all owe you a big debt of gratitude.

aaronwro · 2015-11-11T16:37:46Z

@soumith thanks!

jdemouth · 2015-11-11T16:52:02Z

As always, that's super interesting. Thanks for pushing all of us toward more performance.

tqchen · 2015-11-11T17:22:49Z

For memory optimizations, what we have found is that in-place optimization does not matter that much if the allocator is smart enough to do a static allocation before running the graph (as opposed to relying on a dynamic allocator). We have detailed what can be done here:

https://mxnet.readthedocs.org/en/latest/developer-guide/note_memory.html

I assume this applies to computation graph frameworks such as TF, caffe2 and CGT.
@vincentvanhoucke @Yangqing

tqchen · 2015-11-11T17:25:00Z

The general idea is to share memory not only between tensors of the same shape (i.e. in-place), but also between tensors of different shapes and sizes.
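
One way to picture sharing across shapes: plan every allocation before the graph runs, and let a new tensor take any free block that is large enough, not only a block of identical shape. The sketch below is a toy under stated assumptions (tensor lifetimes are known up front, placement is greedy best-fit, and `plan_static` plus all sizes are made up for illustration). It is not MXNet's actual planner, which is described in the linked note.

```python
def plan_static(tensors):
    """Greedy static memory plan computed before execution.

    `tensors` is a list of (name, size_bytes, first_step, last_step) tuples
    sorted by first_step. Returns (name -> block index, list of block sizes).
    """
    blocks = []      # size in bytes of each planned block
    busy_until = []  # last step at which each block is occupied
    placement = {}
    for name, size, first, last in tensors:
        # Candidates: blocks that are free before this tensor is created
        # and large enough to hold it, whatever shape they held before.
        fits = [i for i, block_size in enumerate(blocks)
                if busy_until[i] < first and block_size >= size]
        if fits:
            idx = min(fits, key=lambda i: blocks[i])  # smallest adequate block
        else:
            blocks.append(size)
            busy_until.append(-1)
            idx = len(blocks) - 1
        busy_until[idx] = last
        placement[name] = idx
    return placement, blocks


# Hypothetical lifetimes: "c" (300 bytes) reuses the 400-byte block of "a".
tensors = [
    ("a", 400, 0, 1),
    ("b", 100, 1, 2),
    ("c", 300, 2, 3),
    ("d", 100, 3, 4),
]
placement, blocks = plan_static(tensors)
print(placement, sum(blocks))
```

With these made-up lifetimes the plan needs 500 bytes in two blocks, against 900 bytes if every tensor got its own allocation.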

rajatmonga · 2015-11-11T17:29:47Z

@soumith Thanks for running the benchmarks! As @vincentvanhoucke noted in this thread, our goal was to get an early release out so users can start playing with it and provide feedback on what they care about. We are committed to making TensorFlow fast and are actively working on the performance issues you highlight here.

alexbw · 2015-11-11T17:42:06Z

@soumith You're doing a good deed! Haters gonna hate.

piiswrong · 2015-11-11T18:35:21Z

I'm a little confused by the number. 1300 samples/sec seems too fast even for alexnet on single TitanX. Is this standard training, e.g. io+forward+backward+update, or just forward+backward?
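
For reference, the roughly 1300 samples/sec figure follows from the AlexNet table in the opening post: a batch of 128 images divided by the 96 ms reported for CuDNN-R3 (Torch). The totals in those tables equal the sum of the forward and backward columns, up to rounding. A quick check of the arithmetic (the helper name is only for illustration):

```python
def samples_per_sec(batch_size, total_ms):
    """Convert a per-batch time in milliseconds into samples per second."""
    return batch_size / (total_ms / 1000.0)


# Total times (ms) from the AlexNet table above, batch size 128.
alexnet_ms = {
    "CuDNN-R3 (Torch)": 96,
    "Nervana (Neon)": 101,
    "CuDNN-R2 (Torch)": 231,
    "TensorFlow": 326,
}
for library, ms in alexnet_ms.items():
    print(f"{library}: {samples_per_sec(128, ms):.0f} samples/sec")
```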

kyieldmark · 2015-11-11T18:35:26Z

Nice work.

antinucleon · 2015-11-11T18:44:59Z

@piiswrong I will help @soumith make the benchmark script.

Anyway, we have kept everything open since the beginning. The main purpose is to learn from each other, not to advertise boring numbers.

koraykv · 2015-11-11T18:53:49Z

I will also add my support to Soumith. He has been running these benchmarks for some time with complete transparency and neutrality.

sermanet · 2015-11-11T19:30:11Z

@koraykv +1, thanks Soumith!

soumith · 2015-11-11T20:09:49Z

Someone on reddit suggested that I build tensorflow from source to fix the speed issues. That did not help. It produces the same numbers as the pip version on my alexnet script:

https://gist.github.com/soumith/11acc2f0dbc5212ea372

cgel · 2016-02-16T23:02:49Z

Tf 0.7.0 released!
Looking forward to the updated benchmarks.

MikalaiDrabovich · 2016-02-17T20:37:20Z

👍

ronghanghu · 2016-02-23T21:28:53Z

Great results 👍 👍 👍

Looking forward to the results with cuDNN v4

Madder · 2016-02-23T23:25:46Z

+1


soumith · 2016-02-29T00:28:02Z

As requested, TF 0.7 + CuDNN R4 has been benchmarked. CuDNN R4 + Torch has also been benchmarked as a baseline.

Within the span of Nervana's Neon, Torch + CuDNN4, TensorFlow + CuDNN4 (and Caffe + CuDNN is likely in the same ballpark as torch), TensorFlow ( at commit tensorflow/tensorflow@1d4f00d ) still lags behind the others by 2x to 3x performance on Alexnet, VGG and Googlenet. It is within 1.5x of Overfeat.

soumith · 2016-02-29T00:30:22Z

For full details, see the main README.md: https://github.com/soumith/convnet-benchmarks/blob/master/README.md and the raw logs are located here: 2888b23

soumith · 2016-02-29T00:32:29Z

I have not changed the benchmark scripts in any way, so if the TF benchmark scripts need any change (such as new allocator settings etc.), I welcome the TF folks to let me know.

rajatmonga · 2016-02-29T02:39:12Z

Thanks Soumith@, this isn't quite where we had seen our numbers at, but we will look at the tests again and ping you if we notice something.

Thanks again for running these benchmarks!


soumith · 2016-02-29T02:40:30Z

Thanks Rajat, happy to investigate further. I built TF from source, and configured it with CUDA 7.5 + CuDNN-4, if that helps. The commit is tensorflow/tensorflow@1d4f00d

nryant · 2016-02-29T07:52:55Z

I've had similar numbers using CUDA 7.0, cuDNN v4, and tensorflow/tensorflow@b889710 on a Titan X. Tried fiddling with device placement and the session config, but it made no material difference in the results. @rajatmonga, out of curiosity are you using cuDNN and nvcc internally, or gpucc?

soumith · 2016-03-02T06:23:01Z

@nryant Thanks for the additional data point. I am honestly very nervous whenever I have to deliver any negative news on convnet-benchmarks. fwiw, @spezzer on reddit also confirmed that it was a data layout thing as well https://www.reddit.com/r/MachineLearning/comments/487fmo/convnetbenchmarks_updated_with_numbers_for/d0i7ord .
I'm closing this issue now, as we have benchmarked tensorflow across multiple versions and given it enough time and data. Will of course keep updating it over time as appropriate.
Thanks all.

vrv · 2016-03-02T06:52:26Z

@soumith: I think in this case it's a combination of layout and some Eigen improvements that hadn't made their way upstream -- we're looking at both of these actively. Thanks again for your effort -- we'll let you know when it makes sense to update the numbers (and provide our own for comparison).

thinxer · 2016-03-06T05:08:46Z

A recent commit adds NCHW support for BiasAdd, which results in about 40% speed up.

tensorflow/tensorflow@d6f3ebf
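
For readers unfamiliar with the layout names: NCHW and NHWC give the order in which the batch (N), channel (C), height (H), and width (W) dimensions are laid out in memory. Below is a sketch of the flat offset computation for each, assuming a dense row-major tensor; the function names are illustrative and are not TensorFlow APIs.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Flat index of element (n, c, h, w) in a dense NCHW tensor."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Flat index of the same element in a dense NHWC tensor."""
    return ((n * H + h) * W + w) * C + c


C, H, W = 3, 2, 2
# Moving to the next pixel in a row: stride 1 in NCHW, stride C in NHWC.
print(offset_nchw(0, 0, 0, 1, C, H, W), offset_nhwc(0, 0, 0, 1, C, H, W))
# Moving to the next channel: stride H*W in NCHW, stride 1 in NHWC.
print(offset_nchw(0, 1, 0, 0, C, H, W), offset_nhwc(0, 1, 0, 0, C, H, W))
```

In NCHW all H*W values of one channel are contiguous, so a per-channel bias applies to one contiguous run; in NHWC the C channel values of a single pixel are contiguous instead.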

vrv · 2016-03-06T05:32:23Z

@thinxer: we'll let @soumith know when to update the numbers, but thanks for noticing :)

soumith · 2016-03-06T05:35:18Z

That's really cool, thanks for letting me know. I'm doing a new, complete set of benchmarks for deep learning, not just convnets, and will cover this commit in them.

rajatmonga · 2016-03-06T07:50:21Z

Thanks @soumith! No rush though.

We have most of the pieces together to support NCHW and expect to see more gains once we update the models to use that. Will ping you once that is ready as well. This commit helps quite a bit (was another regression on our part). Of course the layout changes will mostly help convnets and not other kinds of models.


shendiaomo · 2016-03-16T07:43:53Z

How about tensorflow 0.7?

ghost · 2016-03-18T10:08:50Z

Thanks for the benchmark @soumith. Looking forward to the newly updated TensorFlow.


soumith closed this as completed Mar 2, 2016

