
Fast Training of Convolutional Networks through FFTs #143

Closed
kloudkl opened this issue Feb 24, 2014 · 22 comments

Comments

@kloudkl
Contributor

kloudkl commented Feb 24, 2014

@forresti is working on low-memory convolution, but the run time of the convolutional layers is still the biggest bottleneck of the network. Shall we investigate fast convolution methods such as the FFT [1]?

[1] Michael Mathieu, Mikael Henaff, Yann LeCun. Fast Training of Convolutional Networks through FFTs. arXiv:1312.5851 [cs.CV]. 2013.
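
The core idea of [1] is the convolution theorem: a spatial convolution becomes a pointwise product in the frequency domain, so the per-pair cost no longer depends on the kernel size. A sketch of the argument (x is an n×n input map, w a k×k kernel zero-padded to n×n):

```latex
% Convolution theorem: spatial convolution = pointwise product of spectra
x \ast w = \mathcal{F}^{-1}\!\big( \mathcal{F}(x) \odot \mathcal{F}(w) \big)
% Cost per (input map, filter) pair:
%   direct:  O(n^2 k^2)
%   via FFT: O(n^2 \log n) for the transforms + O(n^2) for the product,
%            and each transform is reused across every pairing it appears in.
```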

@forresti
Contributor

I just skimmed the new LeCun FFT paper. Seems reasonable -- we should include FFT (perhaps cuFFT) in the list of things to try for accelerating Caffe's convolution layers. In my experience, FFT can require ~2x extra memory space compared to direct convolution, but still probably less memory space than convolution using BLAS.

Krizhevsky's cuda-convnet is one of the baselines for LeCun's FFT paper. I haven't looked at the Krizhevsky code in a while... do we know how cuda-convnet's convolution compares to Caffe in terms of speed and memory usage?

@forresti
Contributor

cc: @moskewcz

@kloudkl
Contributor Author

kloudkl commented Feb 25, 2014

This is a more involved problem than it looks at first sight. There are quite a lot of arguments in the comments on this paper at openreview.net.

One of my biggest concerns is the memory requirement, which is quadratic in the input side length, as discussed on page 4 of the fourth version of the paper. The authors do not list exact numbers for inputs of size n×n with n greater than 64. For ImageNet, n is greater than 200, so with memory growing as n² the requirement would be more than (200/64)² ≈ 10 times larger than the largest reported figure, which renders the proposed algorithm impractical unless the batch size or the number of filters is reduced correspondingly. The trade-off between efficiency and accuracy is not settled at all in the paper.

My understanding may not fully reflect the original authors' intent. If you are interested, please base your comments on reading the paper and the reviews.

@s9xie

s9xie commented May 23, 2014

FYI, an implementation of this in Theano is out. The speedup looks really nice
http://benanne.github.io/2014/05/12/fft-convolutions-in-theano.html

@shelhamer
Member

Thanks for the heads-up!

I'd like to see this explored in Caffe too. @jeffdonahue and I talked to the authors at ICLR, and it seems the speedup is potentially quite good for the low-stride, large-input regime. There could be value in generalizing our convolution layer to compute with either BLAS or FFT via a switch, so we can investigate the tradeoffs.

A PR for this would be welcome!
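
As a rough illustration of the kind of switch meant here (purely hypothetical; none of these names exist in Caffe today), the layer could carry an engine flag and dispatch between the two paths:

```cpp
// Hypothetical sketch of a BLAS-vs-FFT switch in the convolution layer;
// all names here are illustrative, not actual Caffe API.
enum class ConvEngine { GEMM, FFT };

// The two compute paths (bodies elsewhere): the existing im2col + GEMM path
// and a frequency-domain path.
void conv_forward_gemm(const float* bottom, float* top);
void conv_forward_fft(const float* bottom, float* top);

void conv_forward(const float* bottom, float* top, ConvEngine engine) {
  if (engine == ConvEngine::FFT) {
    conv_forward_fft(bottom, top);   // FFT weights and inputs, multiply, invert
  } else {
    conv_forward_gemm(bottom, top);  // default Caffe-style im2col + BLAS
  }
}
```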


@borisgin

Hi, I did a quick prototype of an FFT implementation for ConvolutionalLayer::Forward and pushed it to GitHub. The current version is CPU-only and based on MKL (the free MKL is available for download here: https://software.intel.com/en-us/non-commercial-software-development). A few observations from trying FFT on "imagenet-like" nets:

1. FFT-based convolution makes sense only when kernel_size / stride > 4, since the overhead is quite large. There is a switch fft_on in the code so you can experiment with different ratios.
2. There is overhead for computing the FFT of the weights, so it's better to use FFT with large batches.
3. The memory overhead is significant.
4. The current FFT implementation does not fully utilize all cores, so I added OpenMP to speed it up. You can switch OpenMP off in the Makefile if you don't want it.

You can get the fft version with `git clone -b fft https://github.com/borisgin/caffe`.
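
To make the Forward() idea concrete, here is a minimal single-channel sketch of FFT-based "valid" convolution using the FFTW3 API (an illustration written against plain FFTW, not the code in the fft branch): zero-pad the kernel to the input size, take real-to-complex FFTs, multiply pointwise (conjugating the kernel spectrum, since conv layers actually compute correlation), invert, crop to the valid region, and divide by H·W because FFTW's inverse transform is unnormalized.

```cpp
#include <fftw3.h>
#include <complex>
#include <vector>

// Valid-mode correlation of an H x W input with a K x K kernel via FFT.
// Link with -lfftw3f. Single channel, stride 1, no padding.
void fft_conv_valid(const float* input, int H, int W,
                    const float* kernel, int K, float* output) {
  const int Wc = W / 2 + 1;                    // width of the r2c spectrum
  std::vector<float> k_pad(H * W, 0.f);        // kernel zero-padded to H x W
  for (int i = 0; i < K; ++i)
    for (int j = 0; j < K; ++j)
      k_pad[i * W + j] = kernel[i * K + j];

  fftwf_complex* in_f  = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * H * Wc);
  fftwf_complex* ker_f = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * H * Wc);
  std::vector<float> full(H * W);

  fftwf_plan p_in  = fftwf_plan_dft_r2c_2d(H, W, const_cast<float*>(input), in_f, FFTW_ESTIMATE);
  fftwf_plan p_ker = fftwf_plan_dft_r2c_2d(H, W, k_pad.data(), ker_f, FFTW_ESTIMATE);
  fftwf_plan p_inv = fftwf_plan_dft_c2r_2d(H, W, in_f, full.data(), FFTW_ESTIMATE);
  fftwf_execute(p_in);
  fftwf_execute(p_ker);

  // Pointwise product; the conjugate turns circular convolution into the
  // cross-correlation that convolution layers actually compute.
  for (int i = 0; i < H * Wc; ++i) {
    std::complex<float> a(in_f[i][0], in_f[i][1]);
    std::complex<float> b(ker_f[i][0], ker_f[i][1]);
    std::complex<float> c = a * std::conj(b);
    in_f[i][0] = c.real();
    in_f[i][1] = c.imag();
  }
  fftwf_execute(p_inv);

  // Keep the valid region and undo the missing 1/(H*W) normalization.
  const int Ho = H - K + 1, Wo = W - K + 1;
  for (int i = 0; i < Ho; ++i)
    for (int j = 0; j < Wo; ++j)
      output[i * Wo + j] = full[i * W + j] / (float)(H * W);

  fftwf_destroy_plan(p_in);
  fftwf_destroy_plan(p_ker);
  fftwf_destroy_plan(p_inv);
  fftwf_free(in_f);
  fftwf_free(ker_f);
}
```

In a real layer the FFT of the weights would be computed once per batch and the spectra accumulated over input channels before a single inverse transform per output map, which is where the batch-size and kernel/stride trade-offs above come from.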

@kloudkl
Contributor Author

kloudkl commented Jun 11, 2014

It's not strange that you chose to prototype it with MKL. Based on your email address, I guess you're working for Intel.

Do you think it's possible to let the FFT library switch seamlessly between MKL, FFTW, cuFFT and clFFT by wrapping them in a unified API using the Adaptor design pattern? BTW, it would be cleaner to put the new implementation in an FFTConvolutionalLayer.

I strongly encourage you to open a pull request.
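
For illustration, an adaptor along those lines could look like the sketch below (the class and method names are made up for this comment, not existing Caffe code); each backend would hide its own planning/descriptor handling behind the same 2D real-to-complex contract:

```cpp
#include <complex>

// Hypothetical unified FFT interface; FFTW, MKL (DFTI), cuFFT and clFFT
// adaptors would each implement it, and an FFTConvolutionalLayer would only
// ever talk to this base class.
class FFT2DBackend {
 public:
  virtual ~FFT2DBackend() {}
  // real H x W image -> H x (W/2 + 1) complex spectrum
  virtual void Forward(const float* real_in, std::complex<float>* freq_out,
                       int height, int width) = 0;
  // complex spectrum -> real H x W image (normalization up to the backend)
  virtual void Inverse(const std::complex<float>* freq_in, float* real_out,
                       int height, int width) = 0;
};

// e.g. class FFTWBackend  : public FFT2DBackend { /* wraps fftwf_plan_dft_r2c_2d */ };
//      class CuFFTBackend : public FFT2DBackend { /* wraps cufftExecR2C */ };
```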

@forresti
Contributor

Thanks for working on this FFT-based convolution!

To get an idea of the speed, we sometimes benchmark custom conv layer implementations by timing the convolution for each stage of AlexNet. A broader sweep of the design space would be interesting, but testing the convolution speed on AlexNet is usually a good start. Before we get too gung-ho about the FFT performance... how fast is it?

@forresti
Contributor

Bonus points if we can get GFLOP/s or similar perf metrics for a vanilla CPU convolution (e.g. Caffe's default CPU conv) compared to the new CPU FFT conv code.
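
For reference, a rough way to turn such timings into GFLOP/s for the direct path (a back-of-the-envelope helper, not part of any benchmark code): count out_channels × in_channels × K² multiply-adds per output pixel, treat a multiply-add as 2 FLOPs, and divide by the measured wall time.

```cpp
#include <cstdio>

// FLOPs of one direct-convolution forward pass (multiply-add counted as 2 FLOPs).
double conv_flops(double batch, double in_ch, double out_ch, double kernel,
                  double out_h, double out_w) {
  return 2.0 * batch * in_ch * out_ch * kernel * kernel * out_h * out_w;
}

int main() {
  // Example: the conv2 / kernel-11 configuration from the benchmark further
  // down (batch 128, 96 -> 256 channels, 11x11 kernel, 112x112 output),
  // where the direct CPU path took 336 s for 10 forward iterations.
  double flops = conv_flops(128, 96, 256, 11, 112, 112);  // ~9.5e12 per pass
  double gflops_per_s = 10.0 * flops / 336.0 / 1e9;       // ~280 GFLOP/s
  printf("%.1f GFLOP/s\n", gflops_per_s);
  return 0;
}
```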

@borisgin

I used MKL because it is the fastest and because it is free for academia :). I will wrap FFT in an API so that you can switch between MKL and FFTW in the same way we switch between different BLAS libraries.

@bhack
Contributor

bhack commented Jun 15, 2014

Please also take a look at the comments on the merged PR about the multi-backend batched GEMM issue: Theano/Theano#1870

@borisgin

Hi, I pushed a convolutional layer with an FFT-based Forward(). There is no FFT support in Backward() yet.

  • The implementation is based only on the FFTW3 API. It was tested both with FFTW3 and with MKL.
  • The current version is CPU-only. Is anybody interested in doing a CUDA version?
  • In addition, it supports OpenMP to utilize all cores. It was tested with gcc -fopenmp.

The pushed version is rebased against upstream and ready for a PR.

@forresti
Contributor

How fast?

@borisgin

Based on the current CPU implementation (FFT + OpenMP), my impression is that an FFT-based convolutional layer makes sense only for large kernels (kernel_size / stride >= 10). There are more details on the benchmark below.
Should I open a PR for FFT, or is it better to leave FFT as a private branch? I can add FFT to Backward(), so this code could be useful for experiments with large kernels (> 13), but for most current topologies there is no gain.

More details on benchmarking:

  1. I modified net_speed_benchmark to test Forward() only.
  2. I took the examples/imagenet topology and modified the first two convolutional layers:
  • batch = 128
  • stride = 1
  • kernel = 9, 11, 13, 15
  • for each kernel I slightly changed the crop size in the data layer to make the map dimensions FFT-friendly.

The results are below (time is seconds for 10 forward iterations).

| layer | kernel | input           | output          | base, sec | fft, sec |
|-------|--------|-----------------|-----------------|-----------|----------|
| conv1 | 15     | 128x3x242x242   | 128x96x228x228  | 79        | 44       |
| conv2 | 15     | 128x96x114x114  | 128x256x104x104 | 549       | 331      |
| conv1 | 13     | 128x3x244x244   | 128x96x232x232  | 58        | 45       |
| conv2 | 13     | 128x96x116x116  | 128x256x108x108 | 431       | 330      |
| conv1 | 11     | 128x3x246x246   | 128x96x236x236  | 44        | 41       |
| conv2 | 11     | 128x96x118x118  | 128x256x112x112 | 336       | 323      |
| conv1 | 9      | 128x3x248x248   | 128x96x240x240  | 34        | 43       |
| conv2 | 9      | 128x96x120x120  | 128x256x116x116 | 256       | 333      |

@bhack
Contributor

bhack commented Jun 24, 2014

What is the speedup relative to this benchmark?

@borisgin

The benchmark described in "Even faster convolutional code" compares two GPU implementations: cuda-fft vs. cuda-convnet. I ran my benchmark, based on ImageNet, on the CPU, using the MKL implementations of both fftw and gemm.

@bhack
Contributor

bhack commented Jun 24, 2014

Yes, I know. What I mean is that some of the speedup probably comes from the GPU batched versions of FFT and GEMM. It seems there is also a small speedup factor (1.34x) for a small case at:
input shape (64, 3, 96, 96)
filter shape (128, 3, 16, 16)

@borisgin

I did try the FFTW batch interface, which should be the CPU counterpart of batched FFT, but I did not see any benefit from it. So instead I parallelized the FFTs across cores with OpenMP.
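
A sketch of that per-map OpenMP approach (a reconstruction, not the actual branch code): rather than FFTW's batched "many" plans, run one 2D real-to-complex transform per (image, channel) map and let OpenMP spread the maps across cores. Plan creation is not thread-safe, so the plan is made once up front; the new-array execute call below is thread-safe as long as each thread works on distinct arrays whose layout and alignment match the plan (e.g. a plan created with FFTW_UNALIGNED).

```cpp
#include <fftw3.h>

// Transform num_maps contiguous H x W maps; compile with -fopenmp and link
// with -lfftw3f.
void batch_fft_openmp(float* maps, fftwf_complex* spectra,
                      int num_maps, int H, int W, fftwf_plan r2c_plan) {
  const int Wc = W / 2 + 1;  // width of each r2c spectrum
  #pragma omp parallel for
  for (int m = 0; m < num_maps; ++m) {
    fftwf_execute_dft_r2c(r2c_plan,
                          maps + (size_t)m * H * W,
                          spectra + (size_t)m * H * Wc);
  }
}
```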

@kloudkl
Contributor Author

kloudkl commented Jun 24, 2014

@borisgin, you have implemented two versions. They are well worth opening a PR for, so that further discussion can continue around your code.

@borisgin

Based on VTune, it looks like the current C++ implementation of the complex operations is not very efficient, and there is big potential for a speed-up. I want to do some extra performance tuning before opening a PR.

@borisgin

Hi, I improved the FFT speed by 2x by replacing the standard C++ implementation of complex multiplication.
So the current break-even kernel size is 7.
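
A guess at what replacing the standard complex multiplication looks like (an assumption, not taken from the branch): with GCC, std::complex<float> multiplication goes through extra NaN/Inf recovery logic unless -ffast-math or -fcx-limited-range is used, so an explicit multiply-accumulate on the raw fftwf_complex (float[2]) values is cheaper and vectorizes better.

```cpp
// (a + bi)(c + di) accumulated into out; no special-case handling.
static inline void cmul_acc(const float a[2], const float b[2], float out[2]) {
  out[0] += a[0] * b[0] - a[1] * b[1];  // real part: ac - bd
  out[1] += a[0] * b[1] + a[1] * b[0];  // imag part: ad + bc
}
```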

@kloudkl
Contributor Author

kloudkl commented Jun 30, 2014

It's time to close this issue since it is solved by @borisgin in #544.

kloudkl closed this as completed Jun 30, 2014