
Fast Training of Convolutional Networks through FFTs #143

Closed
kloudkl opened this issue Feb 24, 2014 · 22 comments

Comments

@kloudkl
Contributor

kloudkl commented Feb 24, 2014

@forresti is working on low-memory convolution, but the run time of the convolutional layers is still the biggest bottleneck of the network. Shall we investigate fast convolution methods such as the FFT [1]?

[1] Michael Mathieu, Mikael Henaff, Yann LeCun. Fast Training of Convolutional Networks through FFTs. arXiv:1312.5851 [cs.CV]. 2013.
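
The core idea of [1] is the convolution theorem: a spatial convolution becomes a pointwise product in the frequency domain, so the per-pair cost no longer depends on the kernel size. A sketch of the argument (x is an n×n input map, w a k×k kernel zero-padded to n×n):

```latex
% Convolution theorem: spatial convolution = pointwise product of spectra
x \ast w = \mathcal{F}^{-1}\!\big( \mathcal{F}(x) \odot \mathcal{F}(w) \big)
% Cost per (input map, filter) pair:
%   direct:  O(n^2 k^2)
%   via FFT: O(n^2 \log n) for the transforms + O(n^2) for the product,
%            and each transform is reused across every pairing it appears in.
```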

@forresti
Contributor

I just skimmed the new LeCun FFT paper. Seems reasonable -- we should include FFT (perhaps cuFFT) in the list of things to try for accelerating Caffe's convolution layers. In my experience, FFT can require ~2x extra memory space compared to direct convolution, but still probably less memory space than convolution using BLAS.

Krizhevsky's cuda-convnet is one of the baselines for LeCun's FFT paper. I haven't looked at the Krizhevsky code in a while... do we know how cuda-convnet's convolution compares to Caffe in terms of speed and memory usage?

@forresti
Contributor

cc: @moskewcz

@kloudkl
Contributor Author

kloudkl commented Feb 25, 2014

This is a more involved problem than it looks at first sight. There are quite a lot of arguments in the comments on this paper at openreview.net.

One of my biggest concerns is the memory requirement, which is quadratic in the input side length, as discussed on page 4 of the fourth version of the paper. The authors do not list exact numbers for inputs of size n×n with n greater than 64. For ImageNet, n is greater than 200, so with memory growing as n² the requirement would be more than (200/64)² ≈ 10 times larger than the largest reported figure, which renders the proposed algorithm impractical unless the batch size or the number of filters is reduced correspondingly. The trade-off between efficiency and accuracy is not settled at all in the paper.

My understanding may not fully reflect the original authors' intent. If you are interested, please base your comments on reading the paper and the reviews.

@s9xie

s9xie commented May 23, 2014

FYI, an implementation of this in Theano is out. The speedup looks really nice
http://benanne.github.io/2014/05/12/fft-convolutions-in-theano.html

@shelhamer
Member

Thanks for the heads-up!

I'd like to see this explored in Caffe too. @jeffdonahue and I talked to the authors at ICLR, and it seems the speedup is potentially quite good for the low-stride, large-input regime. There could be value in generalizing our convolution layer to compute with either BLAS or FFT via a switch, so we can investigate the tradeoffs.

A PR for this would be welcome!
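
As a rough illustration of the kind of switch meant here (purely hypothetical; none of these names exist in Caffe today), the layer could carry an engine flag and dispatch between the two paths:

```cpp
// Hypothetical sketch of a BLAS-vs-FFT switch in the convolution layer;
// all names here are illustrative, not actual Caffe API.
enum class ConvEngine { GEMM, FFT };

// The two compute paths (bodies elsewhere): the existing im2col + GEMM path
// and a frequency-domain path.
void conv_forward_gemm(const float* bottom, float* top);
void conv_forward_fft(const float* bottom, float* top);

void conv_forward(const float* bottom, float* top, ConvEngine engine) {
  if (engine == ConvEngine::FFT) {
    conv_forward_fft(bottom, top);   // FFT weights and inputs, multiply, invert
  } else {
    conv_forward_gemm(bottom, top);  // default Caffe-style im2col + BLAS
  }
}
```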


@borisgin

Hi, I did a quick prototype of an FFT implementation for ConvolutionalLayer::Forward and pushed it to GitHub. The current version is CPU-only and based on MKL (the free MKL is available for download here: https://software.intel.com/en-us/non-commercial-software-development). A few observations from trying FFT on "imagenet-like" nets:

1. FFT-based convolution makes sense only when kernel_size / stride > 4, since the overhead is quite large. There is a switch fft_on in the code so you can experiment with different ratios.
2. There is overhead for computing the FFT of the weights, so it's better to use FFT with large batches.
3. The memory overhead is significant.
4. The current FFT implementation does not fully utilize all cores, so I added OpenMP to speed it up. You can switch OpenMP off in the Makefile if you don't want it.

You can get the fft version with `git clone -b fft https://github.com/borisgin/caffe`.
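
To make the Forward() idea concrete, here is a minimal single-channel sketch of FFT-based "valid" convolution using the FFTW3 API (an illustration written against plain FFTW, not the code in the fft branch): zero-pad the kernel to the input size, take real-to-complex FFTs, multiply pointwise (conjugating the kernel spectrum, since conv layers actually compute correlation), invert, crop to the valid region, and divide by H·W because FFTW's inverse transform is unnormalized.

```cpp
#include <fftw3.h>
#include <complex>
#include <vector>

// Valid-mode correlation of an H x W input with a K x K kernel via FFT.
// Link with -lfftw3f. Single channel, stride 1, no padding.
void fft_conv_valid(const float* input, int H, int W,
                    const float* kernel, int K, float* output) {
  const int Wc = W / 2 + 1;                    // width of the r2c spectrum
  std::vector<float> k_pad(H * W, 0.f);        // kernel zero-padded to H x W
  for (int i = 0; i < K; ++i)
    for (int j = 0; j < K; ++j)
      k_pad[i * W + j] = kernel[i * K + j];

  fftwf_complex* in_f  = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * H * Wc);
  fftwf_complex* ker_f = (fftwf_complex*)fftwf_malloc(sizeof(fftwf_complex) * H * Wc);
  std::vector<float> full(H * W);

  fftwf_plan p_in  = fftwf_plan_dft_r2c_2d(H, W, const_cast<float*>(input), in_f, FFTW_ESTIMATE);
  fftwf_plan p_ker = fftwf_plan_dft_r2c_2d(H, W, k_pad.data(), ker_f, FFTW_ESTIMATE);
  fftwf_plan p_inv = fftwf_plan_dft_c2r_2d(H, W, in_f, full.data(), FFTW_ESTIMATE);
  fftwf_execute(p_in);
  fftwf_execute(p_ker);

  // Pointwise product; the conjugate turns circular convolution into the
  // cross-correlation that convolution layers actually compute.
  for (int i = 0; i < H * Wc; ++i) {
    std::complex<float> a(in_f[i][0], in_f[i][1]);
    std::complex<float> b(ker_f[i][0], ker_f[i][1]);
    std::complex<float> c = a * std::conj(b);
    in_f[i][0] = c.real();
    in_f[i][1] = c.imag();
  }
  fftwf_execute(p_inv);

  // Keep the valid region and undo the missing 1/(H*W) normalization.
  const int Ho = H - K + 1, Wo = W - K + 1;
  for (int i = 0; i < Ho; ++i)
    for (int j = 0; j < Wo; ++j)
      output[i * Wo + j] = full[i * W + j] / (float)(H * W);

  fftwf_destroy_plan(p_in);
  fftwf_destroy_plan(p_ker);
  fftwf_destroy_plan(p_inv);
  fftwf_free(in_f);
  fftwf_free(ker_f);
}
```

In a real layer the FFT of the weights would be computed once per batch and the spectra accumulated over input channels before a single inverse transform per output map, which is where the batch-size and kernel/stride trade-offs above come from.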

@kloudkl
Contributor Author

kloudkl commented Jun 11, 2014

It's not strange that you chose to prototype it with MKL. Based on your email address, I guess you're working for Intel.

Do you think it's possible to let the FFT library switch seamlessly between MKL, FFTW, cuFFT and clFFT by wrapping them in a unified API using the Adaptor design pattern? BTW, it would be cleaner to put the new implementation in an FFTConvolutionalLayer.

I strongly encourage you to open a pull request.
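
For illustration, an adaptor along those lines could look like the sketch below (the class and method names are made up for this comment, not existing Caffe code); each backend would hide its own planning/descriptor handling behind the same 2D real-to-complex contract:

```cpp
#include <complex>

// Hypothetical unified FFT interface; FFTW, MKL (DFTI), cuFFT and clFFT
// adaptors would each implement it, and an FFTConvolutionalLayer would only
// ever talk to this base class.
class FFT2DBackend {
 public:
  virtual ~FFT2DBackend() {}
  // real H x W image -> H x (W/2 + 1) complex spectrum
  virtual void Forward(const float* real_in, std::complex<float>* freq_out,
                       int height, int width) = 0;
  // complex spectrum -> real H x W image (normalization up to the backend)
  virtual void Inverse(const std::complex<float>* freq_in, float* real_out,
                       int height, int width) = 0;
};

// e.g. class FFTWBackend  : public FFT2DBackend { /* wraps fftwf_plan_dft_r2c_2d */ };
//      class CuFFTBackend : public FFT2DBackend { /* wraps cufftExecR2C */ };
```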

@forresti
Contributor

Thanks for working on this FFT-based convolution!

To get an idea of the speed, we sometimes benchmark custom conv layer implementations by timing the convolution for each stage of AlexNet. A broader sweep of the design space would be interesting, but testing the convolution speed on AlexNet is usually a good start. Before we get too gung-ho about the FFT performance... how fast is it?

@forresti
Contributor

Bonus points if we can get GFLOP/s or similar perf metrics for a vanilla CPU convolution (e.g. Caffe's default CPU conv) compared to the new CPU FFT conv code.
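
For reference, a rough way to turn such timings into GFLOP/s for the direct path (a back-of-the-envelope helper, not part of any benchmark code): count out_channels × in_channels × K² multiply-adds per output pixel, treat a multiply-add as 2 FLOPs, and divide by the measured wall time.

```cpp
#include <cstdio>

// FLOPs of one direct-convolution forward pass (multiply-add counted as 2 FLOPs).
double conv_flops(double batch, double in_ch, double out_ch, double kernel,
                  double out_h, double out_w) {
  return 2.0 * batch * in_ch * out_ch * kernel * kernel * out_h * out_w;
}

int main() {
  // Example: the conv2 / kernel-11 configuration from the benchmark further
  // down (batch 128, 96 -> 256 channels, 11x11 kernel, 112x112 output),
  // where the direct CPU path took 336 s for 10 forward iterations.
  double flops = conv_flops(128, 96, 256, 11, 112, 112);  // ~9.5e12 per pass
  double gflops_per_s = 10.0 * flops / 336.0 / 1e9;       // ~280 GFLOP/s
  printf("%.1f GFLOP/s\n", gflops_per_s);
  return 0;
}
```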

@borisgin

I used MKL because it is the fastest and because it is free for academia :). I will wrap FFT in an API so that you can switch between MKL and FFTW in the same way we switch between different BLAS libraries.

@bhack
Contributor

bhack commented Jun 15, 2014

Please also take a look at the comments on the merged PR about the multi-backend batched GEMM issue: Theano/Theano#1870

@borisgin

Hi, I pushed a convolutional layer with an FFT-based Forward(). There is no FFT support in Backward() yet.

  • The implementation is based only on the FFTW3 API. It was tested both with FFTW3 and with MKL.
  • The current version is CPU-only. Is anybody interested in doing a CUDA version?
  • In addition, it supports OpenMP to utilize all cores. It was tested with gcc -fopenmp.

The pushed version is rebased against upstream and ready for a PR.

@forresti
Contributor

How fast?

@borisgin

Based on the current CPU implementation (FFT + OpenMP), my impression is that an FFT-based convolutional layer makes sense only for large kernels (kernel_size / stride >= 10). There are more details on the benchmark below.
Should I open a PR for FFT, or is it better to leave FFT as a private branch? I can add FFT to Backward(), so this code could be useful for experiments with large kernels (> 13), but for most current topologies there is no gain.

More details on benchmarking:

  1. I modified net_speed_benchmark to test Forward() only.
  2. I took the examples/imagenet topology and modified the first two convolutional layers:
  • batch = 128
  • stride = 1
  • kernel = 9, 11, 13, 15
  • for each kernel I slightly changed the crop size in the data layer to make the map dimensions FFT-friendly.

The results are below (time is seconds for 10 forward iterations).

| layer | kernel | input           | output          | base, sec | fft, sec |
|-------|--------|-----------------|-----------------|-----------|----------|
| conv1 | 15     | 128x3x242x242   | 128x96x228x228  | 79        | 44       |
| conv2 | 15     | 128x96x114x114  | 128x256x104x104 | 549       | 331      |
| conv1 | 13     | 128x3x244x244   | 128x96x232x232  | 58        | 45       |
| conv2 | 13     | 128x96x116x116  | 128x256x108x108 | 431       | 330      |
| conv1 | 11     | 128x3x246x246   | 128x96x236x236  | 44        | 41       |
| conv2 | 11     | 128x96x118x118  | 128x256x112x112 | 336       | 323      |
| conv1 | 9      | 128x3x248x248   | 128x96x240x240  | 34        | 43       |
| conv2 | 9      | 128x96x120x120  | 128x256x116x116 | 256       | 333      |

@bhack
Contributor

bhack commented Jun 24, 2014

What is the speedup relative to this benchmark?

@borisgin

The benchmark described in "Even faster convolutional code" compares two GPU implementations: cuda-fft vs. cuda-convnet. I ran my benchmark, based on ImageNet, on the CPU, using the MKL implementations of both fftw and gemm.

@bhack
Contributor

bhack commented Jun 24, 2014

Yes, I know. What I mean is that some of the speedup probably comes from the GPU batched versions of FFT and GEMM. It seems there is also a small speedup factor (1.34x) for a small case at:
input shape (64, 3, 96, 96)
filter shape (128, 3, 16, 16)

@borisgin

I did try the FFTW batch interface, which should be the CPU counterpart of batched FFT, but I did not see any benefit from it. So instead I parallelized the FFTs across cores with OpenMP.
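
A sketch of that per-map OpenMP approach (a reconstruction, not the actual branch code): rather than FFTW's batched "many" plans, run one 2D real-to-complex transform per (image, channel) map and let OpenMP spread the maps across cores. Plan creation is not thread-safe, so the plan is made once up front; the new-array execute call below is thread-safe as long as each thread works on distinct arrays whose layout and alignment match the plan (e.g. a plan created with FFTW_UNALIGNED).

```cpp
#include <fftw3.h>

// Transform num_maps contiguous H x W maps; compile with -fopenmp and link
// with -lfftw3f.
void batch_fft_openmp(float* maps, fftwf_complex* spectra,
                      int num_maps, int H, int W, fftwf_plan r2c_plan) {
  const int Wc = W / 2 + 1;  // width of each r2c spectrum
  #pragma omp parallel for
  for (int m = 0; m < num_maps; ++m) {
    fftwf_execute_dft_r2c(r2c_plan,
                          maps + (size_t)m * H * W,
                          spectra + (size_t)m * H * Wc);
  }
}
```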

@kloudkl
Contributor Author

kloudkl commented Jun 24, 2014

@borisgin, you have implemented two versions. They are well worth opening a PR for, so that further discussion can continue around your code.

@borisgin

Based on VTune, it looks like the current C++ implementation of the complex operations is not very efficient, and there is big potential for a speed-up. I want to do some extra performance tuning before opening a PR.

@borisgin

Hi, I improved the FFT speed by 2x by replacing the standard C++ implementation of complex multiplication.
So the current break-even kernel size is 7.
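
A guess at what replacing the standard complex multiplication looks like (an assumption, not taken from the branch): with GCC, std::complex<float> multiplication goes through extra NaN/Inf recovery logic unless -ffast-math or -fcx-limited-range is used, so an explicit multiply-accumulate on the raw fftwf_complex (float[2]) values is cheaper and vectorizes better.

```cpp
// (a + bi)(c + di) accumulated into out; no special-case handling.
static inline void cmul_acc(const float a[2], const float b[2], float out[2]) {
  out[0] += a[0] * b[0] - a[1] * b[1];  // real part: ac - bd
  out[1] += a[0] * b[1] + a[1] * b[0];  // imag part: ad + bc
}
```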

@kloudkl
Contributor Author

kloudkl commented Jun 30, 2014

It's time to close this issue since it is solved by @borisgin in #544.

kloudkl closed this as completed Jun 30, 2014