Fast Training of Convolutional Networks through FFTs #143

@forresti is working on low-memory convolution, but the runtime of the convolutional layers is still the biggest bottleneck of the network. Shall we investigate applying fast convolutions such as FFT [1]?

[1] Michael Mathieu, Mikael Henaff, Yann LeCun. Fast Training of Convolutional Networks through FFTs. arXiv:1312.5851 [cs.CV]. 2013.

Comments
I just skimmed the new LeCun FFT paper. Seems reasonable -- we should include FFT (perhaps cuFFT) in the list of things to try for accelerating Caffe's convolution layers. In my experience, FFT can require ~2x extra memory space compared to direct convolution, but still probably less memory space than convolution using BLAS. Krizhevsky's cuda-convnet is one of the baselines for LeCun's FFT paper. I haven't looked at the Krizhevsky code in a while... do we know how cuda-convnet's convolution compares to Caffe in terms of speed and memory usage?
cc: @moskewcz
This is a more involved problem than it looks at first sight. There are quite a lot of arguments in the comments on this paper at openreview.net. One of my biggest concerns is the memory requirement, which is quadratic in the input side length, as discussed on page 4 of the fourth version of the paper. The authors do not list exact numbers for inputs of size n*n with n greater than 64. In ImageNet, n is greater than 200, so the memory required would be more than 10 times larger, which renders the proposed algorithm completely impractical unless the batch size or the number of filters is reduced correspondingly. The trade-off between efficiency and accuracy is not settled at all in the paper. My understanding may not fully reflect the original authors' intent; please base your comments on reading the paper and the reviews if you are interested.
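[Editor's note: for intuition on that quadratic growth, here is a back-of-the-envelope sketch, assuming single precision and that input, kernel, and output spectra are all cached as in the paper's scheme; the layer sizes are hypothetical.]

```cpp
#include <cstdio>

// Rough memory model for FFT-based convolution: spectra are cached for
// every input image/channel, every kernel (zero-padded to n x n), and
// every output. An n x n real image's r2c spectrum holds n*(n/2+1)
// complex<float> entries, so memory grows quadratically with n.
long long fft_conv_bytes(long long batch, long long in_ch,
                         long long out_ch, long long n) {
  long long spectrum_bytes = n * (n / 2 + 1) * 2 * sizeof(float);
  long long spectra = batch * in_ch      // input spectra
                    + out_ch * in_ch     // kernel spectra
                    + batch * out_ch;    // output spectra
  return spectra * spectrum_bytes;
}

int main() {
  // hypothetical conv2-like layer: batch 128, 96 -> 256 channels
  std::printf("n=64:  %.1f GB\n", fft_conv_bytes(128, 96, 256, 64) / 1e9);
  std::printf("n=224: %.1f GB\n", fft_conv_bytes(128, 96, 256, 224) / 1e9);
  // prints roughly 1.2 GB vs 14.1 GB -- more than 10x, as noted above
}
```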
FYI, an implementation of this in Theano is out. The speedup looks really nice.
Thanks for the heads-up! I'd like to see this explored in Caffe too. @jeffdonahue and I talked about this; a PR for it would be welcome!
Hi, I did a quick prototype of an FFT implementation for ConvolutionalLayer::Forward() and pushed it to GitHub. The current version is CPU-only, based on MKL (a free MKL is available for download here: https://software.intel.com/en-us/non-commercial-software-development). A few observations from trying FFT on "imagenet-like" nets:
1) FFT-based convolution makes sense only when kernel_size/stride > 4, since the overhead is quite large. There is a switch fft_on in the code so you can experiment with different ratios.
2) There is overhead related to computing the FFT of the weights, so it's better to use FFT with large batches.
3) The memory overhead is significant.
4) The current FFT implementation does not fully utilize all cores, so I added OpenMP to speed it up. You can switch OpenMP off in the Makefile if you don't want it.
You can get the FFT version with "git clone -b fft https://github.com/borisgin/caffe".
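[Editor's note: for readers following along, here is a minimal single-channel sketch of the underlying convolution-theorem idea. This is not the actual prototype code, which is MKL-based; the sketch uses the FFTW3 API, which MKL also exposes through its FFTW compatibility layer.]

```cpp
#include <fftw3.h>
#include <cstring>
#include <vector>

// Convolution via the convolution theorem: zero-pad both signals to
// n x n, transform, multiply spectra pointwise, transform back, and
// crop the "valid" region (which circular wrap-around does not touch).
void fft_conv2d(const float* image, int n,    // n x n input
                const float* kernel, int k,   // k x k kernel
                float* out) {                 // (n-k+1) x (n-k+1) output
  const int nc = n * (n / 2 + 1);             // r2c spectrum size
  std::vector<float> a(n * n, 0.f), b(n * n, 0.f);
  fftwf_complex* A = fftwf_alloc_complex(nc);
  fftwf_complex* B = fftwf_alloc_complex(nc);
  std::memcpy(a.data(), image, sizeof(float) * n * n);
  for (int i = 0; i < k; ++i)                 // zero-pad kernel to n x n
    std::memcpy(&b[i * n], &kernel[i * k], sizeof(float) * k);

  fftwf_plan pa = fftwf_plan_dft_r2c_2d(n, n, a.data(), A, FFTW_ESTIMATE);
  fftwf_plan pb = fftwf_plan_dft_r2c_2d(n, n, b.data(), B, FFTW_ESTIMATE);
  fftwf_execute(pa);
  fftwf_execute(pb);

  for (int i = 0; i < nc; ++i) {              // pointwise complex product
    float re = A[i][0] * B[i][0] - A[i][1] * B[i][1];
    float im = A[i][0] * B[i][1] + A[i][1] * B[i][0];
    A[i][0] = re / (n * n);                   // fold in FFTW's 1/N scale
    A[i][1] = im / (n * n);
  }
  fftwf_plan pi = fftwf_plan_dft_c2r_2d(n, n, A, a.data(), FFTW_ESTIMATE);
  fftwf_execute(pi);
  for (int y = 0; y < n - k + 1; ++y)         // crop the valid region
    std::memcpy(&out[y * (n - k + 1)], &a[(y + k - 1) * n + (k - 1)],
                sizeof(float) * (n - k + 1));
  fftwf_destroy_plan(pa); fftwf_destroy_plan(pb); fftwf_destroy_plan(pi);
  fftwf_free(A); fftwf_free(B);
}
```

Note that Caffe's "convolution" is actually cross-correlation, so a real integration would conjugate the kernel spectrum rather than flip the kernel; the pointwise-product structure is the same either way.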
It's not strange that you chose to prototype it with MKL; based on your email address, I guess you work for Intel. Do you think it's possible to let the FFT library seamlessly switch between MKL, FFTW, cuFFT, and clFFT by wrapping them in a unified API using the adaptor design pattern? BTW, it would be cleaner to put the new implementation in an FFTConvolutionalLayer. I strongly encourage you to open a pull request.
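[Editor's note: something like the following (all names hypothetical) is presumably what is being suggested: one abstract interface with each backend compiled in behind it, selected the way Caffe already selects a BLAS library in the Makefile.]

```cpp
#include <complex>

// Hypothetical adaptor sketch: a single 2D FFT interface that MKL,
// FFTW, cuFFT, or clFFT implementations could each sit behind.
class FFT2D {
 public:
  virtual ~FFT2D() {}
  virtual void Forward(const float* real_in,
                       std::complex<float>* spectrum_out) = 0;
  virtual void Inverse(const std::complex<float>* spectrum_in,
                       float* real_out) = 0;
};

// The factory would return whichever backend was compiled in,
// mirroring the existing BLAS selection mechanism.
FFT2D* CreateFFT2D(int height, int width);
```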
Thanks for working on this FFT-based convolution! To get an idea of the speed, we sometimes benchmark custom conv layer implementations by timing the convolution for each stage of AlexNet. A broader sweep of the design space would be interesting, but testing convolution speed on AlexNet is usually a good start. Before we get too gung-ho about the FFT performance... how fast is it?
Bonus points if we can get GFLOP/s or similar perf metrics for a vanilla CPU convolution (e.g., Caffe's default CPU conv) compared to the new CPU FFT conv code.
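[Editor's note: the flop count of a direct convolution is straightforward, so here is a sketch of that metric, using sizes from the benchmark table posted later in this thread and assuming each timing covers one pass over the listed batch.]

```cpp
// Direct convolution does one multiply-add (2 flops) per kernel
// element, input channel, and output position.
double conv_gflops(double batch, double in_ch, double out_ch, double k,
                   double out_h, double out_w, double seconds) {
  return 2.0 * batch * in_ch * out_ch * k * k * out_h * out_w
         / seconds / 1e9;
}
// conv_gflops(128, 3, 96, 15, 228, 228, 79.0) ~= 10.9 GFLOP/s for the
// direct path; plugging in the 44 s FFT timing gives ~19.6 "effective"
// GFLOP/s (the FFT version performs fewer actual flops, of course).
```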
I used MKL because it is fastest, and because it is free for academia :). I will wrap the FFT into some API, so you can switch between MKL and FFTW in the same way we switch between different BLAS libraries.
Please take a look also at the merged PR and the comments on the multi-backend batched gemm issue: Theano/Theano#1870
Hi, I pushed a convolutional layer with an FFT-based Forward(). There is no FFT support in Backward() yet.
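[Editor's note: for context on what an FFT-based Backward() would involve, both gradients are themselves convolutions/correlations, so the same spectra can in principle be reused. This is the standard identity, written for a single channel at stride 1, not code from the prototype.]

```latex
\begin{aligned}
y &= x \ast w
  && \text{(forward)}\\
\partial L/\partial x &= (\partial L/\partial y) \ast \tilde{w}
  && \text{(full convolution, $\tilde{w}$ = flipped kernel)}\\
\partial L/\partial w &= x \star (\partial L/\partial y)
  && \text{(cross-correlation, valid part)}
\end{aligned}
```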
How fast?
Based on the current CPU implementation (FFT + OpenMP), my impression is that an FFT-based convolutional layer makes sense only for large kernels (kernel_size/stride >= 10). More details on the benchmark below:
| layer | kernel | input | output | base, sec | fft, sec |
|-------|--------|-------|--------|-----------|----------|
| conv1 | 15 | 128x3x242x242 | 128x96x228x228 | 79 | 44 |
| conv1 | 13 | 128x3x244x244 | 128x96x232x232 | 58 | 45 |
| conv1 | 11 | 128x3x246x246 | 128x96x236x236 | 44 | 41 |
| conv1 | 9  | 128x3x248x248 | 128x96x240x240 | 34 | 43 |
What is the speedup relative to this benchmark?
The benchmark described in "Even faster convolutional code" compares two GPU implementations: cuda-fft vs. cuda-convnet. I ran my benchmark, based on ImageNet, on the CPU, using the MKL implementation both for the FFTs and for the gemm.
Yes, I know. What I mean is that some of the speedup probably comes from the GPU batched versions of FFT and GEMM. It seems there is also a small speedup factor (1.34x) for the small one at:
I did try FFTW's batch interface, which should be the CPU version of batched FFT, but I did not see any benefit from it. So instead I parallelized the FFTs across cores with OpenMP.
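[Editor's note: a sketch of that coarse-grained parallelization; fft_conv2d is the hypothetical single-channel helper from the earlier sketch, not the prototype's actual function.]

```cpp
// From the earlier sketch (hypothetical helper).
void fft_conv2d(const float* image, int n, const float* kernel, int k,
                float* out);

// Coarse-grained CPU alternative to batched FFT: every (image, channel)
// pair is an independent 2D transform, so spread them across cores.
// Caveat: FFTW plan *creation* is not thread-safe; a production version
// would create the plans once, outside the parallel region.
void fft_conv_batch(const float* images, int batch, int channels, int n,
                    const float* kernels, int k, float* outputs) {
  const int m = n - k + 1;  // valid output side length
  #pragma omp parallel for collapse(2)
  for (int i = 0; i < batch; ++i)
    for (int c = 0; c < channels; ++c)
      fft_conv2d(images  + (i * channels + c) * n * n, n,
                 kernels + c * k * k, k,
                 outputs + (i * channels + c) * m * m);
}
```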
@borisgin, you have implemented two versions. It would be quite worthwhile to open a PR so that further discussion can move forward with your code.
Based on VTune, it looks like the current C++ implementation of the complex operations is not very efficient, and there is some big potential for speed-up. I want to do some extra performance tuning before opening a PR.
Hi, I improved the FFT speed by 2x by replacing the standard C++ implementation of complex multiplication.
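[Editor's note: the actual commit isn't shown in this thread, but a plausible shape of such a change is below. std::complex<float> products can compile to a library call (__mulsc3) that handles inf/NaN corner cases per the C99 rules, whereas writing out (a+bi)(c+di) explicitly lets the compiler vectorize the loop.]

```cpp
// Elementwise complex multiply-accumulate over interleaved (re, im)
// arrays: acc += a * b. The explicit form avoids the corner-case
// handling that std::complex multiplication can incur without
// -ffast-math.
inline void cmul_accum(const float* a, const float* b, float* acc, int n) {
  for (int i = 0; i < 2 * n; i += 2) {
    acc[i]     += a[i] * b[i]     - a[i + 1] * b[i + 1];
    acc[i + 1] += a[i] * b[i + 1] + a[i + 1] * b[i];
  }
}
```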