Be aware of the competing fast GPU neural network library CXXNET #382
Comments
Actually, Caffe achieves comparable speed, classifying 395 images per second, or ~34 million per day, on a GTX 780. If I'm not mistaken, the 780 actually has a higher clock speed (875 MHz) and memory bandwidth (336 GB/s) than the K20 (705 MHz and 208 GB/s), and even the K40 at default settings (745 MHz and 288 GB/s). At the highest boost clock setting the K40 runs at 875 MHz, and this is the setting we chose for our benchmarks, although it was never clear to me whether that is a peak or sustained speed.
In fact, CXXNET relies on cuBLAS in the same way Caffe does; even the convolutions are implemented the same way, using im2col followed by matrix multiplications. It seems to me that the authors were inspired by Caffe. P.S.: Caffe can classify 500 images per second on a Titan or on a K40 at full speed.
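For anyone following along who hasn't seen the trick: im2col unrolls every convolution window into a column of a matrix, so the convolution becomes one dense matrix multiplication that can be handed straight to cuBLAS. A minimal single-channel CPU sketch of the idea (stride 1, no padding; illustrative only, not Caffe's or CXXNET's actual code):

```cpp
#include <vector>

// Unroll an H x W single-channel image into a (K*K) x (out_h*out_w) matrix:
// each column holds one K x K window, flattened. Stride 1, no padding.
void im2col(const float* img, int H, int W, int K, std::vector<float>& col) {
  const int out_h = H - K + 1, out_w = W - K + 1;
  col.resize(K * K * out_h * out_w);
  for (int ky = 0; ky < K; ++ky)
    for (int kx = 0; kx < K; ++kx)
      for (int y = 0; y < out_h; ++y)
        for (int x = 0; x < out_w; ++x)
          // Row (ky*K + kx) selects the kernel element,
          // column (y*out_w + x) selects the output pixel.
          col[((ky * K + kx) * out_h + y) * out_w + x] =
              img[(y + ky) * W + (x + kx)];
}

// The convolution then reduces to one GEMM:
//   output(F x out_h*out_w) = filters(F x K*K) * col(K*K x out_h*out_w)
// which on the GPU is a single cublasSgemm call.
```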
Considering the hardware specifications of the GTX 780 and [Tesla K40](http://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf), there is no big difference in speed. There is no doubt that the authors borrowed from Caffe. But some parts of CXXNET are indeed good enough to learn from. Just to name a few examples: the element-wise operations, the all-encompassing main function which corresponds to the isolated Caffe tools, the unified model config file which is still a todo task here, the data class which is perhaps what our DataSource should be, and the layers using the concise tensor API. Even if its implementation is not going to be borrowed back, it reminds us that others are quickly catching up.
I like how minimal everything is in cxxnet. We should consider using mshadow or an mshadow-like approach, and not have separate CPU/GPU code for all layers.
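To make that concrete, here is a rough sketch of what a single code path per layer can look like under an mshadow-like design. Every name here (`cpu`, `gpu`, `Tensor`, `RunRelu`) is an illustrative assumption, not mshadow's actual API:

```cpp
#include <cstddef>

// Illustrative device tags, in the spirit of an mshadow-like design.
struct cpu {};
struct gpu {};

template <typename Device>
struct Tensor {        // non-owning view: wraps memory allocated elsewhere
  float* data;
  std::size_t size;
};

// Device-specific evaluation engine: a plain loop on the CPU...
inline void RunRelu(Tensor<cpu> out, Tensor<cpu> in) {
  for (std::size_t i = 0; i < in.size; ++i)
    out.data[i] = in.data[i] > 0.0f ? in.data[i] : 0.0f;
}
// ...and one generic CUDA kernel launch on the GPU (body omitted here).
void RunRelu(Tensor<gpu> out, Tensor<gpu> in);

// The layer itself is written exactly once; overload resolution picks the
// right engine, so there is no per-layer CPU/GPU duplication.
template <typename Device>
void ReluForward(Tensor<Device> out, Tensor<Device> in) {
  RunRelu(out, in);
}
```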
I like the idea of cxxnet too. In fact, I sort of wanted to write a tensor library myself.
I was brought to this thread by @kloudkl. Bing and I are glad that mshadow and cxxnet are being noticed. Indeed we learned from caffe when implementing cxxnet; specifically, the im2col way to do convolution, which was new to us before we learned it from caffe. There should not be a significant speed difference between the two implementations, though cxxnet does the packing and un-packing for multiple images at a time when doing conv, which I don't know whether the most recent version of caffe already supports. I would like to advertise mshadow a bit :) MShadow itself is also concise, with 3k lines of code and only 4 CUDA kernels so far, thanks to its use of expression templates. It would be great if some parts of caffe could also use mshadow. Because mshadow can run on a plugged-in pointer (a tensor constructed around existing memory), this could easily be done without replacing the Blob structure, while allowing update rules and layers to be written as expressions.
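For readers unfamiliar with the expression-template trick being described: overloaded operators build a lightweight expression tree at compile time, and a single generic evaluation loop (one generic CUDA kernel, in the GPU case) can then evaluate any elementwise expression, which is how a library can get by with only a handful of kernels. A self-contained toy version of the idea in plain C++ (illustrative only, not mshadow's actual code):

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

struct Scalar {
  float v;
  float Eval(std::size_t) const { return v; }
};

struct TensorView {            // non-owning: wraps a "plugged-in" pointer
  float* data;
  std::size_t size;
  float Eval(std::size_t i) const { return data[i]; }
  template <typename Expr>
  void operator=(const Expr& e) {          // the one evaluation loop
    for (std::size_t i = 0; i < size; ++i) data[i] = e.Eval(i);
  }
};

// An expression node; building one allocates nothing and copies nothing big.
template <typename L, typename R, typename Op>
struct Binary {
  L lhs; R rhs;
  float Eval(std::size_t i) const { return Op::Map(lhs.Eval(i), rhs.Eval(i)); }
};

struct Mul { static float Map(float a, float b) { return a * b; } };
struct Sub { static float Map(float a, float b) { return a - b; } };

template <typename L, typename R>
Binary<L, R, Mul> operator*(L l, R r) { return {l, r}; }
template <typename L, typename R>
Binary<L, R, Sub> operator-(L l, R r) { return {l, r}; }

int main() {
  std::vector<float> w{1, 2, 3}, g{10, 20, 30};
  TensorView W{w.data(), w.size()}, G{g.data(), g.size()};
  W = W - Scalar{0.01f} * G;   // SGD update as one fused expression
  std::cout << w[0] << ' ' << w[1] << ' ' << w[2] << '\n';  // 0.9 1.8 2.7
}
```

The final line fuses the scale and the subtraction into one pass over memory, and because `TensorView` does not own its pointer, the same pattern could wrap existing storage such as a Blob's data, which is the "plugged-in pointer" point above.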
Since this February, there has been a "(convolutional) neural network toolkit," CXXNET, based on mshadow, a "Lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA." The toolkit is able to classify 400 images per second, i.e. about 35 million per day, on a GTX 780 GPU. It seems to be faster than Caffe, which can process 20 million images per day on a K20 and 40 million per day on a K40.
Since CXXNET uses the tensor library, its code is also much more concise than Caffe's.
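(For reference, the per-day figures in this thread are just the per-second rates scaled by the 86,400 seconds in a day: 400 images/s × 86,400 s ≈ 34.6 million images/day, hence "about 35 million.")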