Be aware of the competing fast GPU neural network library CXXNET #382

kloudkl · 2014-05-03T14:39:36Z

Since this February, there has been a "(convolutional) neural network toolkit", CXXNET, based on mshadow, a "Lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA". The toolkit can classify 400 images per second, i.e. about 35 million per day, on a GTX 780 GPU. That appears to be faster than Caffe, which can process 20 million per day on a K20 and 40 million per day on a K40.
Since CXXNET is using the tensor library, its code is also much more concise than Caffe's.

shelhamer · 2014-05-05T02:38:03Z

Actually, Caffe achieves comparable speed by classifying 395 images per second or ~35 million per day on a GTX 780.
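To make the throughput figures quoted so far directly comparable, here is a small sketch that converts between images per second and images per day. The helper names are made up for illustration; the only inputs are the numbers quoted in this thread (400 and 395 images per second, 20 and 40 million per day).

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def images_per_day(rate_per_second):
    """Convert a sustained images/second rate to images/day."""
    return rate_per_second * SECONDS_PER_DAY


def images_per_second(total_per_day):
    """Convert an images/day figure back to images/second."""
    return total_per_day / SECONDS_PER_DAY


# Per-second figures quoted in the thread, expressed per day.
for label, rate in [("CXXNET on GTX 780", 400), ("Caffe on GTX 780", 395)]:
    print(f"{label}: {rate} img/s = {images_per_day(rate) / 1e6:.2f} M img/day")

# Per-day figures quoted in the thread, expressed per second.
for label, per_day in [("Caffe on K20", 20e6), ("Caffe on K40", 40e6)]:
    print(f"{label}: {per_day / 1e6:.0f} M img/day = {images_per_second(per_day):.0f} img/s")
```

On these numbers, 400 images per second is 34.56 million per day and 395 is about 34.13 million, so the two GTX 780 results differ by roughly one percent.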

If I'm not mistaken, the 780 actually has a higher clock speed (875 MHz) and memory transfer rate (336 GB/s) than the K20 (705 MHz and 208 GB/s), and even the K40 with default settings (745 MHz and 288 GB/s). With the highest boost clock setting the K40 speed is 875 MHz, and this is the setting we choose for our benchmarks, although it was never clear to me whether that is a peak or sustained speed.

sguada · 2014-05-05T05:20:42Z

In fact, CXXNET relies on cuBLAS in the same way Caffe does. Even the convolutions are implemented the same way, using an im2col step followed by matrix multiplications. It seems to me that the authors were inspired by Caffe.
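For readers unfamiliar with the im2col approach mentioned above, here is a minimal pure-Python sketch. It assumes a single-channel image, stride 1, and no padding, and it is not taken from either Caffe or CXXNET, which do this in C++/CUDA and hand the multiplication to a BLAS library. All function names are illustrative.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of a 2-D image into one column.

    Stride 1, no padding. Returns a (kh*kw) x (out_h*out_w) matrix
    together with the output height and width.
    """
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = [[0] * (out_h * out_w) for _ in range(kh * kw)]
    for i in range(out_h):
        for j in range(out_w):
            col = i * out_w + j
            for di in range(kh):
                for dj in range(kw):
                    cols[di * kw + dj][col] = image[i + di][j + dj]
    return cols, out_h, out_w


def matmul(a, b):
    """Plain matrix product; a real library would call a BLAS GEMM here."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def conv2d_im2col(image, kernels):
    """Apply each kh x kw filter to the image with one matrix product.

    Like most CNN libraries, this computes cross-correlation
    (the filter is not flipped).
    """
    kh, kw = len(kernels[0]), len(kernels[0][0])
    cols, out_h, out_w = im2col(image, kh, kw)
    # One flattened filter per row: (num_filters) x (kh*kw).
    weights = [[v for row in k for v in row] for k in kernels]
    flat = matmul(weights, cols)  # (num_filters) x (out_h*out_w)
    # Reshape each output row back into an out_h x out_w feature map.
    return [[r[i * out_w:(i + 1) * out_w] for i in range(out_h)] for r in flat]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
diagonal = [[1, 0], [0, 1]]
box = [[1, 1], [1, 1]]
feature_maps = conv2d_im2col(image, [diagonal, box])
print(feature_maps)
```

The point of the transformation is that all filters are applied in a single matrix product, so the speed of the convolution is largely the speed of the underlying GEMM.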

PS: Caffe can classify 500 images per second on a Titan or on a K40 at full speed.

kloudkl · 2014-05-05T12:02:00Z

Considering the hardware specification of GTX 780 and [Tesla K40](http://www.nvidia.com/content/PDF/kepler/Tesla-K40-Active-Board-Spec-BD-06949-001_v03.pdf), there is no big difference in speed.

There is no doubt that the authors borrowed from Caffe. But some parts of CXXNET are indeed good enough to learn from. To name a few examples: the element-wise operations, the all-encompassing main function that corresponds to the separate Caffe tools, the unified model config file (still a to-do item here), the data class, which is perhaps what our DataSource should be, and the layers written with the concise tensor API. Even if its implementation is not going to be borrowed back, it reminds us that others are quickly catching up.

sergeyk · 2014-05-06T17:58:18Z

I like how minimal everything is in cxxnet. We should consider using mshadow or an mshadow-like approach and not have separate cpu/gpu code for all layers.

Yangqing · 2014-05-07T18:33:26Z

I like the idea of cxxnet too. In fact, I sort of wanted to write a tensor interface, but the quick rewrite back in November led to a (crappy) matrix library: you can see that there are switches everywhere that just call either caffe_gpu_* or caffe_cpu_*. If someone wants to try cleaning it up, that would be great, but it will mostly be plain refactoring. (Note that cxxnet just hides the "ugly" CPU and GPU separation deeper in mshadow :)). Speed-wise, things won't be much different if one uses the same BLAS library.
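The difference between per-layer switches and a single hidden separation can be sketched as follows. Everything here is hypothetical: `cpu_axpy`, `gpu_axpy`, `Device`, and the layer functions are invented names, and the "GPU" path simply runs on the CPU so that the sketch executes. It does not reproduce Caffe's or mshadow's actual code.

```python
def cpu_axpy(alpha, x, y):
    """y <- alpha * x + y on the (hypothetical) CPU path."""
    for i, v in enumerate(x):
        y[i] += alpha * v


def gpu_axpy(alpha, x, y):
    """Stand-in for a GPU path; it runs on the CPU so the sketch executes."""
    cpu_axpy(alpha, x, y)


# Style A: every layer repeats the mode switch itself.
def layer_update_with_switch(mode, alpha, x, y):
    if mode == "cpu":
        cpu_axpy(alpha, x, y)
    elif mode == "gpu":
        gpu_axpy(alpha, x, y)
    else:
        raise ValueError(f"unknown mode: {mode}")


# Style B: the switch lives in one place and layers are written once.
class Device:
    def __init__(self, name, axpy):
        self.name = name
        self.axpy = axpy


DEVICES = {
    "cpu": Device("cpu", cpu_axpy),
    "gpu": Device("gpu", gpu_axpy),
}


def layer_update(device, alpha, x, y):
    """Device-agnostic layer code: no branching on the mode here."""
    device.axpy(alpha, x, y)


y_switch = [1.0, 1.0]
layer_update_with_switch("cpu", 2.0, [1.0, 2.0], y_switch)

y_device = [1.0, 1.0]
layer_update(DEVICES["gpu"], 2.0, [1.0, 2.0], y_device)
print(y_switch, y_device)
```

Both styles compute the same result. The second only moves the CPU/GPU separation out of the layers, which matches the remark that such a cleanup is mostly refactoring and should not change speed when the same BLAS library is used underneath.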

tqchen · 2014-05-12T17:37:33Z

I was brought to this thread by @kloudkl. Bing and I are glad that mshadow and cxxnet are being noticed. Indeed, we learned from caffe when implementing cxxnet, specifically the im2col way of doing convolution, which was new to us before we learned it from caffe.

There should not be a significant speed difference between the two implementations, though cxxnet unpacks and packs multiple images at a time to do conv, and I don't know whether that is already supported in the most recent version of caffe.

I would like to advertise mshadow a bit :) MShadow itself is also concise, with 3k lines of code and only 4 CUDA kernels so far, thanks to the use of expression templates. It would be great if some parts of caffe could also use mshadow. Because mshadow can take a plugged-in pointer and run on it, this could easily be done without replacing the blob structure, while allowing expressions to be written in update rules and layers.
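The idea behind expression templates can be illustrated with a runtime analogue. In C++ the expression tree is built from types at compile time; Python has no templates, so the sketch below builds the tree from objects instead. It is not mshadow's API: `Vec`, `Scalar`, `BinOp`, and `assign` are invented names. It wraps an existing list without copying, loosely mirroring the point about running on storage that is already there.

```python
import operator


class Expr:
    """Base class: arithmetic builds an expression tree, nothing is computed yet."""

    def __add__(self, other):
        return BinOp(self, wrap(other), operator.add)

    def __sub__(self, other):
        return BinOp(self, wrap(other), operator.sub)

    def __mul__(self, other):
        return BinOp(self, wrap(other), operator.mul)

    def __rmul__(self, other):
        return BinOp(wrap(other), self, operator.mul)


class Scalar(Expr):
    def __init__(self, value):
        self.value = value

    def at(self, i):
        return self.value


class BinOp(Expr):
    def __init__(self, lhs, rhs, fn):
        self.lhs, self.rhs, self.fn = lhs, rhs, fn

    def at(self, i):
        # Element i of the result is computed on demand from element i of the operands.
        return self.fn(self.lhs.at(i), self.rhs.at(i))


class Vec(Expr):
    def __init__(self, storage):
        self.storage = storage  # wraps the caller's list, no copy

    def at(self, i):
        return self.storage[i]

    def assign(self, expr):
        # The whole expression is evaluated in one pass, with no temporary vectors.
        for i in range(len(self.storage)):
            self.storage[i] = expr.at(i)
        return self


def wrap(x):
    return x if isinstance(x, Expr) else Scalar(x)


# An SGD-style update written as a single expression over existing buffers.
weights = [1.0, 2.0, 3.0]
grads = [2.0, 2.0, 2.0]
learning_rate = 0.25
w, g = Vec(weights), Vec(grads)
w.assign(w - learning_rate * g)
print(weights)
```

The update is written once as `w - learning_rate * g`, and the loop in `assign` is the only place where elements are actually computed, which is the property that lets a library with few hand-written kernels still express many operations.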

This was referenced May 14, 2014

Add support for opencl #408

Closed

Unify the CPU, CUDA and OpenCL math functions API in the device wrapper classes #415

Closed

Yangqing closed this as completed Jun 6, 2014
