Unify the CPU, CUDA and OpenCL math functions API in the device wrapper classes #415
Conversation
Hmm...it looks like you're trying to abstract away the CPU/GPU distinction so we might not have to write separate code for CPU/GPU. This would be nice, but I'm not really sure it's feasible through something like this, and it would have to come at no (or very minimal) cost to performance. I might be wrong though -- if you think we really can abstract away the distinction without incurring performance costs, feel free to continue down this path. What I meant in my comment in #408 was simply to move all CUDA-specific functionality in the main code into wrapper classes (so that these wrappers could then be reimplemented in OpenCL) -- basically continuing in the spirit of math_functions.cu (and using these already available functions where they aren't used in the code, eg |
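For context, the kind of wrapper class being suggested might look roughly like the sketch below. This is illustrative only: the names MathBackend and CPUMathBackend are hypothetical, not the PR's actual classes, and the caffe_* calls are the existing math helpers from math_functions.hpp.

```cpp
// Minimal sketch of a device-agnostic math wrapper in the spirit of
// math_functions.{cpp,cu}. Class names are hypothetical.
#include "caffe/util/math_functions.hpp"  // caffe_cpu_gemm, caffe_axpy, caffe_copy, ...

namespace caffe {

template <typename Dtype>
class MathBackend {
 public:
  virtual ~MathBackend() {}
  // C = alpha * op(A) * op(B) + beta * C
  virtual void gemm(const CBLAS_TRANSPOSE trans_A, const CBLAS_TRANSPOSE trans_B,
                    const int M, const int N, const int K, const Dtype alpha,
                    const Dtype* A, const Dtype* B,
                    const Dtype beta, Dtype* C) const = 0;
  // y = alpha * x + y
  virtual void axpy(const int N, const Dtype alpha,
                    const Dtype* x, Dtype* y) const = 0;
  // y = x
  virtual void copy(const int N, const Dtype* x, Dtype* y) const = 0;
};

// CPU backend: delegates to the existing CBLAS-based helpers.
template <typename Dtype>
class CPUMathBackend : public MathBackend<Dtype> {
 public:
  virtual void gemm(const CBLAS_TRANSPOSE trans_A, const CBLAS_TRANSPOSE trans_B,
                    const int M, const int N, const int K, const Dtype alpha,
                    const Dtype* A, const Dtype* B,
                    const Dtype beta, Dtype* C) const {
    caffe_cpu_gemm<Dtype>(trans_A, trans_B, M, N, K, alpha, A, B, beta, C);
  }
  virtual void axpy(const int N, const Dtype alpha,
                    const Dtype* x, Dtype* y) const {
    caffe_axpy<Dtype>(N, alpha, x, y);
  }
  virtual void copy(const int N, const Dtype* x, Dtype* y) const {
    caffe_copy<Dtype>(N, x, y);
  }
};

// A GPU backend would forward to caffe_gpu_gemm / caffe_gpu_axpy instead,
// and an OpenCL backend could later reimplement the same interface.

}  // namespace caffe
```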
Remember also that OpenCL has CPU support, e.g. pocl and the AMD and Intel CPU SDKs/backends. |
This method does work as expected, at least for the ConcatLayer. There is one caveat when using the platform-independent versions of Forward and Backward: the mode has to be set before the layer is constructed, which is not the case in the original test_concat_layer.cpp. The TestCPUNum test for the double type is actually constructed in GPU mode, which was set by TestGPUGradient for the float type; the mode is only set to CPU afterwards. This causes mutable_data to call mutable_cpu_data while this->math_.copy calls caffe_gpu_mode, since this->math_ is initialized in the constructor of Layer.
I did not choose to call MathBackendFactory::GetMathBackend on the fly because the mode will probably change during the lifetime of a layer object. Locking a layer object into a mode when it is created makes it device-type safe. Maybe the device type should be added to the constructor's parameter list or to the template parameters of the layers. |
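A rough sketch of that construction-time lookup, assuming the MathBackend classes from the sketch above. Only MathBackendFactory::GetMathBackend is named in the thread; GPUMathBackend and the singleton scheme here are assumptions for illustration.

```cpp
// Sketch only: the factory reads Caffe::mode() once, at layer construction,
// so a layer keeps whichever backend was current when it was created.
#include "caffe/common.hpp"  // Caffe::mode(), Caffe::CPU, Caffe::GPU

namespace caffe {

template <typename Dtype>
class MathBackendFactory {
 public:
  static MathBackend<Dtype>* GetMathBackend() {
    if (Caffe::mode() == Caffe::CPU) {
      static CPUMathBackend<Dtype> cpu_backend;
      return &cpu_backend;
    }
    static GPUMathBackend<Dtype> gpu_backend;  // hypothetical: CUDA today, OpenCL later
    return &gpu_backend;
  }
};

// In Layer's constructor (the "locking" step described above):
//   math_ = MathBackendFactory<Dtype>::GetMathBackend();

}  // namespace caffe
```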
@jeffdonahue was right. I should "move all CUDA-specific functionality in the main code into wrapper classes". |
The CPU/GPU versions of the Forward/Backward methods of all the layers that don't use custom kernels in the GPU versions of these methods have been unified via the device wrapper classes. A great deal of duplicated code has been eliminated. |
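To make the unification concrete, here is a hedged sketch of what a single, device-agnostic Forward could look like for a concat-style layer. The class name ConcatLikeLayer, the mode-aware data()/mutable_data() accessors, and storing math_ as a pointer are assumptions for illustration; the PR's actual code differs in detail.

```cpp
// Sketch only: one Forward instead of Forward_cpu / Forward_gpu, for a layer
// whose GPU path needs no custom kernels.
template <typename Dtype>
Dtype ConcatLikeLayer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
                                      vector<Blob<Dtype>*>* top) {
  Dtype* top_data = (*top)[0]->mutable_data();  // resolves to CPU or GPU memory
  int offset = 0;
  for (size_t i = 0; i < bottom.size(); ++i) {
    const int count = bottom[i]->count();
    // Same call regardless of mode; math_ was chosen at construction time.
    this->math_->copy(count, bottom[i]->data(), top_data + offset);
    offset += count;
  }
  return Dtype(0.);
}
```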
Since I don't have access to a GPU right now, only the CPU code can be tested. The result of
Anyone interested in this feature is welcome to help me run all the tests. You will need to install hub. Check out this PR by running |
I don't think it's worth the complication to further merge the layers that use CUDA kernel functions in their {Forward, Backward}_gpu methods. So this PR is done and ready to be reviewed. |
With regard to #408, it appears more difficult to also unify the API of clBLAS. Take single-precision general matrix-matrix multiplication as an example: the clBLAS API is quite different from the BLAS/cuBLAS APIs, since it adds many extra parameters and uses cl_mem instead of float* to pass arrays (see the paraphrased declarations below).
|
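For comparison, the paraphrased single-precision GEMM declarations below show the gap: clBLAS takes cl_mem handles plus buffer offsets, command queues, and event lists, where CBLAS takes raw float* pointers. Exact signatures may differ slightly between versions, so treat these as illustrative and check the actual headers.

```cpp
// Paraphrased from cblas.h (CPU path):
void cblas_sgemm(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB,
                 const int M, const int N, const int K,
                 const float alpha, const float *A, const int lda,
                 const float *B, const int ldb,
                 const float beta, float *C, const int ldc);

// Paraphrased from clBLAS.h (OpenCL path): cl_mem handles plus offsets,
// command queues, and events.
clblasStatus clblasSgemm(clblasOrder order,
                         clblasTranspose transA, clblasTranspose transB,
                         size_t M, size_t N, size_t K,
                         cl_float alpha,
                         const cl_mem A, size_t offA, size_t lda,
                         const cl_mem B, size_t offB, size_t ldb,
                         cl_float beta,
                         cl_mem C, size_t offC, size_t ldc,
                         cl_uint numCommandQueues, cl_command_queue *commandQueues,
                         cl_uint numEventsInWaitList, const cl_event *eventWaitList,
                         cl_event *events);
```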
Please take a look at how this is managed here (with a generic_blas file plus CUDA and OpenCL files): https://github.com/Theano/libgpuarray/tree/master/src |
The OpenCLDevice methods are directly inspired by the implementations of Theano/libgpuarray.
|
If you have a supported Intel platform, you could test on the GPU with the official Intel open-source implementation, or with the Intel SDK on Ubuntu 14.04. The AMD OpenCL SDK also works on the CPU. |
When installing opencl_runtime_14.1_x64_4.4.0.117.tgz and intel_sdk_for_ocl_applications_2014_ubuntu_4.4.0.117_x64.tgz on Ubuntu 14.04, the package management system can't identify the installed deb files.
Someone said that they could be installed on Ubuntu 13.04 and 12.04. |
My laptop only has an Intel CPU. Would the AMD SDK work for it? |
It seems that you are trying to install two versions: opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb conflicts with the other one. Why do you have two versions? As far as I can remember, the AMD SDK works on x86 CPUs with SSE 2.x or later (including non-AMD CPUs). Beignet currently works on Intel Ivy Bridge GPUs, so if you have an Ivy Bridge laptop you can also test Beignet. |
Obviously, AMD provides the more flexible cross-platform OpenCL SDK in order to survive in the market, while Intel does not bother to support other vendors. |
Yes, but generally the ICD loader lets you have a multi-vendor, multi-implementation (and multi-device) setup with maximum flexibility for the user: http://wiki.tiker.net/OpenCLHowTo |
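As a small illustration of what the ICD loader provides, the sketch below (not part of this PR) enumerates every installed OpenCL platform, whichever vendor it comes from (Intel, AMD, pocl, Beignet, ...).

```cpp
// Illustrative only: list every OpenCL platform the ICD loader can see.
// Build with e.g. g++ list_platforms.cpp -lOpenCL
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, NULL, &num_platforms);
  if (num_platforms == 0) {
    std::printf("No OpenCL platforms found.\n");
    return 0;
  }
  std::vector<cl_platform_id> platforms(num_platforms);
  clGetPlatformIDs(num_platforms, platforms.data(), NULL);
  for (cl_uint i = 0; i < num_platforms; ++i) {
    char name[256] = {0};
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
    std::printf("Platform %u: %s\n", i, name);
  }
  return 0;
}
```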
Can anyone comment on the status of this PR? I'm not so much interested in the OpenCL stuff itself, but the abstraction here is nice and would make it easier to modify the build process to compile only CPU code if desired, which I am interested in. If this PR might be merged soon, I could branch off of it to start that work on the build process. |
@robwhess +1 |
+1 @robwhess on this. Would be glad to help out with the testing |
@Yangqing @jeffdonahue let's take a look at this after CVPR and see if we can bring this to a nice, abstract conclusion. @robwhess I agree the CPU/GPU split progress is important. If you'd like to help review this in light of the work you have planned, please do comment inline and we'll see if this can be merged soon. |
My GPU tests fail poorly on this branch when using CUDA: ...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN ] ConvolutionLayerTest/1.TestSetup
[ OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestCPUSimpleConvolution
[ OK ] ConvolutionLayerTest/1.TestCPUSimpleConvolution (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
make: *** [runtest] Bus error: 10
I'm digging in to see what's causing this problem. Note that I'm not using OpenCL (and don't even have clBLAS installed). To get the code to compile, I had to add (or uncomment)
I also had to add an |
I'm also occasionally getting a nasty crash running the tests that hangs my whole machine (Mac OS X 10.9) and requires a hard reboot. I managed to capture this stack trace before total freeze: ...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN ] ConvolutionLayerTest/1.TestSetup
[ OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
F0624 15:54:26.503756 2042675984 syncedmem.cpp:35] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x108ffba8a google::LogMessage::Fail()
@ 0x108fface8 google::LogMessage::SendToLog()
@ 0x108ffb73a google::LogMessage::Flush()
@ 0x108fff0f8 google::LogMessageFatal::~LogMessageFatal()
@ 0x108ffbf25 google::LogMessageFatal::~LogMessageFatal()
@ 0x10399cebc caffe::SyncedMemory::to_cpu()
@ 0x10399cc1f caffe::SyncedMemory::cpu_data()
@ 0x1039547e7 caffe::Blob<>::cpu_data()
@ 0x1037fc490 caffe::ConvolutionLayerTest_TestGPUSimpleConvolution_Test<>::TestBody()
@ 0x10392319c testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x1039133aa testing::Test::Run()
@ 0x1039142f2 testing::TestInfo::Run()
@ 0x1039149c0 testing::TestCase::Run()
@ 0x103919f07 testing::internal::UnitTestImpl::RunAllTests()
@ 0x103923a94 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x103919c19 testing::UnitTest::Run()
@ 0x1037cc669 main
@ 0x7fff8a3fb5fd start
make: *** [runtest] Abort trap: 6
(I'm assuming this is related to the freeze.) This may be related to the bus error I was seeing in the comment above, based on the fact that it's occurring during the same test. Still digging in to find the root of this problem. |
I'm testing this and fixing those bugs right now. After all the tests pass, I will diff this branch with yours. |
OK, but I don't quite understand why you want to replicate work I've already done. Is there a reason you don't want to start fresh from my branch? This seems to be the path of least resistance going forward, since my branch is already tested and working on a CUDA machine (it sounds like you don't have a CUDA machine to test with), and I've cherry-picked around your OpenCL commits to make a clean revision history without OpenCL code. All you'd need to do is take the additional abstractions you made and the on-the-fly device checking and commit those things to the new branch (or I could do that). This should be easy compared to examining a hundred calls to …
Either way, if you want to stick with this branch, can you please make me a collaborator on it so it's easier for me to contribute to this PR going forward? Like I said above, if we decide to switch to my branch, we can either do it as a fork in your repo with me as a collaborator, or I can fork into a repo under my account and make you a collaborator there. |
@robwhess, I have just made all the tests pass. As you said, some of the operations can only be performed on CPU or on GPU pointers. There are too many pitfalls around the exact state of each pointer and each operation; it is only manageable when the layers are aware of which mode they are running in. You should open a PR so that @shelhamer can set up a feature branch. Then the layer-wise mode awareness can be added by you or someone else. I will rebase and focus on the OpenCL stuff in this one. |
@kloudkl @robwhess going from your latest conversation I've promoted this to a feature branch at https://github.com/BVLC/caffe/tree/device-abstraction |
Heads-up that #555 might simplify the CPU / GPU split... or it might just lead to more rebasing. |
Thanks @shelhamer. @kloudkl, sorry, I should have realized your primary goal here was the OpenCL part. I'm going to continue working on the device abstraction in my own branch. I'll start by pulling in the additional abstractions @kloudkl made to layers, |
@robwhess thanks for working on the device abstraction. Note that #555 is planned for merge ahead of this, so …
As a side note, I am slightly worried by d7014d9. If the rebase went through with all the conflicts resolved properly, then there should be nothing left to fix up at the end. Although it is a comfort that you have the tests passing. |
@shelhamer d7014d9 was my fault. The conflicts were so tedious and time consuming (several hours) that I missed a few things and had to go back and fix them. I will wait until #555 is merged and then rebase to adapt |
Note the CPU / GPU split portion of this work is now being carried out at #610. |
(Branch force-pushed from 4278286 to c01f07a.)
Please also consider this news: ArrayFire is now under a BSD license. |
Months later, I think that distributed training is much more important than cross-device compatibility. Many businesses cannot wait two or more weeks to train a model, even if many models can be trained at the same time. There is very high demand for training a model on millions of samples or more in a single day, although it is still desirable to deploy the same model on multiple types of devices. |
Reading through this thread, and #408, is it not entirely unreasonable to assume that migrating Caffe to work with OpenCL would be a considerable amount of work, would potentially conflict with several CUDA optimizations, and would generally go 'against the flow' of what other Caffe contributors are looking to achieve? I think this is the case, and that's why I'm instead writing an OpenCL convolutional network library 'from the ground up' at https://github.com/hughperkins/ClConvolve, but I'm just touching base in case my current approach is a bit too 'not invented here'. |
@hughperkins Looks like an interesting project coming together. All: given that @kloudkl will no longer be contributing, is this PR still under review, or is this specific effort of bringing some abstraction / OpenCL support to the project ready to be closed? |
@momer: Thank-you momer :-) |
@hughperkins @momer this PR is still open to remind us about the effort to abstract devices. Until this effort is revived, it'll stay here as a placeholder and an example of one approach. Once the abstraction to host both CUDA and OpenCL implementations arrives, @hughperkins' layer implementations could be helpful! |
Closing as this has been carried on in #610 which will itself be replaced by a master edition for resurrection someday. |
This PR wraps the math functions as suggested by both Yangqing in #382 and jeffdonahue in #408 to abstract the device type from the algorithms.