
Unify the CPU, CUDA and OpenCL math functions API in the device wrapper classes #415

Closed
wants to merge 75 commits into from

Conversation

@kloudkl (Contributor) commented May 14, 2014

This PR wraps the math functions as suggested by both Yangqing in #382 and jeffdonahue in #408 to abstract the device type from the algorithms.

@jeffdonahue (Contributor)

Hmm...it looks like you're trying to abstract away the CPU/GPU distinction so we might not have to write separate code for CPU/GPU. This would be nice, but I'm not really sure it's feasible through something like this, and it would have to come at no (or very minimal) cost to performance. I might be wrong though -- if you think we really can abstract away the distinction without incurring performance costs, feel free to continue down this path.

What I meant in my comment in #408 was simply to move all CUDA-specific functionality in the main code into wrapper classes (so that these wrappers could then be reimplemented in OpenCL) -- basically continuing in the spirit of math_functions.cu (and using these already available functions where they aren't used in the code, eg cudaMemcpy instead of caffe_gpu_copy) so that the main codebase does not contain any cuda calls.
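To make that concrete, here is a minimal sketch (illustrative only, not Caffe's actual code; the real caffe_gpu_copy signature may differ) of the kind of wrapper meant here: the main codebase calls the wrapper and never touches cudaMemcpy directly, so an OpenCL port only has to reimplement the wrapper.

    // Hypothetical math_functions.cu-style wrapper; CUDA_CHECK is Caffe's
    // CUDA error-checking macro and <cuda_runtime.h> is assumed included.
    template <typename Dtype>
    void caffe_gpu_copy(const int n, const Dtype* x, Dtype* y) {
      // The only place cudaMemcpy appears; an OpenCL backend would swap in
      // clEnqueueCopyBuffer here without touching any layer code.
      CUDA_CHECK(cudaMemcpy(y, x, sizeof(Dtype) * n, cudaMemcpyDefault));
    }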

@bhack (Contributor) commented May 14, 2014

Remember also that OpenCL has CPU support, e.g. pocl and the AMD and Intel CPU SDKs/backends.

@kloudkl (Contributor, Author) commented May 15, 2014

This method does work as expected, at least for the ConcatLayer.

There is one caveat when using the platform-independent versions of Forward and Backward: the mode has to be set before the layer is constructed. That is not the case in the original test_concat_layer.cpp. The TestCPUNum test for the double type is actually constructed in GPU mode, which was set by the TestGPUGradient test for the float type; only afterwards is the mode set to CPU. This caused mutable_data to call mutable_cpu_data while this->math_.copy called the GPU copy, since this->math_ is initialized in the constructor of Layer.

 -  ConcatLayer<TypeParam> layer(layer_param);
    Caffe::set_mode(Caffe::CPU);
 +  ConcatLayer<TypeParam> layer(layer_param);

I did not choose to call MathBackendFactory::GetMathBackend on the fly because the mode will probably change during the lifetime of a layer object. Locking a layer object into a mode when it is created makes it device-type safe. Maybe the device type should be added to the constructor's parameter list or to the template parameters of the layers.
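A stripped-down, self-contained sketch of that construction-time binding (the MathBackend/MathBackendFactory names follow this PR, but the bodies here are placeholders and the real Caffe signatures differ):

    #include <cstring>

    enum class Mode { CPU, GPU };

    struct MathBackend {                       // device-agnostic math interface
      virtual ~MathBackend() {}
      virtual void copy(int n, const float* x, float* y) = 0;
    };

    struct CPUMathBackend : MathBackend {
      void copy(int n, const float* x, float* y) override {
        std::memcpy(y, x, sizeof(float) * n);  // plain CPU copy
      }
    };

    struct MathBackendFactory {
      static Mode mode;                        // stands in for Caffe::mode()
      static MathBackend* GetMathBackend() {
        static CPUMathBackend cpu;             // a GPUMathBackend would be
        return &cpu;                           // returned here when mode == GPU
      }
    };
    Mode MathBackendFactory::mode = Mode::CPU;

    class LayerSketch {
     public:
      // The backend is fixed when the layer is built, which is why
      // Caffe::set_mode() has to run before the layer constructor.
      LayerSketch() : math_(*MathBackendFactory::GetMathBackend()) {}
     protected:
      MathBackend& math_;
    };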

@kloudkl (Contributor, Author) commented May 24, 2014

@jeffdonahue was right. I should "move all CUDA-specific functionality in the main code into wrapper classes".

@kloudkl (Contributor, Author) commented May 26, 2014

The CPU/GPU versions of the Forward/Backward methods of all the layers that don't use kernels in the GPU versions of those methods have been unified by using the device wrapper classes. Duplicate code has been largely eliminated.
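For illustration, the unified shape of such a method looks roughly like the following sketch (simplified: the real ConcatLayer also handles the concatenation dimension, and mutable_data()/const_data() stand in for the mode-aware accessors this PR adds):

    // Illustrative only: one Forward instead of Forward_cpu/Forward_gpu.
    template <typename Dtype>
    void ConcatLayer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
                                     vector<Blob<Dtype>*>* top) {
      Dtype* top_data = (*top)[0]->mutable_data();
      int offset = 0;
      for (size_t i = 0; i < bottom.size(); ++i) {
        // The same call runs on the CPU or the GPU, depending on which
        // backend this->math_ was bound to at construction time.
        this->math_.copy(bottom[i]->count(), bottom[i]->const_data(),
                         top_data + offset);
        offset += bottom[i]->count();
      }
    }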

@kloudkl kloudkl changed the title Wrap the CPU and GPU math functions in math backend classes Wrap the CPU and GPU specific functions in device wrapper classes May 26, 2014
@kloudkl (Contributor, Author) commented Jun 2, 2014

Since I don't have access to a GPU right now, only the CPU code can be tested. The result of build/test/test_all.testbin --gtest_filter="*CPU*" &> test.log is as follows.

[----------] Global test environment tear-down
[==========] 169 tests from 51 test cases ran. (14848 ms total)
[  PASSED  ] 169 tests.

  YOU HAVE 2 DISABLED TESTS

Anyone interested in this feature is welcome to help me run all the tests. You will need to install hub. Check out this PR by running hub checkout https://github.com/BVLC/caffe/pull/415 device_wrapper, then build and run the tests as usual.

@kloudkl (Contributor, Author) commented Jun 2, 2014

I don't think it's worth the complication to further merge the layers that use CUDA kernel functions in their {Forward, Backward}_gpu methods.

So this PR is done and ready to be reviewed.

@kloudkl (Contributor, Author) commented Jun 3, 2014

With regard to #408, it appears more difficult to also unify the API with clBLAS. Taking single-precision general matrix-matrix multiplication as an example, the clBLAS API is quite different from the BLAS/cuBLAS APIs since it adds many extra parameters and uses cl_mem instead of float* to pass in arrays.
clBLAS

clblasStatus    clblasSgemm (
   clblasOrder order, clblasTranspose transA, clblasTranspose transB, size_t M, 
   size_t N, size_t K, cl_float alpha, const cl_mem A, size_t offA, size_t lda, 
   const cl_mem B, size_t offB, size_t ldb, cl_float beta, cl_mem C, size_t offC, 
   size_t ldc, cl_uint numCommandQueues, cl_command_queue *commandQueues, 
   cl_uint numEventsInWaitList, const cl_event *eventWaitList, cl_event *events)

BLAS

void cblas_sgemm (
   const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
   const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, 
   const float alpha, const float *A, const int lda, const float *B, const int ldb, 
   const float beta, float *C, const int ldc);

cuBLAS

cublasStatus_t cublasSgemm(
   cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m,
   int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, 
   const float *beta, float *C, int ldc)

cublasXt

cublasStatus_t cublasXtMgSgemm(
   cublasXtHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, 
   size_t m, size_t n, size_t k, const float *alpha, const float *A, int lda,
   const float *B, int ldb, const float *beta, float *C, int ldc)
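Despite these differences, a single device-level interface can still hide them. A rough, self-contained sketch (illustrative names; the CPU version uses a naive loop so the sketch compiles without BLAS, whereas real wrappers would call cblas_sgemm, cublasSgemm or clblasSgemm and keep the extra clBLAS arguments, i.e. cl_mem buffers, offsets, command queues and events, private to the OpenCL implementation):

    struct Device {
      virtual ~Device() {}
      // Row-major C = alpha * A(MxK) * B(KxN) + beta * C, no transposes.
      virtual void sgemm(int M, int N, int K, float alpha, const float* A,
                         const float* B, float beta, float* C) = 0;
    };

    struct CPUDevice : Device {
      void sgemm(int M, int N, int K, float alpha, const float* A,
                 const float* B, float beta, float* C) override {
        for (int i = 0; i < M; ++i)
          for (int j = 0; j < N; ++j) {
            float acc = 0.f;
            for (int k = 0; k < K; ++k) acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = alpha * acc + beta * C[i * N + j];
          }
      }
    };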

@bhack (Contributor) commented Jun 3, 2014

Please take a look at how it is managed here (with a generic_blas file plus CUDA and OpenCL files): https://github.com/Theano/libgpuarray/tree/master/src

@kloudkl (Contributor, Author) commented Jun 8, 2014

The OpenCLDevice methods are directly inspired by the implementations of Theano/libgpuarray.
TODO list:

  1. Building with both the make and the CMake scripts. The Intel OpenCL SDK can't be installed on Ubuntu 14.04, so the OpenCL code hasn't been compiled yet, even for the CPU.
  2. Document the installation of OpenCL.
  3. Tests.
  4. Integrate with the algorithms.

@kloudkl kloudkl changed the title Wrap the CPU and GPU specific functions in device wrapper classes Unify the CPU, GPU and OpenCL math functions API in the device wrapper classes Jun 8, 2014
@kloudkl kloudkl changed the title Unify the CPU, GPU and OpenCL math functions API in the device wrapper classes Unify the CPU, CUDA and OpenCL math functions API in the device wrapper classes Jun 8, 2014
@bhack (Contributor) commented Jun 8, 2014

If you have a supported Intel platform, you could test on the GPU with the official Intel open-source implementation:
http://www.freedesktop.org/wiki/Software/Beignet/

For the Intel SDK on Ubuntu 14.04:
http://stackoverflow.com/questions/23420814/ubuntu-14-04-opencl-intel-sdk-error

The AMD OpenCL SDK also works on CPUs:
http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/

@kloudkl (Contributor, Author) commented Jun 8, 2014

When installing opencl_runtime_14.1_x64_4.4.0.117.tgz and intel_sdk_for_ocl_applications_2014_ubuntu_4.4.0.117_x64.tgz on Ubuntu 14.04, the package management system can't identify the installed deb files.

sudo dpkg -i *.deb
(Reading database ... 217900 files and directories currently installed.)
Preparing to unpack opencl-1.2-base-4.4.0.117-1.x86_64.deb ...
Unpacking opencl-base (1.2-4.4.0.117) ...
dpkg: error processing archive opencl-1.2-base-4.4.0.117-1.x86_64.deb (--install):
 trying to overwrite '/opt/intel/opencl-1.2-4.4.0.117/lib64/libOpenCL.so.1.2', which is also in package opencl-1.2-base 4.4.0.117-2
Preparing to unpack opencl-1.2-devel-4.4.0.117-1.x86_64.deb ...
Unpacking opencl-devel (1.2-4.4.0.117) over (1.2-4.4.0.117) ...
Preparing to unpack opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb ...
Unpacking opencl-intel-cpu (1.2-4.4.0.117) ...
dpkg: error processing archive opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb (--install):
 trying to overwrite '/opt/intel/opencl-1.2-4.4.0.117/doc/llvm_release_license.txt', which is also in package opencl-1.2-intel-cpu 4.4.0.117-2
dpkg-deb: error: subprocess paste was killed by signal (Broken pipe)
Preparing to unpack opencl-1.2-intel-devel-4.4.0.117-1.x86_64.deb ...
Unpacking opencl-intel-devel (1.2-4.4.0.117) over (1.2-4.4.0.117) ...
Preparing to unpack opencl-1.2-intel-devel-android-4.4.0.117-1.x86_64.deb ...
Unpacking opencl-intel-devel-android (1.2-4.4.0.117) over (1.2-4.4.0.117) ...
dpkg: dependency problems prevent configuration of opencl-devel:
 opencl-devel depends on opencl-base (>= 1.2-4.4.0.117); however:
  Package opencl-base is not installed.

dpkg: error processing package opencl-devel (--install):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of opencl-intel-devel:
 opencl-intel-devel depends on opencl-base (>= 1.2-4.4.0.117); however:
  Package opencl-base is not installed.
 opencl-intel-devel depends on opencl-intel-cpu (>= 1.2-4.4.0.117); however:
  Package opencl-intel-cpu is not installed.

dpkg: error processing package opencl-intel-devel (--install):
 dependency problems - leaving unconfigured
dpkg: dependency problems prevent configuration of opencl-intel-devel-android:
 opencl-intel-devel-android depends on opencl-base (>= 1.2-4.4.0.117); however:
  Package opencl-base is not installed.
 opencl-intel-devel-android depends on opencl-intel-cpu (>= 1.2-4.4.0.117); however:
  Package opencl-intel-cpu is not installed.

dpkg: error processing package opencl-intel-devel-android (--install):
 dependency problems - leaving unconfigured
Errors were encountered while processing:
 opencl-1.2-base-4.4.0.117-1.x86_64.deb
 opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb
 opencl-devel
 opencl-intel-devel
 opencl-intel-devel-android

Someone said that they could be installed on Ubuntu 13.04 and 12.04.

@kloudkl (Contributor, Author) commented Jun 8, 2014

My laptop only has an Intel CPU. Is the AMD SDK effective there?

@kloudkl kloudkl mentioned this pull request Jun 8, 2014
@bhack (Contributor) commented Jun 8, 2014

It seems that you are trying to install two versions: opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb

conflicts with:
opencl-1.2-intel-cpu-4.4.0.117-2.x86_64.deb

Why do you have two versions?

As far as I can remember, the AMD SDK works on x86 CPUs with SSE 2.x or later (including non-AMD CPUs).

Beignet currently works on Intel Ivy Bridge GPUs, so if you have an Ivy Bridge laptop you can also test Beignet.

@kloudkl kloudkl mentioned this pull request Jun 18, 2014
@kloudkl (Contributor, Author) commented Jun 18, 2014

Obviously, AMD provides a more flexible, cross-platform OpenCL SDK in order to survive in the market, while Intel does not bother taking care of other vendors.

@bhack (Contributor) commented Jun 18, 2014

Yes, but generally the ICD loader lets you have multi-vendor/multi-implementation (and multi-device) setups with maximum flexibility for the user: http://wiki.tiker.net/OpenCLHowTo

@robwhess

Can anyone comment on the status of this PR? I'm not so much interested in the OpenCL stuff itself, but the abstraction here is nice and would make it easier to modify the build process to compile only CPU code if desired, which I am interested in. If this PR might be merged soon, I could branch off of it to start that work on the build process.

@cypof (Member) commented Jun 24, 2014

@robwhess +1

@huyng (Contributor) commented Jun 24, 2014

+1 @robwhess on this. Would be glad to help out with the testing

@shelhamer (Member)

@Yangqing @jeffdonahue let's take a look at this after CVPR and see if we can bring this to a nice, abstract conclusion.

@robwhess I agree the CPU/GPU split progress is important. If you'd like to help review this in light of the work you have planned, please do comment inline and we'll see if this can be merged soon.

@robwhess

My GPU tests fail badly on this branch when using CUDA:

...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN      ] ConvolutionLayerTest/1.TestSetup
[       OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN      ] ConvolutionLayerTest/1.TestCPUSimpleConvolution
[       OK ] ConvolutionLayerTest/1.TestCPUSimpleConvolution (0 ms)
[ RUN      ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
make: *** [runtest] Bus error: 10

I'm digging in to see what's causing this problem.

Note that I'm not using OpenCL (and don't even have clBLAS installed). To get the code to compile, I had to add (or uncomment) #ifdef USE_OPENCL ... #endif blocks around the entirety of the following files:

  • src/caffe/opencl_syncedmem.cpp
  • src/caffe/test/test_opencl_math_functions.cpp
  • src/caffe/test/test_opencl_syncedmem.cpp
  • src/caffe/util/opencl_device.cpp
  • src/caffe/util/opencl_math_functions.cpp

I also had to add an #ifdef USE_OPENCL ... #endif around the #include "caffe/opencl_syncedmem.hpp" line in include/caffe/syncedmem_factory.hpp. Similar #ifdefs should probably be included in the final pull.
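For reference, the guard pattern meant here is just the following (USE_OPENCL is assumed to be defined by the build only when OpenCL is enabled):

    // e.g. wrapping the whole of src/caffe/util/opencl_math_functions.cpp
    #ifdef USE_OPENCL
    // ... OpenCL-only code, including #include "caffe/opencl_syncedmem.hpp" ...
    #endif  // USE_OPENCL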

@robwhess

I'm also occasionally getting a nasty crash running the tests that hangs my whole machine (Mac OS X 10.9) and requires a hard reboot. I managed to capture this stack trace before total freeze:

...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN      ] ConvolutionLayerTest/1.TestSetup
[       OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN      ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
F0624 15:54:26.503756 2042675984 syncedmem.cpp:35] Check failed: error == cudaSuccess (4 vs. 0)  unspecified launch failure
*** Check failure stack trace: ***
    @        0x108ffba8a  google::LogMessage::Fail()
    @        0x108fface8  google::LogMessage::SendToLog()
    @        0x108ffb73a  google::LogMessage::Flush()
    @        0x108fff0f8  google::LogMessageFatal::~LogMessageFatal()
    @        0x108ffbf25  google::LogMessageFatal::~LogMessageFatal()
    @        0x10399cebc  caffe::SyncedMemory::to_cpu()
    @        0x10399cc1f  caffe::SyncedMemory::cpu_data()
    @        0x1039547e7  caffe::Blob<>::cpu_data()
    @        0x1037fc490  caffe::ConvolutionLayerTest_TestGPUSimpleConvolution_Test<>::TestBody()
    @        0x10392319c  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x1039133aa  testing::Test::Run()
    @        0x1039142f2  testing::TestInfo::Run()
    @        0x1039149c0  testing::TestCase::Run()
    @        0x103919f07  testing::internal::UnitTestImpl::RunAllTests()
    @        0x103923a94  testing::internal::HandleExceptionsInMethodIfSupported<>()
    @        0x103919c19  testing::UnitTest::Run()
    @        0x1037cc669  main
    @     0x7fff8a3fb5fd  start
make: *** [runtest] Abort trap: 6

(I'm assuming this is related to the freeze.)

This may be related to the bus error I was seeing in the comment above, based on the fact that it's occurring during the same test.

Still digging in to find the root of this problem.

@kloudkl (Contributor, Author) commented Jun 28, 2014

I'm testing this and fixing those bugs right now. After all the tests pass, I will diff this branch with yours.

@robwhess

OK, but I don't quite understand why you want to replicate work I've already done. Is there a reason you don't want to start fresh from my branch? This seems to be the path of least resistance going forward, since my branch is already tested and working on a CUDA machine (it sounds like you don't have a CUDA machine to test with), and I've cherry-picked around your OpenCL commits to make a clean revision history without OpenCL code. All you'd need to do is take the additional abstractions you made and the on-the-fly device checking and commit those things to the new branch (or I could do that). This should be easy compared to examining a hundred calls to {const,mutable}_data() and deciding which ones should be cpu_data() and vice versa, which I've already done.

Either way, if you want to stick with this branch, can you please make me a collaborator on it so it's easier for me to contribute to this PR going forward? Like I said above, if we decide to switch to my branch, we can either do it as a fork in your repo with me as a collaborator, or I can fork into a repo under my account and make you a collaborator there.

@kloudkl (Contributor, Author) commented Jun 28, 2014

@robwhess, I have just made all the tests pass. As you said, some of the operations can only be conducted on the CPU or the GPU pointers. There are too many traps around the exact state of each pointer and each operation. It is only controllable when the layers are aware of which mode they are running in.

You should open a PR so that @shelhamer can set up a feature branch. Then the layer-wise mode awareness will be added by you or someone else. I will rebase and focus on the OpenCL stuff in this one.

@shelhamer (Member)

@kloudkl @robwhess going from your latest conversation I've promoted flickr:device-abstraction to BVLC:device-abstraction so that you and any other interested developers can work on it there and PR to it. Thanks both of you for all your work on this so far.

Branch at https://github.com/BVLC/caffe/tree/device-abstraction

@shelhamer (Member)

Heads-up that #555 might simplify the CPU / GPU split... or it might just lead to more rebasing.

@shelhamer shelhamer mentioned this pull request Jun 30, 2014
6 tasks
@robwhess

Thanks @shelhamer. @kloudkl, sorry, I should have realized your primary goal here was the OpenCL part. I'm going to continue working on the device abstraction in my own branch. I'll start by pulling in the additional abstractions @kloudkl made to layers, Solver, and Net. I'll work via a PR into BVLC:device-abstraction, which I'll start shortly.

@shelhamer (Member)

@robwhess thanks for working on the device abstraction. Note that #555 is planned for merge ahead of this, so device-abstraction will have to be adapted to the new memory interface.

As a side note, I am slightly worried by d7014d9. If the rebase went through with all the conflicts resolved properly then there should be nothing left to fix up at the end. Although it is a comfort that you have the tests passing.

@robwhess

@shelhamer d7014d9 was my fault. The conflicts were so tedious and time consuming (several hours) that I missed a few things and had to go back and fix them.

I will wait until #555 is merged and then rebase to adapt device-abstraction.

@shelhamer (Member)

Note the CPU / GPU split portion of this work is now being carried out at #610.

@bhack (Contributor) commented Nov 26, 2014

@kloudkl (Contributor, Author) commented Dec 1, 2014

Months later, I think that distributed training is much more important than cross-device compatibility. Many businesses cannot wait two or more weeks to train a model, even if many models can be trained at the same time. There is very high demand for training a model on millions of samples or more in a single day, although it is still desirable to deploy the same model on multiple types of devices.

@hughperkins

Reading through this thread and #408, is it not entirely unreasonable to assume that migrating Caffe to work with OpenCL would be a considerable amount of work, would potentially conflict with several CUDA optimizations, and would generally go 'against the flow' of what other Caffe contributors are looking to achieve? I think this is the case, and that's why I'm instead writing an OpenCL convolutional network library 'from the ground up' at https://github.com/hughperkins/ClConvolve, but I'm just touching base in case my current approach is a bit too 'not invented here'.

@momer commented Feb 11, 2015

@hughperkins Looks like an interesting project coming together.

All: given that @kloudkl will no longer be contributing, is this PR still under review, or is this specific effort of bringing some abstraction / OpenCL support to the project done and ready to be closed?

@hughperkins

@momer: Thank-you momer :-)

@shelhamer (Member)

@hughperkins @momer this PR is still open to remind us about the effort to abstract devices. Until this effort is revived it'll stay here as a placeholder and an example of one approach. Once the abstraction to host both CUDA and OpenCL implementations arrives, @hughperkins' layer implementations could be helpful!

@shelhamer (Member)

Closing as this has been carried on in #610 which will itself be replaced by a master edition for resurrection someday.

@shelhamer shelhamer closed this Mar 9, 2015
@bhack bhack mentioned this pull request Mar 31, 2015