Unify the CPU, CUDA and OpenCL math functions API in the device wrapper classes #415
Conversation
Hmm...it looks like you're trying to abstract away the CPU/GPU distinction so we might not have to write separate code for CPU/GPU. This would be nice, but I'm not really sure it's feasible through something like this, and it would have to come at no (or very minimal) cost to performance. I might be wrong though -- if you think we really can abstract away the distinction without incurring performance costs, feel free to continue down this path. What I meant in my comment in #408 was simply to move all CUDA-specific functionality in the main code into wrapper classes (so that these wrappers could then be reimplemented in OpenCL) -- basically continuing in the spirit of math_functions.cu (and using these already available functions where they aren't used in the code, eg |
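For context, the kind of wrapper class being suggested might look roughly like the sketch below. This is illustrative only: the names MathBackend and CPUMathBackend are hypothetical, not the PR's actual classes, and the caffe_* calls are the existing math helpers from math_functions.hpp.

```cpp
// Minimal sketch of a device-agnostic math wrapper in the spirit of
// math_functions.{cpp,cu}. Class names are hypothetical.
#include "caffe/util/math_functions.hpp"  // caffe_cpu_gemm, caffe_axpy, caffe_copy, ...

namespace caffe {

template <typename Dtype>
class MathBackend {
 public:
  virtual ~MathBackend() {}
  // C = alpha * op(A) * op(B) + beta * C
  virtual void gemm(const CBLAS_TRANSPOSE trans_A, const CBLAS_TRANSPOSE trans_B,
                    const int M, const int N, const int K, const Dtype alpha,
                    const Dtype* A, const Dtype* B,
                    const Dtype beta, Dtype* C) const = 0;
  // y = alpha * x + y
  virtual void axpy(const int N, const Dtype alpha,
                    const Dtype* x, Dtype* y) const = 0;
  // y = x
  virtual void copy(const int N, const Dtype* x, Dtype* y) const = 0;
};

// CPU backend: delegates to the existing CBLAS-based helpers.
template <typename Dtype>
class CPUMathBackend : public MathBackend<Dtype> {
 public:
  virtual void gemm(const CBLAS_TRANSPOSE trans_A, const CBLAS_TRANSPOSE trans_B,
                    const int M, const int N, const int K, const Dtype alpha,
                    const Dtype* A, const Dtype* B,
                    const Dtype beta, Dtype* C) const {
    caffe_cpu_gemm<Dtype>(trans_A, trans_B, M, N, K, alpha, A, B, beta, C);
  }
  virtual void axpy(const int N, const Dtype alpha,
                    const Dtype* x, Dtype* y) const {
    caffe_axpy<Dtype>(N, alpha, x, y);
  }
  virtual void copy(const int N, const Dtype* x, Dtype* y) const {
    caffe_copy<Dtype>(N, x, y);
  }
};

// A GPU backend would forward to caffe_gpu_gemm / caffe_gpu_axpy instead,
// and an OpenCL backend could later reimplement the same interface.

}  // namespace caffe
```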
Remember also that OpenCL has CPU support, e.g. pocl and the AMD and Intel CPU SDKs/backends. |
This method does work as expected, at least for the ConcatLayer. There is one caveat when using the platform-independent versions of Forward and Backward: the mode has to be set before the layer is constructed, which is not the case in the original test_concat_layer.cpp. The TestCPUNum test for the double type is actually constructed in GPU mode, which was set by TestGPUGradient for the float type; the mode is only set to CPU afterwards. This causes mutable_data to call mutable_cpu_data while this->math_.copy calls caffe_gpu_mode, since this->math_ is initialized in the constructor of Layer.
I did not choose to call MathBackendFactory::GetMathBackend on the fly because the mode will probably change during the lifetime of a layer object. Locking a layer object into a mode when it is created makes it device-type safe. Maybe the device type should be added to the constructor's parameter list or to the template parameters of the layers. |
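A rough sketch of that construction-time lookup, assuming the MathBackend classes from the sketch above. Only MathBackendFactory::GetMathBackend is named in the thread; GPUMathBackend and the singleton scheme here are assumptions for illustration.

```cpp
// Sketch only: the factory reads Caffe::mode() once, at layer construction,
// so a layer keeps whichever backend was current when it was created.
#include "caffe/common.hpp"  // Caffe::mode(), Caffe::CPU, Caffe::GPU

namespace caffe {

template <typename Dtype>
class MathBackendFactory {
 public:
  static MathBackend<Dtype>* GetMathBackend() {
    if (Caffe::mode() == Caffe::CPU) {
      static CPUMathBackend<Dtype> cpu_backend;
      return &cpu_backend;
    }
    static GPUMathBackend<Dtype> gpu_backend;  // hypothetical: CUDA today, OpenCL later
    return &gpu_backend;
  }
};

// In Layer's constructor (the "locking" step described above):
//   math_ = MathBackendFactory<Dtype>::GetMathBackend();

}  // namespace caffe
```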
@jeffdonahue was right. I should "move all CUDA-specific functionality in the main code into wrapper classes". |
The CPU/GPU versions of the Forward/Backward methods of all the layers that don't use custom kernels in the GPU versions of these methods have been unified via the device wrapper classes. A great deal of duplicated code has been eliminated. |
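To make the unification concrete, here is a hedged sketch of what a single, device-agnostic Forward could look like for a concat-style layer. The class name ConcatLikeLayer, the mode-aware data()/mutable_data() accessors, and storing math_ as a pointer are assumptions for illustration; the PR's actual code differs in detail.

```cpp
// Sketch only: one Forward instead of Forward_cpu / Forward_gpu, for a layer
// whose GPU path needs no custom kernels.
template <typename Dtype>
Dtype ConcatLikeLayer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
                                      vector<Blob<Dtype>*>* top) {
  Dtype* top_data = (*top)[0]->mutable_data();  // resolves to CPU or GPU memory
  int offset = 0;
  for (size_t i = 0; i < bottom.size(); ++i) {
    const int count = bottom[i]->count();
    // Same call regardless of mode; math_ was chosen at construction time.
    this->math_->copy(count, bottom[i]->data(), top_data + offset);
    offset += count;
  }
  return Dtype(0.);
}
```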
Since I don't have access to a GPU right now, only the CPU code can be tested. The result of
Anyone interested in this feature is welcome to help me run all the tests. You will need to install hub. Check out this PR by running |
I don't think it's worth the complication to further merge the layers that use CUDA kernel functions in their {Forward, Backward}_gpu methods. So this PR is done and ready to be reviewed. |
With regard to #408, it appears more difficult to also unify the API of clBLAS. Take single-precision general matrix-matrix multiplication as an example: the clBLAS API is quite different from the BLAS/cuBLAS APIs, since it adds many extra parameters and uses cl_mem instead of float* to pass arrays (see the paraphrased declarations below).
|
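For comparison, the paraphrased single-precision GEMM declarations below show the gap: clBLAS takes cl_mem handles plus buffer offsets, command queues, and event lists, where CBLAS takes raw float* pointers. Exact signatures may differ slightly between versions, so treat these as illustrative and check the actual headers.

```cpp
// Paraphrased from cblas.h (CPU path):
void cblas_sgemm(const enum CBLAS_ORDER Order,
                 const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB,
                 const int M, const int N, const int K,
                 const float alpha, const float *A, const int lda,
                 const float *B, const int ldb,
                 const float beta, float *C, const int ldc);

// Paraphrased from clBLAS.h (OpenCL path): cl_mem handles plus offsets,
// command queues, and events.
clblasStatus clblasSgemm(clblasOrder order,
                         clblasTranspose transA, clblasTranspose transB,
                         size_t M, size_t N, size_t K,
                         cl_float alpha,
                         const cl_mem A, size_t offA, size_t lda,
                         const cl_mem B, size_t offB, size_t ldb,
                         cl_float beta,
                         cl_mem C, size_t offC, size_t ldc,
                         cl_uint numCommandQueues, cl_command_queue *commandQueues,
                         cl_uint numEventsInWaitList, const cl_event *eventWaitList,
                         cl_event *events);
```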
Please take a look at how this is managed here (with a generic_blas file plus CUDA and OpenCL files): https://github.com/Theano/libgpuarray/tree/master/src |
The OpenCLDevice methods are directly inspired by the implementations of Theano/libgpuarray.
|
If you have a supported Intel platform, you could test on the GPU with the official Intel open-source implementation, or with the Intel SDK on Ubuntu 14.04. The AMD OpenCL SDK also works on the CPU. |
When installing opencl_runtime_14.1_x64_4.4.0.117.tgz and intel_sdk_for_ocl_applications_2014_ubuntu_4.4.0.117_x64.tgz on Ubuntu 14.04, the package management system can't identify the installed deb files.
Someone said that they could be installed on Ubuntu 13.04 and 12.04. |
My laptop only has an Intel CPU. Would the AMD SDK work for it? |
It seems that you are trying to install two versions: opencl-1.2-intel-cpu-4.4.0.117-1.x86_64.deb conflicts with the other one. Why do you have two versions? As far as I can remember, the AMD SDK works on x86 CPUs with SSE 2.x or later (including non-AMD CPUs). Beignet currently works on Intel Ivy Bridge GPUs, so if you have an Ivy Bridge laptop you can also test Beignet. |
Obviously, AMD provides the more flexible cross-platform OpenCL SDK in order to survive in the market, while Intel does not bother to support other vendors. |
Yes, but generally the ICD loader lets you have a multi-vendor, multi-implementation (and multi-device) setup with maximum flexibility for the user: http://wiki.tiker.net/OpenCLHowTo |
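As a small illustration of what the ICD loader provides, the sketch below (not part of this PR) enumerates every installed OpenCL platform, whichever vendor it comes from (Intel, AMD, pocl, Beignet, ...).

```cpp
// Illustrative only: list every OpenCL platform the ICD loader can see.
// Build with e.g. g++ list_platforms.cpp -lOpenCL
#include <CL/cl.h>
#include <cstdio>
#include <vector>

int main() {
  cl_uint num_platforms = 0;
  clGetPlatformIDs(0, NULL, &num_platforms);
  if (num_platforms == 0) {
    std::printf("No OpenCL platforms found.\n");
    return 0;
  }
  std::vector<cl_platform_id> platforms(num_platforms);
  clGetPlatformIDs(num_platforms, platforms.data(), NULL);
  for (cl_uint i = 0; i < num_platforms; ++i) {
    char name[256] = {0};
    clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME, sizeof(name), name, NULL);
    std::printf("Platform %u: %s\n", i, name);
  }
  return 0;
}
```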
Can anyone comment on the status of this PR? I'm not so much interested in the OpenCL stuff itself, but the abstraction here is nice and would make it easier to modify the build process to compile only CPU code if desired, which I am interested in. If this PR might be merged soon, I could branch off of it to start that work on the build process. |
@robwhess +1 |
+1 @robwhess on this. Would be glad to help out with the testing |
@Yangqing @jeffdonahue let's take a look at this after CVPR and see if we can bring this to a nice, abstract conclusion. @robwhess I agree the CPU/GPU split progress is important. If you'd like to help review this in light of the work you have planned, please do comment inline and we'll see if this can be merged soon. |
My GPU tests fail poorly on this branch when using CUDA: ...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN ] ConvolutionLayerTest/1.TestSetup
[ OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestCPUSimpleConvolution
[ OK ] ConvolutionLayerTest/1.TestCPUSimpleConvolution (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
make: *** [runtest] Bus error: 10
I'm digging in to see what's causing this problem. Note that I'm not using OpenCL (and don't even have clBLAS installed). To get the code to compile, I had to add (or uncomment)
I also had to add an |
I'm also occasionally getting a nasty crash running the tests that hangs my whole machine (Mac OS X 10.9) and requires a hard reboot. I managed to capture this stack trace before total freeze: ...
[----------] 9 tests from ConvolutionLayerTest/1, where TypeParam = double
[ RUN ] ConvolutionLayerTest/1.TestSetup
[ OK ] ConvolutionLayerTest/1.TestSetup (0 ms)
[ RUN ] ConvolutionLayerTest/1.TestGPUSimpleConvolution
F0624 15:54:26.503756 2042675984 syncedmem.cpp:35] Check failed: error == cudaSuccess (4 vs. 0) unspecified launch failure
*** Check failure stack trace: ***
@ 0x108ffba8a google::LogMessage::Fail()
@ 0x108fface8 google::LogMessage::SendToLog()
@ 0x108ffb73a google::LogMessage::Flush()
@ 0x108fff0f8 google::LogMessageFatal::~LogMessageFatal()
@ 0x108ffbf25 google::LogMessageFatal::~LogMessageFatal()
@ 0x10399cebc caffe::SyncedMemory::to_cpu()
@ 0x10399cc1f caffe::SyncedMemory::cpu_data()
@ 0x1039547e7 caffe::Blob<>::cpu_data()
@ 0x1037fc490 caffe::ConvolutionLayerTest_TestGPUSimpleConvolution_Test<>::TestBody()
@ 0x10392319c testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x1039133aa testing::Test::Run()
@ 0x1039142f2 testing::TestInfo::Run()
@ 0x1039149c0 testing::TestCase::Run()
@ 0x103919f07 testing::internal::UnitTestImpl::RunAllTests()
@ 0x103923a94 testing::internal::HandleExceptionsInMethodIfSupported<>()
@ 0x103919c19 testing::UnitTest::Run()
@ 0x1037cc669 main
@ 0x7fff8a3fb5fd start
make: *** [runtest] Abort trap: 6
(I'm assuming this is related to the freeze.) This may be related to the bus error I was seeing in the comment above, based on the fact that it's occurring during the same test. Still digging in to find the root of this problem. |
I'm testing this and fixing those bugs right now. After all the tests pass, I will diff this branch with yours. |
OK, but I don't quite understand why you want to replicate work I've already done. Is there a reason you don't want to start fresh from my branch? This seems to be the path of least resistance going forward, since my branch is already tested and working on a CUDA machine (it sounds like you don't have a CUDA machine to test with), and I've cherry-picked around your OpenCL commits to make a clean revision history without OpenCL code. All you'd need to do is take the additional abstractions you made and the on-the-fly device checking and commit those things to the new branch (or I could do that). This should be easy compared to examining a hundred calls to …
Either way, if you want to stick with this branch, can you please make me a collaborator on it so it's easier for me to contribute to this PR going forward? Like I said above, if we decide to switch to my branch, we can either do it as a fork in your repo with me as a collaborator, or I can fork into a repo under my account and make you a collaborator there. |
@robwhess, I have just made all the tests pass. As you said, some of the operations can only be performed on CPU or on GPU pointers. There are too many pitfalls around the exact state of each pointer and each operation; it is only manageable when the layers are aware of which mode they are running in. You should open a PR so that @shelhamer can set up a feature branch. Then the layer-wise mode awareness can be added by you or someone else. I will rebase and focus on the OpenCL stuff in this one. |
@kloudkl @robwhess going from your latest conversation I've promoted this to a feature branch at https://github.com/BVLC/caffe/tree/device-abstraction |
Heads-up that #555 might simplify the CPU / GPU split... or it might just lead to more rebasing. |
Thanks @shelhamer. @kloudkl, sorry, I should have realized your primary goal here was the OpenCL part. I'm going to continue working on the device abstraction in my own branch. I'll start by pulling in the additional abstractions @kloudkl made to layers, |
@robwhess thanks for working on the device abstraction. Note that #555 is planned for merge ahead of this, so …
As a side note, I am slightly worried by d7014d9. If the rebase went through with all the conflicts resolved properly, then there should be nothing left to fix up at the end. Although it is a comfort that you have the tests passing. |
@shelhamer d7014d9 was my fault. The conflicts were so tedious and time consuming (several hours) that I missed a few things and had to go back and fix them. I will wait until #555 is merged and then rebase to adapt |
Note the CPU / GPU split portion of this work is now being carried out at #610. |
(Branch force-pushed from 4278286 to c01f07a.)
Please also consider this news: ArrayFire is now under a BSD license. |
Months later, I think that distributed training is much more important than cross-device compatibility. Many businesses cannot wait two or more weeks to train a model, even if many models can be trained at the same time. There is very high demand for training a model on millions of samples or more in a single day, although it is still desirable to deploy the same model on multiple types of devices. |
Reading through this thread, and #408, is it not entirely unreasonable to assume that migrating Caffe to work with OpenCL would be a considerable amount of work, would potentially conflict with several CUDA optimizations, and would generally go 'against the flow' of what other Caffe contributors are looking to achieve? I think this is the case, and that's why I'm instead writing an OpenCL convolutional network library 'from the ground up' at https://github.com/hughperkins/ClConvolve, but I'm just touching base in case my current approach is a bit too 'not invented here'. |
@hughperkins Looks like an interesting project coming together. All: given that @kloudkl will no longer be contributing, is this PR still under review, or is this specific effort of bringing some abstraction / OpenCL support to the project ready to be closed? |
@momer: Thank-you momer :-) |
@hughperkins @momer this PR is still open to remind us about the effort to abstract devices. Until this effort is revived, it'll stay here as a placeholder and an example of one approach. Once the abstraction to host both CUDA and OpenCL implementations arrives, @hughperkins' layer implementations could be helpful! |
Closing as this has been carried on in #610 which will itself be replaced by a master edition for resurrection someday. |
This PR wraps the math functions as suggested by both Yangqing in #382 and jeffdonahue in #408 to abstract the device type from the algorithms.