Intel Beignet spatial convolution OpenCL compile failure #39
@gongzg
@naibaf7 The error message indicates this is an LLVM-related issue. I would suggest switching to LLVM 3.6 and trying again. If you still have any issues, please let me know.
@naibaf7 Another quick thing to try: open the file backend/src/llvm/llvm_to_gen.cpp, find the following code, and comment out the line "MPM.add(createCustomLoopUnrollPass());". Then try again with your current LLVM version. But this is not recommended; I doubt whether beignet has been tested with this LLVM version and don't know whether there are other issues. Anyway, for your reference, the pass is registered under #if !defined(ANDROID).
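For reference, a minimal sketch of that edit, assuming the pass registration sits inside the !defined(ANDROID) guard as mentioned above (the surrounding lines are illustrative, not the exact file contents):

```cpp
// backend/src/llvm/llvm_to_gen.cpp -- sketch only; surrounding code is illustrative
#if !defined(ANDROID)
  // Commenting out beignet's custom loop-unroll pass can work around
  // LLVM-version-specific compile failures, at the cost of lost unrolling.
  // MPM.add(createCustomLoopUnrollPass());
#endif
```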
@gongzg
@gongzg
Command:
Results (average forward pass time):
@naibaf7 Did you check the per-layer performance breakdown? I used to see very bad GEMM performance with either the ISAAC or ViennaCL BLAS, and most of the time is spent in the convolution backward path and the FC layers. I also noticed that you specified GPU device 1; do you have more than one OpenCL device in your system?
@gongzg
@naibaf7 Could you share the average forward time and backward time? The backward time is really slow. For example, on my BDW GT2 machine, what I got from benchmark64.prototxt is: I believe the libDNN engine should be much faster at the backward pass.
@naibaf7 I just did a test on a BDW GT3e machine and got the following performance numbers with the spatial convolution engine:
@gongzg
Intel spatial:
The numbers are vastly different from yours, so I believe there must be something wrong.
@naibaf7 Oh, definitely not. Your SKL machine should be much faster than my GT2 machine, and should be comparable with the GT3e machine or even faster. From the log you pasted above, I highly doubt whether you were really using the spatial engine. You can easily uncomment the following code in the spatial convolution source code (see the sketch below). Then please remove .spatialkernels/* and re-run the benchmark. It will show the tuning process and print the GFLOPS for each tuned kernel and the final winner kernel.
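(The exact code to uncomment is not quoted here; as a purely hypothetical illustration, the verbose switch in the spatial convolution source might take a form like this, with the macro names being placeholders:)

```cpp
// Hypothetical illustration only -- the actual code to uncomment lives in the
// spatial convolution source and is not quoted in this thread.
// Uncommenting a debug switch of roughly this form makes the auto-tuner print
// each candidate kernel's GFLOPS and the final winner kernel:
// #define dbg
#ifdef dbg
#define dbgPrint(x) (x)
#else
#define dbgPrint(x)
#endif
```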
@gongzg
@naibaf7 Thanks for the log. Now I know the reason: "Verification was not successful, fallback to basic kernel". Beignet is broken on your system: it cannot produce correct results with the optimized spatial kernel, so it falls back to the naive basic kernel. That's why you get bad performance numbers. We may need the beignet team's support again to find out why your beignet is broken.
@gongzg
@naibaf7 See the devices list https://cgit.freedesktop.org/beignet/tree/src/cl_device_id.c
Intel extensions in beignet are in https://cgit.freedesktop.org/beignet/tree/include/CL/cl_intel.h
@bhack
I think the problem could be with the libdrm and kernel versions. Which versions of both are you using?
Kernel: 4.6.4-301.fc24.x86_64
Mhh.. Can you add a print of fixed_local_sz[i] inside the loop and before the modulo at https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3031?
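Something along these lines, assuming the loop runs over the work dimensions (the loop body is paraphrased, not the exact beignet source; only the fprintf line is the suggested addition):

```cpp
// Paraphrased sketch around cl_api.c#n3031 in beignet's clEnqueueNDRangeKernel;
// add #include <stdio.h> if it is not already available in that file.
for (uint32_t i = 0; i < work_dim; ++i) {
  fprintf(stderr, "fixed_local_sz[%u] = %zu\n", i, (size_t)fixed_local_sz[i]);
  if (fixed_global_sz[i] % fixed_local_sz[i] != 0) {
    /* ... existing handling when the sizes do not divide evenly ... */
  }
}
```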
@gongzg
I cleaned out the .spatialkernels folder for every test, but got the same result. The driver is xorg-x11-drv-intel-2.99.917-23.20160512.fc24.x86_64, by the way. Could that be the issue?
Have you tried to debug/print that loop?
I don't know if this Beignet Workgroup guide is still valid.
@bhack
It is important to check if
@bhack For AlexNet, the Intel spatial convolution kernel always uses a 1,1,16 group size, which is valid for beignet.
I0725 04:35:59.524305 32761 caffe.cpp:448] Average time per layer:
The clinfo: Platform Name Intel Gen OCL Driver
Kernel information:
So it seems that beignet works fine with some SKL platforms under the above configurations. I will work with the beignet team to try to reproduce your environment and issues.
@naibaf7 Could you share the latest clinfo of your machine here? In the clinfo (clinfo_after) you sent me last week, there was one Clover device and one Intel CPU device.
@gongzg How can it enter https://cgit.freedesktop.org/beignet/tree/src/cl_api.c#n3036 if local_work_size is not NULL?
@bhack Those output messages should not come from the spatial convolution kernels; they should come from some other kernels. The spatial convolution kernels don't use a NULL local work size.
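For context, the code @bhack pointed at is only reached for NULL local sizes; paraphrasing the control flow in clEnqueueNDRangeKernel (an outline, not the exact beignet source):

```cpp
// Paraphrased outline of the two branches discussed above (cl_api.c); not the
// exact beignet source.
if (local_work_size == NULL) {
  // The driver computes fixed_local_sz[i] itself -- the path behind the line
  // @bhack linked to -- and this is only taken by kernels enqueued with a
  // NULL local work size, i.e. not the spatial convolution kernels.
} else {
  // The caller-provided local_work_size is validated and used directly,
  // which is what the spatial convolution kernels do (e.g. the 1,1,16 group size).
}
```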
OK, so probably this message was generated by the autotuning code. Where is "Verification was not successful, fallback to basic kernel" in the code?
@bhack This warning message is in Caffe's spatial convolution file, in the function void ConvolutionLayerSpatial::setup_convolution().
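Roughly, the fallback works like this (a paraphrased sketch based only on the function name and warning text quoted in this thread; the helper names and the logging call are illustrative):

```cpp
// Paraphrased sketch of ConvolutionLayerSpatial::setup_convolution(); only the
// function name and the warning string come from this thread.
bool verified = verify_result();  // run the tuned kernel, compare against a reference
if (!verified) {
  // The optimized spatial kernel produced wrong results on this driver, so the
  // layer falls back to the slow but safe basic kernel.
  LOG(WARNING) << "Verification was not successful, fallback to basic kernel";
  use_basic_kernel = true;
}
```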
@gongzg
@gongzg
assumes that the CLANG libraries will be found in the same path as LLVM, which is often not the case, so I wonder why it isn't also searching the default library folders as a secondary search path (/usr/lib or /usr/lib64, depending on whether the system uses /usr/lib32, and possibly also /usr/lib64/clang).
@gongzg
I only found these flags in the OpenCL 2.1 specifications. Is that the new requirement?
@gongzg But as I see it, your kernels will fall back to the default method for backward passes, is that right? Do you intend to also develop a few backward kernels, or should we use one of the libDNN kernels for backward passes when we integrate the spatial and GEMM-like kernels into libDNN? They're not (yet) optimized for Intel chips, but they are a lot faster than the clBLAS/ViennaCL/CLBlast serialized batch fallbacks (which currently take 11201 ms). Another point I want to discuss with you is the parameter space of these implementations. We need to figure out and implement metadata describing which kernels can handle what kinds of parameters (2D, 3D, ..., ND, dilation, stride).
@naibaf7 Currently, I don't have a short-term plan to further optimize the backward path. But I have been optimizing GEMM/GEMV performance for a while, and since the backward path depends on the GEMM/GEMV implementation, it could get some benefit. I have a much better internal implementation, and with these internal GEMM/GEMV kernels my performance numbers are as below: I0920 03:38:40.866559 26513 caffe.cpp:453] data forward: 0.0476741 ms. I am using your benchmark128.prototxt. And please note that my test machine is just a SKL GT2 machine, which has 24 EUs. I have already started the process of contributing the internal GEMM/GEMV implementation to ISAAC. Before that, I think you may try using libdnn to handle the convolution backward path and compare with my current performance numbers to see whether it's still worth switching to libdnn for the backward path. As for the metadata part, I will add some comments to the .cl files later to specify which parameters are supported by these implementations (roughly along the lines sketched below).
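A rough sketch of what such a metadata header at the top of a .cl file could look like (the fields and supported values below are placeholders, not the actual capabilities of any kernel):

```cpp
/* Placeholder metadata header for a convolution kernel .cl file; the fields
 * and values are illustrative only.
 *
 *   dimensions: 2D           (3D / ND: not supported)
 *   dilation:   1 only
 *   stride:     1..4
 *   pad:        arbitrary
 *   batch size: arbitrary
 */
```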
@naibaf7 Do you still have issues with beignet? If so, could you let me know the details?
@gongzg But with beignet I still see the issue described above:
@naibaf7 That's a warning message and should not cause a fatal error. You can disable these warning messages by building a release version of beignet.
@gongzg
Totally stuck here; I've tried to find the cause for hours. Any ideas?
./build/tools/caffe time -model models/bvlc_alexnet/benchmark64.prototxt -gpu=0 -iterations=5