Using OpenCL on Adreno & Mali GPUs is slower than CPU #5965
I'd test a solution fixing OpenCL for Android. Related: #5621 (comment)
The truth is that the OpenCL backend has been set aside in GGML; it doesn't even have a separate interface for implementing kernels like the other backends (CUDA, Metal, Vulkan) have, so it will be very difficult to improve. OpenCL only accelerates matrix multiplications and a few ops (adds and muls), while many more operations would need optimized kernels. I believe the best approach would be to improve the Vulkan backend and make it compatible with mobile Vulkan (Android devices).
You can add it at lines 1335 to 1337 and lines 2104 to 2106 in d894f35.
That is really great work. What would the situation be with Vulkan? Going forward, I would say Vulkan is arguably more worthy of the time investment than OpenCL on the mobile compute path.
Agreed.
The clblast "backend" is really more of an OpenCL BLAS library than a backend. At best it would improve the prompt processing rate, but not the decode rate, and that only if its performance issues were resolved. Using host_ptr should help, but memory bandwidth on mobile GPUs is still low, especially in the GPU-to-host direction, so it won't get super fast. MLC Chat (via TVM) uses an OpenCL backend on Android, and it has received a ton of optimization work from Qualcomm. They use various tricks, like using textures instead of buffers, changing the weights layout to avoid some ops, unrolling some loops differently, etc. There is a lot of "if Android, use these very special OpenCL kernels" kind of logic. But when all is said and done, it is only faster than the CPU path on the master branch, and already slower than the ARM Int8 matmul PR that will presumably be merged soon enough. Vulkan might be faster one day, but for now it is not really usable with Adreno due to driver/shader compatibility problems and a small maximum allocation size. Vulkan is also still about 10x slower than CUDA even on platforms where both are well supported (e.g., an RTX 3080).
The Adreno OpenCL drivers are known to be quite subpar overall (sadly), and they rely on some very peculiar extensions as well.
I don't think that would do anything by itself. You also need to map the buffer using clEnqueueMapBuffer() and then change any code that was previously copying the buffers to just use the mapped pointer instead. And that assumes the allocations are all done by OpenCL, which might not be the case since it is a "partial offload" backend. Handling zero-copy for other buffers would require using more QC/Android extensions (cl_qcom_dmabuf_host_ptr, cl_qcom_android_native_buffer_host_ptr, cl_qcom_android_ahardwarebuffer_host_ptr) and more Android-specific hacks elsewhere to use the suitable allocators. In theory you could also use SVM buffer sharing and atomics, generically, but that may get even more complex to do right. Search for 80-nb295-11_c.pdf.
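To make the change concrete, here is a minimal sketch (not llama.cpp's actual code) of replacing a clEnqueueReadBuffer() call with the map/unmap pattern described above; the function name `read_via_map` is hypothetical, and it only behaves as zero-copy if the buffer was created with CL_MEM_ALLOC_HOST_PTR:

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: code that previously did
 *   clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, nbytes, host_dst, 0, NULL, NULL);
 * instead maps the buffer and reads through the returned pointer.
 * Zero-copy only if buf was created with CL_MEM_ALLOC_HOST_PTR on a
 * unified-memory SoC. Error handling is minimal for brevity. */
static void read_via_map(cl_command_queue queue, cl_mem buf,
                         float *host_dst, size_t nbytes)
{
    cl_int err;
    void *p = clEnqueueMapBuffer(queue, buf, CL_TRUE /* blocking */,
                                 CL_MAP_READ, 0, nbytes,
                                 0, NULL, NULL, &err);
    if (err != CL_SUCCESS || p == NULL)
        return;
    /* Ideally the consumer uses p directly; the memcpy here only shows
     * drop-in equivalence with clEnqueueReadBuffer(). */
    memcpy(host_dst, p, nbytes);
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);
}
```

Eliminating the memcpy entirely (having downstream code consume the mapped pointer) is what actually removes the extra traffic; the mapping alone just makes that possible.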
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I am testing GPU offloading with llama.cpp. With CUDA, performance improved during GPU offloading as expected. With OpenCL, however, the more layers are offloaded to the GPU, the slower it becomes. The Qualcomm Adreno GPU and the Mali GPU I tested behaved similarly.
I looked at the OpenCL implementation in llama.cpp and figured out what the problem is: it copies data between host and GPU memory.
In most embedded SoCs, the host and GPU share the same physical memory. In this case, passing CL_MEM_ALLOC_HOST_PTR to clCreateBuffer helps solve this problem.
To demonstrate this, I wrote a simple OpenCL test program, shown below. The first sample is written similarly to the llama.cpp implementation.
(ref. https://developer.arm.com/documentation/100614/0314/Optimizing-OpenCL-for-Mali-GPUs/Optimizing-memory-allocation/Do-not-create-buffers-with-CL-MEM-USE-HOST-PTR-if-possible)
opencl_sample1.cc
(Result)
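The attached sample is not reproduced in this thread, but the copy-based pattern it describes looks roughly like this sketch (function name `run_copy_based` is hypothetical; kernel launches and error handling omitted):

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Copy-based pattern, similar to what the llama.cpp clblast path does:
 * allocate a device buffer, then explicitly copy host data in and out.
 * On a unified-memory SoC both copies are redundant, and the host-side
 * staging buffers roughly double memory consumption. */
static void run_copy_based(cl_context ctx, cl_command_queue queue, size_t n)
{
    size_t nbytes = n * sizeof(float);
    float *host_src = malloc(nbytes);
    float *host_dst = malloc(nbytes);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, NULL);

    /* Host -> device copy */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, nbytes, host_src,
                         0, NULL, NULL);

    /* ... enqueue kernels operating on buf ... */

    /* Device -> host copy */
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, nbytes, host_dst,
                        0, NULL, NULL);

    clReleaseMemObject(buf);
    free(host_src);
    free(host_dst);
}
```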
The second sample used CL_MEM_ALLOC_HOST_PTR.
(ref. https://developer.arm.com/documentation/100614/0314/Optimizing-OpenCL-for-Mali-GPUs/Optimizing-memory-allocation/Use-CL-MEM-ALLOC-HOST-PTR-to-avoid-copying-memory)
opencl_sample2.cc (use CL_MEM_ALLOC_HOST_PTR)
(Result)
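Again the attachment itself is not inlined here, but the zero-copy pattern it uses can be sketched as follows (function name `run_zero_copy` is hypothetical; kernel launches and error handling omitted):

```c
#include <CL/cl.h>

/* Zero-copy pattern for unified-memory SoCs: let the driver allocate
 * host-visible memory (CL_MEM_ALLOC_HOST_PTR) and access it through
 * clEnqueueMapBuffer() instead of copying. */
static void run_zero_copy(cl_context ctx, cl_command_queue queue, size_t n)
{
    size_t nbytes = n * sizeof(float);
    cl_mem buf = clCreateBuffer(ctx,
                                CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                nbytes, NULL, NULL);

    /* Map for writing: fill input data through the mapped pointer. */
    float *p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                  0, nbytes, 0, NULL, NULL, NULL);
    for (size_t i = 0; i < n; i++)
        p[i] = (float)i;
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    /* ... enqueue kernels operating on buf ... */

    /* Map for reading: consume results without a device-to-host copy. */
    p = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                           0, nbytes, 0, NULL, NULL, NULL);
    /* ... use p directly ... */
    clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);

    clReleaseMemObject(buf);
}
```

On drivers where CL_MEM_ALLOC_HOST_PTR memory is truly shared (as the ARM documentation linked above describes for Mali), the map calls are essentially free, which is consistent with the 1.22 s vs. 0.09 s results reported here.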
The first takes 1.22 seconds and the second takes 0.09 seconds, a difference of about 13.6x. Another problem with the first example is that the same amount of memory as on the GPU is also allocated on the CPU side, which nearly doubles memory consumption.
I wanted to fix the OpenCL implementation in llama.cpp myself, but I could not figure out how. Is it possible to change the memory allocation method to improve OpenCL performance? If so, please let me know how I can fix it.