
Using OpenCL on Adreno & Mali GPUs is slower than CPU #5965

Closed · yukyon opened this issue Mar 9, 2024 · 8 comments

yukyon commented Mar 9, 2024

I am testing GPU offloading with llama.cpp. With CUDA, performance improves during GPU offloading, as expected. With OpenCL, however, the more work is offloaded to the GPU, the slower it gets. The Qualcomm Adreno and Mali GPUs I tested behaved similarly.
[benchmark screenshot]

I looked at the OpenCL implementation in llama.cpp and found the cause: it copies data between host and GPU memory.

In most embedded SoCs, the host and GPU share the same physical memory. In that case, passing CL_MEM_ALLOC_HOST_PTR to clCreateBuffer avoids the copies.

To demonstrate this, I wrote the two simple OpenCL test programs below. The first one is written similarly to the llama.cpp implementation.

(ref. https://developer.arm.com/documentation/100614/0314/Optimizing-OpenCL-for-Mali-GPUs/Optimizing-memory-allocation/Do-not-create-buffers-with-CL-MEM-USE-HOST-PTR-if-possible)
opencl_sample1.cc

#include <CL/cl.h>
#include <iostream>
#include <cstdio>
#include <time.h>

// #define USE_BREAK_POINT

#define VECTOR_SIZE (1024 * 1024)

const char* programSource =
    "__kernel void vectorAdd(__global const float* a, __global const float* b, __global float* c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main() {
    struct timespec start, end;
    double elapsedTime;

    clock_gettime(CLOCK_MONOTONIC, &start);

    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_mem memA, memB, memC;

    // Get platform and device
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // Create context
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // Create command queue
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // Create program
    program = clCreateProgramWithSource(context, 1, (const char**)&programSource, NULL, NULL);

    // Build program
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Create kernel
    kernel = clCreateKernel(program, "vectorAdd", NULL);

    // Create memory buffers
    memA = clCreateBuffer(context, CL_MEM_READ_ONLY, VECTOR_SIZE * sizeof(float), NULL, NULL);
    memB = clCreateBuffer(context, CL_MEM_READ_ONLY, VECTOR_SIZE * sizeof(float), NULL, NULL);
    memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, VECTOR_SIZE * sizeof(float), NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "created 3 gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Set kernel arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&memA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&memB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&memC);

    // Allocate and initialize host-side input data
    float* hostA = new float[VECTOR_SIZE];
    float* hostB = new float[VECTOR_SIZE];
    for (int i = 0; i < VECTOR_SIZE; i++) {
        hostA[i] = i;
        hostB[i] = VECTOR_SIZE - i;
    }
    // Enqueue write buffer
    // This is where the input data is supplied: host-side arrays are
    // initialized and then copied into memA and memB in GPU memory, so host
    // memory usage grows and the extra copies cost time.
    clEnqueueWriteBuffer(queue, memA, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), hostA, 0, NULL, NULL);
    clEnqueueWriteBuffer(queue, memB, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), hostB, 0, NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "write data to 2 input gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Set work size and enqueue kernel
    size_t globalSize = VECTOR_SIZE;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "executed opencl vectorAdd kernel. waiting any key...\n";
    getchar(); //stop
#endif

    // Enqueue read buffer
    // To retrieve the computed results from GPU memory, another array is
    // allocated on the host side; clEnqueueReadBuffer then copies from the
    // GPU buffer memC into the host array hostC.
    float* hostC = new float[VECTOR_SIZE];
    clEnqueueReadBuffer(queue, memC, CL_TRUE, 0, VECTOR_SIZE * sizeof(float), hostC, 0, NULL, NULL);

    // Print result
    for (int i = 0; i < 10; i++) {
        std::cout << hostC[i] << " ";
    }
    std::cout << std::endl;

#ifdef USE_BREAK_POINT
    std::cout<< "read result from gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Clean up
    clReleaseMemObject(memA);
    clReleaseMemObject(memB);
    clReleaseMemObject(memC);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    delete[] hostA;
    delete[] hostB;
    delete[] hostC;

#ifdef USE_BREAK_POINT
    std::cout<< "released all cpu/gpu data. waiting any key...\n";
    getchar(); //stop
#endif

    clock_gettime(CLOCK_MONOTONIC, &end); //end

    elapsedTime = (end.tv_sec - start.tv_sec);
    elapsedTime += (end.tv_nsec - start.tv_nsec) / 1000000000.0;

    std::cout << "Execusion time: " << elapsedTime << " seconds\n";

    return 0;
}

(Result)

# ./opencl_sample1
1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06
Execution time: 1.21561 seconds

The second sample used CL_MEM_ALLOC_HOST_PTR.

(ref. https://developer.arm.com/documentation/100614/0314/Optimizing-OpenCL-for-Mali-GPUs/Optimizing-memory-allocation/Use-CL-MEM-ALLOC-HOST-PTR-to-avoid-copying-memory)
opencl_sample2.cc (use CL_MEM_ALLOC_HOST_PTR)

#include <CL/cl.h>
#include <iostream>
#include <cstdio>
#include <time.h>

// #define USE_BREAK_POINT

#define VECTOR_SIZE (1024 * 1024)

const char* programSource =
    "__kernel void vectorAdd(__global const float* a, __global const float* b, __global float* c) {"
    "    int i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main() {
    struct timespec start, end;
    double elapsedTime;

    clock_gettime(CLOCK_MONOTONIC, &start);

    cl_platform_id platform;
    cl_device_id device;
    cl_context context;
    cl_command_queue queue;
    cl_program program;
    cl_kernel kernel;
    cl_mem memA, memB, memC;

    // Get platform and device
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // Create context
    context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

    // Create command queue
    queue = clCreateCommandQueue(context, device, 0, NULL);

    // Create program
    program = clCreateProgramWithSource(context, 1, (const char**)&programSource, NULL, NULL);

    // Build program
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Create kernel
    kernel = clCreateKernel(program, "vectorAdd", NULL);

    // Create memory buffers
    // When calling clCreateBuffer, add the CL_MEM_ALLOC_HOST_PTR flag.
    // memA, memB, and memC are still cl_mem buffers, but the driver can now
    // place them in memory that is accessible to both the host and the GPU.
    memA = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, VECTOR_SIZE * sizeof(float), NULL, NULL);
    memB = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, VECTOR_SIZE * sizeof(float), NULL, NULL);
    memC = clCreateBuffer(context, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, VECTOR_SIZE * sizeof(float), NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "created 3 gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Set kernel arguments
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void*)&memA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void*)&memB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), (void*)&memC);

    // Fill the input buffers
    // No separate host-side arrays are allocated here. Instead,
    // clEnqueueMapBuffer returns a pointer through which the buffer's
    // memory can be accessed directly.
    float* hostA = (float*)clEnqueueMapBuffer(queue, memA, CL_TRUE, CL_MAP_WRITE, 0, VECTOR_SIZE * sizeof(float), 0, NULL, NULL, NULL);
    float* hostB = (float*)clEnqueueMapBuffer(queue, memB, CL_TRUE, CL_MAP_WRITE, 0, VECTOR_SIZE * sizeof(float), 0, NULL, NULL, NULL);
    // Change the contents of the buffer using the shared memory pointer received.
    for (int i = 0; i < VECTOR_SIZE; i++) {
        hostA[i] = i;
        hostB[i] = VECTOR_SIZE - i;
    }
    // When you are finished using it, you must release the mapping.
    clEnqueueUnmapMemObject(queue, memA, hostA, 0, NULL, NULL);
    clEnqueueUnmapMemObject(queue, memB, hostB, 0, NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "write data to 2 input gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Set work size and enqueue kernel
    size_t globalSize = VECTOR_SIZE;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);

#ifdef USE_BREAK_POINT
    std::cout<< "executed opencl vectorAdd kernel. waiting any key...\n";
    getchar(); //stop
#endif

    // Map the result buffer
    // Read the computed results directly from memC through a mapped pointer;
    // no copy to a separate host array takes place.
    float* hostC = (float*)clEnqueueMapBuffer(queue, memC, CL_TRUE, CL_MAP_READ, 0, VECTOR_SIZE * sizeof(float), 0, NULL, NULL, NULL);

    // Print result
    for (int i = 0; i < 10; i++) {
        std::cout << hostC[i] << " ";
    }
    std::cout << std::endl;

#ifdef USE_BREAK_POINT
    std::cout<< "read result from gpu buffer. waiting any key...\n";
    getchar(); //stop
#endif

    // Release the result mapping before releasing the buffer.
    clEnqueueUnmapMemObject(queue, memC, hostC, 0, NULL, NULL);

    // Clean up
    clReleaseMemObject(memA);
    clReleaseMemObject(memB);
    clReleaseMemObject(memC);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

#ifdef USE_BREAK_POINT
    std::cout<< "released all cpu/gpu data. waiting any key...\n";
    getchar(); //stop
#endif

    clock_gettime(CLOCK_MONOTONIC, &end); //end

    elapsedTime = (end.tv_sec - start.tv_sec);
    elapsedTime += (end.tv_nsec - start.tv_nsec) / 1000000000.0;

    std::cout << "Execusion time: " << elapsedTime << " seconds\n";

    return 0;
}

(Result)

# ./opencl_sample2
1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06 1.04858e+06
Execution time: 0.085987 seconds

The first takes 1.22 seconds and the second 0.09 seconds, a difference of about 13.6x in speed. Another problem with the first approach is that the host allocates a copy of every GPU buffer, nearly doubling memory consumption.

I wanted to fix the OpenCL implementation in llama.cpp myself, but I couldn't figure out how. Is it possible to change the memory allocation method to improve OpenCL performance?

If so, please let me know how I can fix it.


Jeximo commented Mar 9, 2024

I'd test a solution that fixes OpenCL for Android. Related: #5621 (comment)


FSSRepo commented Mar 9, 2024

The truth is that the OpenCL backend has been set aside in GGML; it doesn't even have a separate interface for implementing kernels like the other backends (CUDA, Metal, Vulkan), so it will be very difficult to improve. OpenCL only accelerates matrix multiplications and a few ops (add and mul); many more operations still need optimized kernels.

I believe the best approach would be to improve the Vulkan backend and make it compatible with mobile Vulkan (Android devices).

> I wanted to fix the OpenCL implementation in llama.cpp myself, but I couldn't figure out how. Is it possible to change the memory allocation method to improve OpenCL performance?

You can add CL_MEM_ALLOC_HOST_PTR here and test whether there is some improvement:

llama.cpp/ggml-opencl.cpp

Lines 1335 to 1337 in d894f35

cl_mem mem;
CL_CHECK((mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err), err));
*actual_size = size;

llama.cpp/ggml-opencl.cpp

Lines 2104 to 2106 in d894f35

cl_int err;
cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
if (err != CL_SUCCESS) {
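For illustration, here is a minimal sketch of that change at the quoted allocation sites, assuming the surrounding ggml-opencl.cpp code stays as quoted; only the flags argument differs:

// Hypothetical variant of the quoted allocation: request a host-accessible
// allocation so that, on SoCs with unified memory, the driver can place the
// buffer where both the CPU and the GPU can reach it without a copy.
cl_int err;
cl_mem mem = clCreateBuffer(context,
                            CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                            size, NULL, &err);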

@pure-water

That is really great work. What would be the case with Vulkan? Going forward, Vulkan might arguably be more worthy of the time investment than OpenCL on the mobile compute path.

@pure-water

> I believe the best approach would be to improve the Vulkan backend and make it compatible with mobile Vulkan (Android devices).

Agreed.

@akingoverlook

The clblast "backend" is really more of an OpenCL BLAS library than a backend. At best it would improve the prompt processing rate, not the decode rate, and that is assuming its performance issues were resolved. Using host_ptr should help, but memory bandwidth on mobile GPUs is still low, especially in the GPU-to-host direction.

It won't get super fast. MLC Chat (via TVM) uses an OpenCL backend on Android, and it has received a ton of optimization work from Qualcomm. They use various tricks: textures instead of buffers, a different weights layout to avoid some ops, loops unrolled differently, and so on. There is a bunch of "if Android, use these very special kernels with OpenCL" logic.

But when all is said and done, it is merely faster than the CPU on the master branch, and already slower than the ARM Int8 matmul PR that will presumably be merged soon enough.

Vulkan might be faster one day, but for now it is not really usable with Adreno due to driver/shader compatibility problems and a small maximum allocation size. Also, Vulkan is still 10x slower than CUDA even on platforms where both are well supported (e.g., an RTX 3080).


woachk commented Mar 22, 2024

The Adreno OpenCL drivers are sadly known to be quite subpar overall, with some very peculiar extensions in use, including cl_qcom_ml_ops. I had a much better experience with the Mali OpenCL stack.

@akingoverlook

> The truth is that the OpenCL backend has been set aside in GGML; it doesn't even have a separate interface for implementing kernels like the other backends (CUDA, Metal, Vulkan), so it will be very difficult to improve. OpenCL only accelerates matrix multiplications and a few ops (add and mul); many more operations still need optimized kernels.
>
> I believe the best approach would be to improve the Vulkan backend and make it compatible with mobile Vulkan (Android devices).
>
> > I wanted to fix the OpenCL implementation in llama.cpp myself, but I couldn't figure out how. Is it possible to change the memory allocation method to improve OpenCL performance?
>
> You can add CL_MEM_ALLOC_HOST_PTR here and test whether there is some improvement:
>
> llama.cpp/ggml-opencl.cpp
>
> Lines 1335 to 1337 in d894f35
>
> cl_mem mem;
> CL_CHECK((mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err), err));
> *actual_size = size;
>
> llama.cpp/ggml-opencl.cpp
>
> Lines 2104 to 2106 in d894f35
>
> cl_int err;
> cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE, size, NULL, &err);
> if (err != CL_SUCCESS) {

I don't think that would do anything by itself. You also need to map the buffer using clEnqueueMapBuffer() and then change any code that was previously copying the buffers to just use the mapped pointer instead, as in the sketch below.
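For illustration, a minimal sketch of that pattern; 'queue', 'mem', and 'n' are assumed from the surrounding code, and produce_value() is a placeholder for wherever the data actually comes from:

// Sketch: zero-copy write into a buffer created with CL_MEM_ALLOC_HOST_PTR.
cl_int err;
float* ptr = (float*)clEnqueueMapBuffer(queue, mem, CL_TRUE, CL_MAP_WRITE,
                                        0, n * sizeof(float), 0, NULL, NULL, &err);
if (err == CL_SUCCESS) {
    for (size_t i = 0; i < n; i++) {
        ptr[i] = produce_value(i); // fill in place instead of clEnqueueWriteBuffer
    }
    clEnqueueUnmapMemObject(queue, mem, ptr, 0, NULL, NULL);
}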

And that assumes the allocations are all done by OpenCL, which might not be the case since this is a "partial offload" backend. Handling zero-copy for the other buffers would require more QC/Android extensions (cl_qcom_dmabuf_host_ptr, cl_qcom_android_native_buffer_host_ptr, cl_qcom_android_ahardwarebuffer_host_ptr) and more Android-specific hacks elsewhere to use the suitable allocators.

In theory, you could also use SVM buffer sharing and atomics generically. That may get even more complex to do right.
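As a rough sketch of the coarse-grained SVM variant (OpenCL 2.0); 'context', 'queue', and 'kernel' are assumed from the samples above, and the size is illustrative:

// Coarse-grained SVM: a single allocation visible to both host and device.
size_t n = 1024 * 1024;
float* buf = (float*)clSVMAlloc(context, CL_MEM_READ_WRITE, n * sizeof(float), 0);

// Host access to coarse-grained SVM must be bracketed by map/unmap.
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, buf, n * sizeof(float), 0, NULL, NULL);
for (size_t i = 0; i < n; i++) buf[i] = (float)i;
clEnqueueSVMUnmap(queue, buf, 0, NULL, NULL);

// Pass the SVM pointer directly to the kernel; no cl_mem object is needed.
clSetKernelArgSVMPointer(kernel, 0, buf);

// ... enqueue the kernel, wait for completion, then free.
clSVMFree(context, buf);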

Search for 80-nb295-11_c.pdf.


This issue was closed because it has been inactive for 14 days since being marked as stale.
