
Anyone got CLBLAST working on Intel macOS with AMD GPU? Is it meant to work? #1429

Closed
TheBloke opened this issue May 13, 2023 · 19 comments

@TheBloke
Contributor

Hi all

I just learned about CLBLAST so wanted to try it at home on my Intel macOS system with AMD 6900XT GPU.

I have no idea if it's meant to work on this system or with AMD GPUs. Maybe it's only designed for NVIDIA on Linux or Windows at the moment? But I figured that since it uses OpenCL, it should work with any GPU? Maybe? :)

Installing CLBLAST is easy:

tomj@Eddie ~/src/llama.cpp (master●●)$ brew install clblast
==> Downloading https://formulae.brew.sh/api/formula.jws.json
....
==> Pouring clblast--1.5.3_1.ventura.bottle.tar.gz
🍺  /usr/local/Cellar/clblast/1.5.3_1: 41 files, 11.6MB
==> Running `brew cleanup clblast`...
Disable this behaviour by setting HOMEBREW_NO_INSTALL_CLEANUP.
Hide these hints with HOMEBREW_NO_ENV_HINTS (see `man brew`).

Compiling went fine:

tomj@Eddie ~/src/llama.cpp (master●●)$ make clean && LLAMA_CLBLAST=1 make
......
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c llama.cpp -o llama.o
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native -c examples/common.cpp -o common.o
examples/common.cpp:750:24: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
        if (input_char == WEOF || input_char == 0x04 /* Ctrl+D*/) {
            ~~~~~~~~~~ ^  ~~~~
examples/common.cpp:765:45: warning: comparison of integers of different signs: 'char32_t' and '__darwin_wint_t' (aka 'int') [-Wsign-compare]
                while ((code = getchar32()) != WEOF) {
                        ~~~~~~~~~~~~~~~~~~  ^  ~~~~
2 warnings generated.
cc -I.              -O3 -std=c11   -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wdouble-promotion -Wshadow -Wstrict-prototypes -Wpointer-arith -pthread -march=native -mtune=native -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -c ggml-opencl.c -o ggml-opencl.o
In file included from ggml-opencl.c:4:
/usr/local/include/clblast_c.h:1686:47: warning: a function declaration without a prototype is deprecated in all versions of C [-Wstrict-prototypes]
CLBlastStatusCode PUBLIC_API CLBlastClearCache();
                                              ^
                                               void
1 warning generated.
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/main/main.cpp ggml.o llama.o common.o ggml-opencl.o -o main  -framework Accelerate -lclblast -framework OpenCL

====  Run ./main -h for help.  ====

c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize/quantize.cpp ggml.o llama.o ggml-opencl.o -o quantize  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/quantize-stats/quantize-stats.cpp ggml.o llama.o ggml-opencl.o -o quantize-stats  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/perplexity/perplexity.cpp ggml.o llama.o common.o ggml-opencl.o -o perplexity  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native examples/embedding/embedding.cpp ggml.o llama.o common.o ggml-opencl.o -o embedding  -framework Accelerate -lclblast -framework OpenCL
c++ -I. -I./examples -O3 -std=c++11 -fPIC -DNDEBUG -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -Wno-multichar -pthread -march=native -mtune=native pocs/vdot/vdot.cpp ggml.o ggml-opencl.o -o vdot  -framework Accelerate -lclblast -framework OpenCL
tomj@Eddie ~/src/llama.cpp (master●●●)$

On my first attempt it picked the wrong device:

tomj@Eddie ~/src/llama.cpp (master●●)$ ./main -t 16 -m ~/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin -n 512 -p "### Instruction: write a story about llamas\n### Response:"
main: build = 540 (f048af0)
main: seed  = 1683982897
llama.cpp: loading model from /Users/tomj/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 11359.04 MB (+ 1608.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=0 (If invalid, program will crash)
Using Platform: Apple Device: Intel(R) Core(TM) i9-10980XE CPU @ 3.00GHz
OpenCL clCreateCommandQueue error -30 at ggml-opencl.c:215

My CPU, not GPU.

So I edited ggml-opencl.c and changed this line to default to device 1:

  int dev_num = (GGML_CLBLAST_DEVICE == NULL ? 1 : atoi(GGML_CLBLAST_DEVICE));
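Side note: judging by that line, the device index is read from a GGML_CLBLAST_DEVICE environment variable when one is set, so something like GGML_CLBLAST_DEVICE=1 ./main ... should select device 1 without patching the source. To find out which index each device has, a tiny standalone enumerator helps; a minimal sketch (hypothetical file name) using stock OpenCL 1.2 calls:

/* list_cl_devices.c -- print every OpenCL platform/device index and name.
 * Build on macOS: cc list_cl_devices.c -framework OpenCL */
#include <stdio.h>
#include <OpenCL/opencl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);
    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devices[8];
        cl_uint ndev = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);
        for (cl_uint d = 0; d < ndev; d++) {
            char name[256] = {0};
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            printf("Platform %u, Device %u: %s\n", p, d, name);
        }
    }
    return 0;
}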

Now it tries to use my GPU, but still fails with exactly the same error:

tomj@Eddie ~/src/llama.cpp (master●●●)$ ./main -t 16 -m ~/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin -n 512 -p "### Instruction: write a story about llamas\n### Response:"
main: build = 540 (f048af0)
main: seed  = 1683983030
llama.cpp: loading model from /Users/tomj/src/huggingface/Wizard-Vicuna-13B-Uncensored-GGML/Wizard-Vicuna-13B-Uncensored.ggml.q5_1.bin
llama_model_load_internal: format     = ggjt v2 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 9 (mostly Q5_1)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  85.08 KB
llama_model_load_internal: mem required  = 11359.04 MB (+ 1608.00 MB per state)

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=1 (If invalid, program will crash)
Using Platform: Apple Device: AMD Radeon RX 6900 XT Compute Engine
OpenCL clCreateCommandQueue error -30 at ggml-opencl.c:215

I've never used CLBlast before, so I have no clue what this error means or what might be wrong.

Any help or advice would be appreciated!

@swittk
Contributor

swittk commented May 13, 2023

Personally I just change the properties argument of clCreateCommandQueue in ggml-opencl.c to pass no flags.

queue = clCreateCommandQueue(context, device, 0, &err);

And it should compile and run fine!
(macOS OpenCL doesn't support CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.)
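For what it's worth, error -30 is CL_INVALID_VALUE in cl.h, which is consistent with the properties argument being rejected. A slightly more defensive variant of the same workaround, sketched here as an illustration rather than what llama.cpp ships, is to ask the device what it supports first:

/* sketch: request out-of-order execution only if the device reports support */
cl_command_queue_properties supported = 0;
clGetDeviceInfo(device, CL_DEVICE_QUEUE_PROPERTIES, sizeof(supported), &supported, NULL);
queue = clCreateCommandQueue(context, device,
    (supported & CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE) ? CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE : 0,
    &err);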

@TheBloke
Contributor Author

Thanks for the fast reply! That certainly got it running. But..

Initializing CLBlast (First Run)...
Attempting to use: Platform=0, Device=1 (If invalid, program will crash)
Using Platform: Apple Device: AMD Radeon RX 6900 XT Compute Engine
<program source>:3:131: error: fields must have a constant size: 'variable length array in structure' extension will never be supported
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; constant uint QK4_0 = 32; struct block_q4_0 { float d; uint8_t qs[QK4_0 / 2]; }; constant uint QK4_1 = 32; struct block_q4_1 { float d; float m; uint8_t qs[QK4_1 / 2]; }; constant uint QK5_0 = 32; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[QK5_0 / 2]; }; constant uint QK5_1 = 32; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[QK5_1 / 2]; }; constant uint QK8_0 = 32; struct block_q8_0 { float d; uint8_t qs[QK8_0]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { constant uint qk = QK4_0; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { constant uint qk = QK4_1; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { constant uint qk = QK5_0; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { constant uint qk = QK5_1; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { constant uint qk = QK8_0; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }

Hmm :)

Have you got it running OK on macOS with AMD GPU?
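For context: the compiler is pointing at the struct fields sized with QK4_0 / 2 and friends. QK4_0 is declared as a constant uint program-scope variable rather than a preprocessor macro, and Apple's OpenCL compiler doesn't treat such a variable as a compile-time constant, so the array fields look variable-length to it. A hypothetical workaround, sketched here only to illustrate the mechanism (the fix that actually landed is discussed further down the thread), is to make the sizes literal at preprocessing time:

/* sketch: sizes as preprocessor constants, so array bounds are literals */
#define QK4_0 32
struct block_q4_0 { float d; uint8_t qs[QK4_0 / 2]; };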

@SlyEcho
Collaborator

SlyEcho commented May 13, 2023

I’ll try to fix it.

@TheBloke
Contributor Author

Thank you very much!

@SlyEcho
Collaborator

SlyEcho commented May 13, 2023

I'll try to rewrite the CL platform and device selection, too.

@skidd-level-100

Hey, a friend of mine has the same issue on AMD but with -6 instead of -30; the latest pull does not work!

@swittk
Contributor

swittk commented May 14, 2023

@skidd-level-100 -6 error is for CL_OUT_OF_HOST_MEMORY, so maybe the GPU has very little VRAM?

@TheBloke I haven't tried it on my Intel Mac... sorry, I didn't think Apple's drivers would have different API support across architectures. Might also try SlyEcho's fix later on a Hackintosh to see if my Clover & OpenCore loaders crash too.
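For reference, those numeric codes map to named constants in the OpenCL headers:

/* from CL/cl.h */
#define CL_OUT_OF_HOST_MEMORY  -6
#define CL_INVALID_VALUE       -30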

@skidd-level-100

> @skidd-level-100 -6 error is for CL_OUT_OF_HOST_MEMORY, so maybe the GPU has very little VRAM?
>
> @TheBloke I haven't tried it on my Intel Mac... sorry, I didn't think Apple's drivers would have different API support across architectures. Might also try SlyEcho's fix later on a Hackintosh to see if my Clover & OpenCore loaders crash too.

On Linux btw (Fedora), CLBlast was working fine on integrated graphics, but then we recompiled with the latest pull and it broke with -6.

@TheBloke
Contributor Author

> @skidd-level-100 -6 error is for CL_OUT_OF_HOST_MEMORY, so maybe the GPU has very little VRAM?
>
> @TheBloke I haven't tried it on my Intel Mac... sorry, I didn't think Apple's drivers would have different API support across architectures. Might also try SlyEcho's fix later on a Hackintosh to see if my Clover & OpenCore loaders crash too.

OK thanks. FYI I'm on a Hackintosh too. No sign of any crashes, but I tried Sly's fix and got exactly the same issue with the long error message about quantisation formats.

Maybe I'll test with an fp16 model later

@nicdesousa

Please see: https://github.com/ggerganov/llama.cpp/pull/1435/files#r1193090108
I'm running on an:

  Model Name:		MacBook Pro
  Model Identifier:	MacBookPro15,3
  Processor Name:	8-Core Intel Core i9
  Processor Speed:	2.4 GHz
  Memory:		32 GB

clinfo:

  Device Name                                     AMD Radeon Pro Vega 20 Compute Engine
  Device Vendor                                   AMD
  Device Vendor ID                                0x1021d00
  Device Version                                  OpenCL 1.2
  Driver Version                                  1.2 (Mar 14 2023 21:39:54)
  Device OpenCL C Version                         OpenCL C 1.2
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Max compute units                               20
  Max clock frequency                             740MHz
  Device Partition                                (core)
    Max number of sub-devices                     0
    Supported partition types                     None
    Supported affinity domains                    (n/a)
  Max work item dimensions                        3
  Max work item sizes                             256x256x256
  Max work group size                             256
  Preferred work group size multiple (kernel)     64
  Preferred / native vector sizes
    char                                                 4 / 4
    short                                                2 / 2
    int                                                  1 / 1
    long                                                 1 / 1
    half                                                 0 / 0        (n/a)
    float                                                1 / 1
    double                                               1 / 1        (cl_khr_fp64)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  Yes
  Double-precision Floating-point support         (cl_khr_fp64)
    Denormals                                     Yes
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 Yes
    Round to infinity                             Yes
    IEEE754-2008 fused multiply-add               Yes
    Support is emulated in software               No
  Address bits                                    32, Little-Endian
  Global memory size                              4278190080 (3.984GiB)
  Error Correction support                        No
  Max memory allocation                           1069547520 (1020MiB)
  Unified memory for Host and Device              No
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       32768 bits (4096 bytes)
  Global Memory cache type                        None
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            134217728 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   256 bytes
    Pitch alignment for 2D image buffers          256 pixels
    Max 2D image size                             16384x16384 pixels
    Max 3D image size                             2048x2048x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max number of constant args                     8
  Max constant buffer size                        65536 (64KiB)
  Max size of kernel argument                     1024
  Queue properties
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      37ns
  Execution capabilities
    Run OpenCL kernels                            Yes
    Run native kernels                            No
  printf() buffer size                            134217728 (128MiB)
  Built-in kernels                                (n/a)
  Device Extensions                               cl_APPLE_SetMemObjectDestructor cl_APPLE_ContextLoggingFunctions cl_APPLE_clut cl_APPLE_query_kernel_names cl_APPLE_gl_sharing cl_khr_gl_event cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_image2d_from_buffer cl_khr_depth_images cl_APPLE_command_queue_priority cl_APPLE_command_queue_select_compute_units cl_khr_fp64

Note the "Out-of-order execution: No" entry above.

@TheBloke
Contributor Author

Yeah, I'm still getting that long message about 'automatic variable qualified with an address space' as well:

<program source>:3:2247: error: automatic variable qualified with an address space
typedef uchar uint8_t; typedef int int32_t; typedef uint uint32_t; struct block_q4_0 { float d; uint8_t qs[16]; }; struct block_q4_1 { float d; float m; uint8_t qs[16]; }; struct __attribute__ ((packed)) block_q5_0 { half d; uint32_t qh; uint8_t qs[16]; }; struct block_q5_1 { half d; half m; uint32_t qh; uint8_t qs[16]; }; struct block_q8_0 { float d; uint8_t qs[16]; }; __kernel void dequantize_row_q4_0(__global struct block_q4_0* x, __global float* y) { constant uint qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; const int x0 = (x[i].qs[j] & 0xf) - 8; const int x1 = (x[i].qs[j] >> 4) - 8; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q4_1(__global struct block_q4_1* x, __global float* y) { constant uint qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; const float m = x[i].m; const int x0 = (x[i].qs[j] & 0xf); const int x1 = (x[i].qs[j] >> 4); y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q5_0(__global struct block_q5_0* x, __global float* y) { constant uint qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int32_t x0 = ((x[i].qs[j] & 0xf) | xh_0) - 16; const int32_t x1 = ((x[i].qs[j] >> 4) | xh_1) - 16; y[i*qk + j + 0 ] = x0*d; y[i*qk + j + qk/2] = x1*d; } __kernel void dequantize_row_q5_1(__global struct block_q5_1* x, __global float* y) { constant uint qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = vload_half(0, (__global half*) &x[i].d); const float m = vload_half(0, (__global half*) &x[i].m); uint32_t qh = x[i].qh; const uint8_t xh_0 = ((qh >> (j + 0)) << 4) & 0x10; const uint8_t xh_1 = ((qh >> (j + 12)) ) & 0x10; const int x0 = (x[i].qs[j] & 0xf) | xh_0; const int x1 = (x[i].qs[j] >> 4) | xh_1; y[i*qk + j + 0 ] = x0*d + m; y[i*qk + j + qk/2] = x1*d + m; } __kernel void dequantize_row_q8_0(__global struct block_q8_0* x, __global float* y) { constant uint qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }

@swittk
Contributor

swittk commented May 14, 2023

idk if it helps, but I tried slicing the error message string, and the string from character 2247 onwards is the following:
'qk = 32; const uint i = get_global_id(0) / qk; const uint j = get_local_id(0); const float d = x[i].d; y[i*qk + j] = x[i].qs[j]*d; }'
I searched and it seems that qk is the only variable in the whole file declared as constant rather than const, so maybe that's the issue? (I'm not sure if I understand correctly; correct me if I'm wrong. But I read the OpenCL docs, and as far as I understand, constant is used to declare program-scope (global) variables, so maybe the compiler's complaining that we're declaring constant variables inside function bodies?)

Edit: I changed the declarations from constant uint qk = ... to const uint qk = ... and it appears to compile and run fine for me on my Apple Silicon MacBook Pro; I'll test on my hack with an AMD GPU to see if it works.

Edit 2: The change from constant to const worked :)
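Concretely, the one-qualifier change looks like this:

constant uint qk = QK4_0;   /* rejected: constant is the __constant address-space qualifier */
const uint qk = QK4_0;      /* accepted: plain C const, fine for a local variable */

In OpenCL C, the constant qualifier is only valid for program-scope variables and kernel pointer arguments, not for locals inside a kernel body, which is exactly what the 'automatic variable qualified with an address space' error is saying.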

@SlyEcho
Collaborator

SlyEcho commented May 14, 2023

> declared as constant rather than const

That should be the problem. It is now fixed in #1435; please test it and give feedback.

@TheBloke
Contributor Author

> declared as constant rather than const
>
> That should be the problem. It is now fixed in #1435; please test it and give feedback.

Problem is fixed by #1435, thank you very much!

@Thireus

Thireus commented May 18, 2023

Any perf improvements over CPU? How does it compare against cuBLAS? I'm also on a Hackintosh.

@TheBloke
Contributor Author

Yeah, I saw improvements in prompt processing time; I think it was roughly halved. I didn't do any proper benchmarks and I've not compared against CUBLAS.

Right now it doesn't support the full GPU offloading that's now available with CUBLAS, so it's not going to give the same huge boost that provides. But it's definitely nice to have in situations where you have long context, e.g. ongoing chats, I would say.

And just really nice to see the GPU being used in macOS! :) (well, TBF not the first time, as WebLLM/mlc can do that too - but that has very limited model support right now.)

@Thireus

Thireus commented May 18, 2023

"nice to see the GPU being used in macOS" -> definitely... tell me about it! Main use case is 1% usage when moving windows around. 😂

Thanks for the feedback, I'll give it a go. I tried to compile it a few days ago but failed miserably.

@arch-btw
Contributor

arch-btw commented May 27, 2023

I'm still getting the error related to CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.

@swittk do you know how I can apply this workaround to the new ggml-opencl.cpp file?

queue = clCreateCommandQueue(context, device, 0, &err);

Seems like it's on line 490 now but not sure what to change. Thank you.
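For anyone hitting this: the change is the same in spirit as before. Wherever clCreateCommandQueue is called around that line, replace the properties argument (the third one) with 0, or guard it with the CL_DEVICE_QUEUE_PROPERTIES check sketched earlier in the thread. A minimal sketch, assuming the call still has the same shape in the C++ file:

// pass 0 (an in-order queue) instead of CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE
queue = clCreateCommandQueue(context, device, 0, &err);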

@lopagela

-DLLAMA_HIPBLAS=on gave me very good results so far
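For context, LLAMA_HIPBLAS is the CMake flag for the ROCm/hipBLAS backend, which targets AMD GPUs on Linux rather than macOS. A typical invocation, as a sketch:

cmake -B build -DLLAMA_HIPBLAS=on
cmake --build build --config Release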
