-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CLBlast: Support broadcasting for matrix multiplication and GQA #3402
Conversation
Broadcast src0 into src1 across dimensions 2 and 3 when needed. This is required for models that use GQA.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking a look into the OpenCL backend.
My impression is that lately it has been getting less and less attention, so it is possible that there are multiple issues as you describe.
Would be great to fix these and this looks like a good step
I've done some more testing, and my confidence in correctness of results increased. If anyone wants to review code style and commit message, please do. If there are no suggestions on that front, then this can be merged. |
I can take a look later today. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thank you for fixing this issue.
It's quite an awesome little addition. I noticed the lack of broadcasting support in ggml-cuda, I tried to add it quite similar to this patch (just in ggml-cuda it's in the op() function as it has no dedicated multiplication one) but the resulting performance was a fraction of the CPU speed (multi GPU test). (I should add that cuda/cublas attempt was ~2 months ago, maybe the underlying code has change a lot but it didn't look so at first glance) |
I made it so that it loops over dimensions 2 and 3 of src1 and computes corresponding coordinates in src0. But it is better to have outer loops over src0 and inner loops over src1. |
…example * 'master' of github.com:ggerganov/llama.cpp: (24 commits) convert : fix Baichuan2 models by using vocab size in config.json (ggerganov#3299) readme : add project status link ggml : fix build after ggerganov#3329 llm : add Refact model (ggerganov#3329) sync : ggml (conv 1d + 2d updates, UB fixes) (ggerganov#3468) finetune : readme fix typo (ggerganov#3465) ggml : add RISC-V Vector Support for K-Quants and improved the existing intrinsics (ggerganov#3453) main : consistent prefix/suffix coloring (ggerganov#3425) llama : fix session saving/loading (ggerganov#3400) llama : expose model's rope_freq_scale in the API (ggerganov#3418) metal : alibi for arbitrary number of heads (ggerganov#3426) cmake : make LLAMA_NATIVE flag actually use the instructions supported by the processor (ggerganov#3273) Work on the BPE tokenizer (ggerganov#3252) convert : fix vocab size when not defined in hparams (ggerganov#3421) cmake : increase minimum version for add_link_options (ggerganov#3444) CLBlast: Add broadcast support for matrix multiplication (ggerganov#3402) gguf : add BERT, MPT, and GPT-J arch info (ggerganov#3408) gguf : general usability improvements (ggerganov#3409) cmake : make CUDA flags more similar to the Makefile (ggerganov#3420) finetune : fix ggerganov#3404 (ggerganov#3437) ...
Broadcast src0 into src1 across dimensions 2 and 3 when needed, in line with other back-ends.
Fixes #3002
Fixes #3263
Successfully tested with Llama 2 models, both GQA and non-GQA. I've also done some limited testing of matrix multiplication in isolation. However, a number of existing bugs in ggml-opencl.cpp interferes with my testing:
Despite that, I believe that this change does not break inference with previously supported models, and newly enabled functionality is on par with existing.