iqk_mul_mat: better strategy when nrc_y not divisible by ny #71
Conversation
I was thinking of doing something for that too (in tinyBLAS), but not that way. Good to see that it works; I may use it in some other cases... Will you do the same in tinyBLAS for the other cases (FP16/BF16/...)?
In my case all matrix multiplications are driven by the same function, so this change benefits all types. I think in tinyBLAS one needs to do it for every version of
- add bf16 support - change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71 ) - reduce memory bandwidth
- add bf16 support - change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71 ) - reduce memory bandwidth; simple tinyblas dispatch and more cache friendly
OK, I think I figured out how to do it for FP16/BF16/FP32 in tinyblas... some benchmarks are WIP, but so far it looks good.
- change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71 ) - more cache friendly
* more performance with llamafile tinyblas on x86_64. - add bf16 support - change dispatch strategy (thanks: ikawrakow/ik_llama.cpp#71 ) - reduce memory bandwidth; simple tinyblas dispatch and more cache friendly * tinyblas dynamic dispatching * sgemm: add M blocks. * - git 2.47 uses short id of len 9. - show-progress is not part of GNU Wget2 * remove unstable test
In the llamafile repository @Djip007 has posted PP results for short prompt lengths in steps of 1, and one sees a sharp drop in performance at 9 tokens for `Q6_K` and `Q5_K_M`. Why? For these quants llamafile uses `iqk_mul_mat`, which I have contributed there, so the matrix multiplication is done using 1x8 tiles. The way it is implemented there (and also here on the main branch) is that we first multiply with 8 columns from the right matrix and then make a second pass to multiply with the remaining 9th column. This second pass is much slower, so overall performance drops. I was of course aware that this effect would exist, and always meant to investigate it, but never did. Now that the results are published, it is time to fix it via this PR.

When the number of columns `N` in the right matrix is not divisible by the maximum tile size `n_max`, a better strategy for performing the matrix multiplication is this:
- `M = (N + n_max - 1)/n_max` is the number of passes we need for the full matrix multiplication (loops over B-column tiles)
- `n = N/M` (integer division). We take `m` passes with a tile size of `n` and `(M - m)` passes with a tile size of `n+1`
- `n*m + (n+1)*(M - m)` must equal `N`, so we get `m = M*(n+1) - N`
This strategy is implemented in this PR. The following graph shows performance (tokens per second) for LLaMA-3.2-3B as a function of prompt length for the main branch (black) and this PR (red). This is for a `bf16` model where the tile size is 5 x 5, so we see the main branch being equivalent to this PR for prompt lengths <= 5 (single pass) and then for 10, 15, 20, 25 and 30 tokens, but being significantly slower for prompt lengths that are not a multiple of 5. The PR shows a nice smooth increase in performance, as one would expect.