
iqk_mul_mat: better strategy when nrc_y not divisible by ny #71

Merged
merged 1 commit into main on Oct 1, 2024

Conversation

ikawrakow (Owner)

In the llamafile repository, @Djip007 posted PP results for short prompt lengths in steps of 1, and one sees a sharp drop in performance at 9 tokens for Q6_K and Q5_K_M. Why? For these quants llamafile uses the iqk_mul_mat implementation that I contributed there, so the matrix multiplication is done with 1x8 tiles. The way it is implemented there (and also here on the main branch), we first multiply with 8 columns from the right matrix and then take a second pass to multiply with the remaining 9th column. This second pass is much slower, so overall performance drops. I was of course aware that this effect would occur, and always meant to investigate it, but never did. Now that the results are published, it is time to fix it via this PR.

When the number of columns N in the right matrix is not divisible by the maximum tile size n_max, a better strategy for performing the matrix multiplication is this:

  • M = (N + n_max - 1)/n_max is the number of passes needed for the full matrix multiplication (loop over tiles of B columns)
  • Let n = N/M (integer division). We take m passes with a tile size of n and (M - m) passes with a tile size of n+1
  • n*m + (n+1)*(M - m) must equal N; expanding gives (n+1)*M - m = N, so m = M*(n+1) - N (see the sketch after this list)
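
A minimal sketch of this schedule, assuming n_max = 8 as in the Q6_K/Q5_K_M case above (print_schedule is an illustrative helper, not code from iqk_mul_mat):

```cpp
#include <cstdio>

// Print the per-pass tile widths for N columns with maximum tile size n_max,
// following the balanced schedule described above.
static void print_schedule(int N, int n_max) {
    int M = (N + n_max - 1) / n_max; // number of passes (ceiling division)
    int n = N / M;                   // smaller tile width (integer division)
    int m = M * (n + 1) - N;         // passes that use the smaller width n
    printf("N = %2d:", N);
    for (int pass = 0; pass < M; ++pass) {
        printf(" %d", pass < m ? n : n + 1);
    }
    printf("\n");
}

int main() {
    // For N = 9 this prints "N =  9: 4 5": two nearly equal passes instead
    // of a fast pass over 8 columns followed by a slow pass over the last one.
    for (int N = 1; N <= 16; ++N) {
        print_schedule(N, /*n_max=*/8);
    }
    return 0;
}
```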

This strategy is implemented in this PR. The following graph shows performance (tokens per second) for LLaMA-3.2-3B as a function of prompt length for the main branch (black) and this PR (red). This is for a bf16 model where the tile size is 5 x 5, so the main branch is equivalent to this PR for prompt lengths <= 5 (a single pass) and again at 10, 15, 20, 25 and 30 tokens, but is significantly slower for prompt lengths that are not a multiple of 5. The PR shows the smooth increase in performance with prompt length that one would expect.

[Figure: iqk_strategy — tokens per second vs. prompt length, main branch (black) vs. this PR (red)]

ikawrakow merged commit 8cba478 into main on Oct 1, 2024

Djip007 commented Nov 26, 2024

I was thinking of doing something for that too (in tinyBLAS), but not that way. Good to see that it works; I may use it in some other cases...
Good job!

Will you do the same in tinyBLAS for the other cases (FP16/BF16/...)?

ikawrakow (Owner, Author)

Will you do the same in tinyBLAS for the other cases (FP16/BF16/...)?

In my case all matrix multiplications are driven by the same function, so this change benefits all types. I think in tinyBLAS one needs to do it for every version of mnpack.
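
For illustration, a hedged sketch of how such a driver might look: compute the balanced split once, then dispatch each pass to a kernel specialized for its tile width (dispatch_tile and gemm_driver are made-up names; tinyBLAS's real mnpack dispatch differs):

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical per-width dispatch in the spirit of tinyBLAS's mnpack: each
// tile width needs its own specialized kernel, which is why the change would
// have to be wired into every mnpack variant. Kernels are stubbed out here.
static void dispatch_tile(int64_t col0, int64_t width) {
    switch (width) {
        case 4: printf("4-column kernel on columns [%lld, %lld)\n",
                       (long long)col0, (long long)(col0 + width)); break;
        case 5: printf("5-column kernel on columns [%lld, %lld)\n",
                       (long long)col0, (long long)(col0 + width)); break;
        default: /* ... one case per supported width ... */ break;
    }
}

// Hypothetical driver: compute the balanced split once, then dispatch each
// pass to the kernel for its width.
static void gemm_driver(int64_t N, int64_t n_max) {
    int64_t M = (N + n_max - 1) / n_max; // total passes
    int64_t n = N / M;                   // smaller tile width
    int64_t m = M * (n + 1) - N;         // passes that use width n
    for (int64_t pass = 0, col = 0; pass < M; ++pass) {
        int64_t width = pass < m ? n : n + 1;
        dispatch_tile(col, width);
        col += width;
    }
}

int main() {
    gemm_driver(/*N=*/9, /*n_max=*/5); // two passes: widths 4 and 5
    return 0;
}
```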

Djip007 added a commit to Djip007/llama.cpp that referenced this pull request Dec 8, 2024
- add bf16 suport
- change dispache strategie (thanks:
ikawrakow/ik_llama.cpp#71 )
- reduce memory bandwidth

simple tinyblas dispache and more cache freindly
Djip007 commented Dec 9, 2024

OK, I think I figured out how to do it for FP16/BF16/FP32 in tinyBLAS...
Mozilla-Ocho/llamafile#654

Some benchmarks are still WIP, but so far it looks good.

Djip007 added a commit to Djip007/llamafile that referenced this pull request Dec 11, 2024
- change dispache strategie (thanks:
ikawrakow/ik_llama.cpp#71 )
- more cache freindly
Djip007 added a commit to Djip007/llama.cpp that referenced this pull request Dec 22, 2024
- add bf16 suport
- change dispache strategie (thanks:
ikawrakow/ik_llama.cpp#71 )
- reduce memory bandwidth

simple tinyblas dispache and more cache freindly
slaren pushed a commit to ggerganov/llama.cpp that referenced this pull request Dec 24, 2024
* more perfo with llamafile tinyblas on x86_64.

- add bf16 suport
- change dispache strategie (thanks:
ikawrakow/ik_llama.cpp#71 )
- reduce memory bandwidth

simple tinyblas dispache and more cache freindly

* tinyblas dynamic dispaching

* sgemm: add M blocs.

* - git 2.47 use short id of len 9.
- show-progress is not part of GNU Wget2

* remove not stable test