Avoid heavy V transpose operation + improvements #775
Conversation
ggml :
- added ggml_view_3d()
- ggml_view_tensor() now inherits the stride too
- reimplement ggml_cpy() to account for dst stride
- no longer require tensor->data to be memory aligned

llama :
- compute RoPE on 32-bit tensors (should be more accurate)
- store RoPE-ed K in the KV cache
- store transposed V in the KV cache (significant speed-up)
- avoid unnecessary Q copy
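For context, `ggml_view_3d()` creates a strided 3-D view over an existing tensor without copying its data. The sketch below shows roughly how such a view could be taken over a flat cache buffer; the function and helper names exist in ggml, but the dimensions, the wrapper function, and the exact argument types (which have changed between ggml versions) are assumptions for illustration, not code from this PR.

```c
// Sketch only: reinterpret a flat cache buffer as a 3-D strided tensor without
// copying. Names and dimensions are assumed for illustration.
#include "ggml.h"

static struct ggml_tensor * view_cache_3d(
        struct ggml_context * ctx,
        struct ggml_tensor  * kv_k,    // flat F32 cache tensor (assumed)
        int n_embd, int n_head, int n_tokens) {

    const size_t es = ggml_element_size(kv_k);

    // Interpret the cache as [head_dim, n_head, n_tokens]; nb1/nb2 are the byte
    // strides between consecutive heads and consecutive tokens, respectively.
    return ggml_view_3d(ctx, kv_k,
            n_embd/n_head, n_head, n_tokens,
            es*(n_embd/n_head),  // nb1: stride between heads
            es*n_embd,           // nb2: stride between tokens
            0);                  // byte offset into kv_k->data
}
```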
Changes the output, but after several runs the uplift in performance seems promising. Best results after running this PR:
Bonus result of this PR combined with #768.
So roughly 30 ms faster, and 40-45 ms faster combined with #768. Strangely, not everyone who tested that PR saw the same uplift I did, so it would probably be worth checking perplexity.
```c
// KQV = transpose(V) * KQ_soft_max
struct ggml_tensor * KQV = ggml_mul_mat(ctx0, V_trans, KQ_soft_max);
#if 1
```
Shall we check n_token == 1 to take a different branch here?
I have a few ideas that might fix this without the need to branch here. Will try them first.
I will test further later on, but I already want to post some intermediate results. The speed is constant throughout the printing, i.e. the performance degradation has been resolved on my PC (Windows/CMake/VS). In the attachment you can find the result. The output reached 1399 words (counted by MS Word) and then halted; after some time I just used Ctrl+C to exit. I've tried some other options, but many times the output simply ends, so I will try to reproduce the behavior of the one run where it got stuck. All other runs produced shorter text and exited cleanly.
@KASR
Yes indeed 😅 The output preserves the performance. Timings for one of the long runs:
I also did a quick test with various thread counts; the output remains consistent. See attachment: output_llama_fix_cpy_various_threads.txt. So it appears the suggested changes resolve #603 (at least on my PC / configuration) and fix the issues that #439 tackled.
Thanks Georgi! Awesome work! And thanks to everyone else who helped narrow down and evaluate performance along the way! Was neat to see everybody come together to help figure this out ❤️ Dang, that GitHub AI-assisted PR 🔥🔥😂
Human-generated notes
I believe this should resolve: #603 #677 #767
When I deprecated the non-contiguous `ggml_mul_mat()` branch in #439, I grossly underestimated the cost of transposing the V matrix on every token. It's a very heavy memory operation, with tons of cache misses. To solve this, we now store V in the KV cache in transposed state, so we don't need to do it for future tokens.
ggml :
- added `ggml_view_3d()`
- `ggml_view_tensor()` now inherits the stride too
- reimplement `ggml_cpy()` to account for dst stride
- no longer require `tensor->data` to be memory aligned

llama :
- compute RoPE on 32-bit tensors (should be more accurate)
- store RoPE-ed K in the KV cache
- store transposed V in the KV cache (significant speed-up)
- avoid unnecessary Q copy
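To illustrate the transposed-V idea described above, here is a rough sketch of appending the V values for a batch of new tokens into a KV cache that is laid out transposed, so the attention product can later be computed without transposing V on every token. The ggml functions used (`ggml_view_2d`, `ggml_transpose`, `ggml_cpy`, `ggml_build_forward_expand`) are real, but the wrapper function, variable names, dimensions, and offsets are assumptions for illustration and are not copied from this PR.

```c
// Sketch only, not the PR's actual code: append V for N new tokens into a KV
// cache stored transposed (each embedding dimension is a contiguous row of up
// to n_ctx token values). All names and dimensions are assumed.
#include "ggml.h"

static void append_v_transposed(
        struct ggml_context * ctx0,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * kv_v,   // flat cache with n_ctx*n_embd elements (assumed)
        struct ggml_tensor  * Vcur,   // [n_embd, N] values for the current batch (assumed)
        int n_ctx, int n_embd, int n_past, int N) {

    const size_t es = ggml_element_size(kv_v);

    // Destination view: N columns starting at token position n_past, for all
    // n_embd rows. The row stride is n_ctx elements because the cache is transposed.
    struct ggml_tensor * v_dst = ggml_view_2d(ctx0, kv_v,
            N, n_embd,
            (size_t) n_ctx*es,    // nb1: byte stride between embedding rows
            (size_t) n_past*es);  // byte offset of the first new column

    // Strided copy of the transposed current V values into the cache; this relies
    // on the stride-aware ggml_cpy() that this PR introduces.
    ggml_build_forward_expand(gf, ggml_cpy(ctx0, ggml_transpose(ctx0, Vcur), v_dst));

    // Later, attention can use the cached V directly, without a per-token transpose:
    //   KQV = ggml_mul_mat(ctx0, V, KQ_soft_max);
}
```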
🤖 Generated by Copilot at 1868f6c
Summary
✨⚡📈
This pull request enhances the ggml library and the llama project by adding new tensor operations, optimizing existing ones, and using RoPE for self-attention. It also improves the debuggability of the llama project by adding optional timing and plotting code. The main files affected are `ggml.c`, `ggml.h`, and `llama.cpp`.

Walkthrough
- Optimize `llama_eval_internal` by using RoPE and view tensors (link, link)
- Add `ggml_view_3d` to create 3-dimensional view tensors from source tensors (link, link)
- Modify `ggml_new_tensor_impl` to speed up tensor creation (link)
- Modify `ggml_view_tensor` to preserve the source tensor layout (link)
- Add optional timing and plotting code to `llama_eval_internal` (link)

Edit:
Merging this since we observe M1 and Windows working well.
If you notice something not right, feel free to revert. But I think it should be good.