ggml : multi-thread ggml_rope() (~3-4 times faster on M1) #781
Conversation
Tested on M1.
Before the change (master 5a8c4f6) - second run:
llama_print_timings: load time = 2294.47 ms
llama_print_timings: sample time = 116.82 ms / 128 runs ( 0.91 ms per run)
llama_print_timings: prompt eval time = 2090.16 ms / 8 tokens ( 261.27 ms per token)
llama_print_timings: eval time = 23215.24 ms / 127 runs ( 182.80 ms per run)
llama_print_timings: total time = 25628.04 ms
After the change (commit 625f212) - second run:
llama_print_timings: load time = 1460.12 ms
llama_print_timings: sample time = 94.96 ms / 128 runs ( 0.74 ms per run)
llama_print_timings: prompt eval time = 1272.30 ms / 8 tokens ( 159.04 ms per token)
llama_print_timings: eval time = 21661.06 ms / 127 runs ( 170.56 ms per run)
llama_print_timings: total time = 23217.51 ms
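From these two runs, prompt eval improves from 261.27 ms to 159.04 ms per token (about 1.6x) and eval from 182.80 ms to 170.56 ms per run (about 1.07x); the ~3-4x figure in the title presumably refers to ggml_rope() itself rather than the end-to-end timings.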
for (int64_t i3 = 0; i3 < ne3; i3++) {
    for (int64_t i2 = (mode == 0 ? 0 : n_past); i2 < ne2; i2++) {
        const int p = (mode == 0 ? n_past + i2 : i2);
        for (int64_t i1 = 0; i1 < ne1; i1++) {
            if (ir++ < ir0) continue;
            if (ir   > ir1) break;

            for (int i0 = 0; i0 < n_dims; i0 += 2) {
                const float theta = powf(10000.0, ((float)-i0)/n_dims);
theta can be calculated as theta *= factor in each iteration of the loop. factor can be computed once outside the loop as factor = powf(10000.0, ((float)-2)/n_dims), with the initial theta = p. (A sketch is shown below.)
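A minimal sketch of that suggestion, with an illustrative helper rope_row_sketch standing in for the body of the rope loop (not the actual PR code; the rotation of each (x0, x1) pair follows the standard RoPE formula):

#include <math.h>

// Replace the per-iteration powf() with a running product:
// the angle for the next pair, p * powf(10000.0f, -(i0+2)/n_dims), equals the
// previous angle multiplied by factor = powf(10000.0f, -2.0f/n_dims).
static void rope_row_sketch(int p, int n_dims, float * row) {
    const float factor = powf(10000.0f, -2.0f/(float)n_dims);

    float theta = (float)p;                          // initial theta = p
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float cos_theta = cosf(theta);
        const float sin_theta = sinf(theta);

        const float x0 = row[i0];
        const float x1 = row[i0 + 1];
        row[i0]     = x0*cos_theta - x1*sin_theta;   // rotate the pair
        row[i0 + 1] = x0*sin_theta + x1*cos_theta;

        theta *= factor;                             // angle for the next pair
    }
}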
Open a PR if you observe a performance improvement
How can I use multi-threaded llama on the CPU?
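Assuming the standard llama.cpp command-line tool, the number of CPU threads is set with the -t/--threads flag, e.g. ./main -m ./models/7B/ggml-model-q4_0.bin -p "Hello" -n 128 -t 8 (the model path here is only an example); ggml then splits parallelizable operations such as this rope computation across those threads.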
Generated by Copilot at 625f212
Summary
This pull request improves the performance of the ggml library by parallelizing the rope operation on tensors and modifying the graph executor to handle parallel tasks. It affects the file ggml.c.
Walkthrough
- Parallelize the rope operation for the f32 and f16 data types by dividing the input rows among the available threads, in the ggml_compute_forward_rope_f32 and ggml_compute_forward_rope_f16 functions (link, link, link, link; link, link) — see the sketch after this list
- Modify the graph executor to handle parallel tasks for the rope operation in the ggml.c file (link)
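As a point of reference, this is roughly the row-partitioning pattern implied by the diff above (ith/nth are the ggml thread index and thread count; the helper name rope_row_range is illustrative, not part of the PR):

// Split nr rows across nth threads; thread ith processes rows [*ir0, *ir1).
static void rope_row_range(int nr, int nth, int ith, int * ir0, int * ir1) {
    const int dr = (nr + nth - 1)/nth;        // rows per thread, rounded up
    *ir0 = dr*ith;                            // first row for this thread
    *ir1 = *ir0 + dr < nr ? *ir0 + dr : nr;   // one past the last row, clamped
}

Each thread then walks the full (i3, i2, i1) loop nest but only processes rows whose running index falls inside its [ir0, ir1) range, which is what the ir++ < ir0 / ir > ir1 checks in the diff implement.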