optimize rope function to avoid calling powf in the loop #807
Conversation
Can you re-run the tests a few times (say, 3–5 runs each) and confirm that the improvement is not just random noise in how long the process takes to complete?
Yes, I actually did that already. The above is one sample:
Before:
After:
I've also given it a try. I used Python to loop over the configurations, using 8, 16, and 24 threads. Each test was evaluated 6 times, and I can set the loop for a larger number of runs.
If you can share your Python script, that would be great. It would be a good tool to help with performance experiments.
Sure. As a warning, it's just a simple loop 😅. You can find the code here: benchmark_pr_llama.py
Are we confident that this iterative multiplication is numerically the same as calling powf on each iteration?
Yes, this is not an approximation; it's just a math identity.
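For concreteness, the identity in question is base^(-(i+2)/d) = base^(-i/d) · base^(-2/d): each step multiplies the previous value by a constant factor, so the single powf call can be hoisted out of the loop. Below is a minimal, self-contained C sketch of the pattern (illustrative names and constants, not the actual ggml.c diff) that compares the per-iteration powf schedule against the running product:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    const int   n_dims = 128;      /* illustrative head dimension */
    const float base   = 10000.0f; /* RoPE frequency base */
    const float pos    = 42.0f;    /* an arbitrary token position */

    /* Original pattern: one powf call per loop iteration. */
    float theta_pow[64];
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        theta_pow[i0/2] = pos * powf(base, -(float)i0 / n_dims);
    }

    /* Optimized pattern: hoist powf out of the loop and update theta
       by the constant factor base^(-2/n_dims) on each iteration. */
    const float theta_scale = powf(base, -2.0f / n_dims);
    float theta       = pos;  /* base^0 == 1, so the first theta is pos */
    float max_rel_err = 0.0f;
    for (int i0 = 0; i0 < n_dims; i0 += 2) {
        const float ref = theta_pow[i0/2];
        const float rel = fabsf(theta - ref) / fabsf(ref);
        if (rel > max_rel_err) max_rel_err = rel;
        theta *= theta_scale;
    }

    printf("max relative difference: %g\n", max_rel_err);
    return 0;
}
```

In exact arithmetic the two schedules are identical by the identity above; in float32 the running product can drift by a few ULPs relative to powf, and this sketch lets you quantify that drift for a given n_dims.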
This slightly improves performance.
Before:
llama_print_timings: load time = 6973.64 ms
llama_print_timings: sample time = 345.88 ms / 128 runs ( 2.70 ms per run)
llama_print_timings: prompt eval time = 6227.73 ms / 29 tokens ( 214.75 ms per token)
llama_print_timings: eval time = 34661.36 ms / 127 runs ( 272.92 ms per run)
llama_print_timings: total time = 41992.88 ms
After:
llama_print_timings: load time = 7668.98 ms
llama_print_timings: sample time = 345.03 ms / 128 runs ( 2.70 ms per run)
llama_print_timings: prompt eval time = 6906.97 ms / 29 tokens ( 238.17 ms per token)
llama_print_timings: eval time = 33445.33 ms / 127 runs ( 263.35 ms per run)
llama_print_timings: total time = 41471.40 ms
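From these numbers, the per-token eval time drops from 272.92 ms to 263.35 ms, i.e. (272.92 − 263.35) / 272.92 ≈ 3.5% faster on this sample; load and prompt-eval times vary from run to run, which is why repeated runs were requested above.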