Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 #1413
Conversation
Reference:
llama_print_timings: load time = 14251.20 ms
llama_print_timings: sample time = 129.15 ms / 128 runs (1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms / 8 tokens (1756.32 ms per token)
llama_print_timings: eval time = 238504.60 ms / 127 runs (1877.99 ms per token)
llama_print_timings: total time = 252916.56 ms

SSE3 instructions:
llama_print_timings: load time = 3349.09 ms
llama_print_timings: sample time = 53.06 ms / 52 runs (1.02 ms per token)
llama_print_timings: prompt eval time = 3154.19 ms / 8 tokens (394.27 ms per token)
llama_print_timings: eval time = 23759.20 ms / 51 runs (465.87 ms per token)
llama_print_timings: total time = 27174.93 ms

…r to optimise - Accumulate two acc instead of one:
llama_print_timings: load time = 3137.95 ms
llama_print_timings: sample time = 132.54 ms / 128 runs (1.04 ms per token)
llama_print_timings: prompt eval time = 2943.22 ms / 8 tokens (367.90 ms per token)
llama_print_timings: eval time = 59539.50 ms / 127 runs (468.81 ms per token)
llama_print_timings: total time = 62843.23 ms

- Removed first accumulation, ideas taken from https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/ :
llama_print_timings: load time = 3087.59 ms
llama_print_timings: sample time = 132.04 ms / 128 runs (1.03 ms per token)
llama_print_timings: prompt eval time = 2894.28 ms / 8 tokens (361.78 ms per token)
llama_print_timings: eval time = 58529.67 ms / 127 runs (460.86 ms per token)
llama_print_timings: total time = 61780.98 ms

llama_print_timings: load time = 3021.72 ms
llama_print_timings: sample time = 128.90 ms / 128 runs (1.01 ms per token)
llama_print_timings: prompt eval time = 2826.35 ms / 8 tokens (353.29 ms per token)
llama_print_timings: eval time = 53198.13 ms / 127 runs (418.88 ms per token)
llama_print_timings: total time = 56380.69 ms

llama_print_timings: load time = 2899.92 ms
llama_print_timings: sample time = 127.62 ms / 128 runs (1.00 ms per token)
llama_print_timings: prompt eval time = 2705.68 ms / 8 tokens (338.21 ms per token)
llama_print_timings: eval time = 52500.58 ms / 127 runs (413.39 ms per token)
llama_print_timings: total time = 55559.90 ms
The Windows builds are failing, I think you need to use #if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__) || defined(__SSE3__)
I only put the SSE3 check into defined(), as I don't want to touch the other parts of the code and don't know what is intended there. Let's see if it fixes the problem.
It's fine to use
- Use __SSSE3__ instead of __SSE__
clang-tidy made some suggestions
How can I compile the program so that it uses only SSE3 instructions for inference, on a Linux machine whose CPU supports AVX2? Is it okay to modify the Makefile like this?
That should do it. But I think you need -mssse3, as we check for SSSE3.
It really is -mssse3, and that is what gives the speedup. Plain -msse3 also makes the SSE3 compile flag display as 1, even though it provides no acceleration here.
Hi, I am unfortunately one of the people who have a computer without any AVX instructions, only SSE instructions.
So the runtimes to calculate the tokens are very bad.
Because of that, I implemented ggml_vec_dot_q4_0_q8_0 with SSE instructions to gain some performance. I started by taking the AVX parts and replacing all 256-bit vector instructions with 128-bit vector instructions. After that I did some additional improvements to squeeze some more performance out of the routine.
Run command was
./main -s 1 -t 4 -n 128 ../models/7B/ggml-model-q4_0.bin -p "What is the meaning of life?"
What I started with, without any change to the code:
llama_print_timings: load time = 14251.20 ms
llama_print_timings: sample time = 129.15 ms / 128 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms / 8 tokens ( 1756.32 ms per token)
llama_print_timings: eval time = 238504.60 ms / 127 runs ( 1877.99 ms per token)
llama_print_timings: total time = 252916.56 ms
What I got after just replacing the AVX 256-bit instructions with SSE 128-bit instructions:
llama_print_timings: load time = 3349.09 ms
llama_print_timings: sample time = 53.06 ms / 52 runs ( 1.02 ms per token)
llama_print_timings: prompt eval time = 3154.19 ms / 8 tokens ( 394.27 ms per token)
llama_print_timings: eval time = 23759.20 ms / 51 runs ( 465.87 ms per token)
llama_print_timings: total time = 27174.93 ms
What I got after squeezing:
llama_print_timings: load time = 2899.92 ms
llama_print_timings: sample time = 127.62 ms / 128 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 2705.68 ms / 8 tokens ( 338.21 ms per token)
llama_print_timings: eval time = 52500.58 ms / 127 runs ( 413.39 ms per token)
llama_print_timings: total time = 55559.90 ms
Especially the squeezing part is mostly intuition, as I would expect the compiler to do much of the optimization on its own, for example the prefetching. But all in all it feels like I get more performance when I tell it more specifically what I want.