
Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 #1413

Merged
merged 9 commits into ggml-org:master on May 13, 2023

Conversation

3ooabkhxtn
Contributor

@3ooabkhxtn 3ooabkhxtn commented May 12, 2023

Hi, I am unfortunately one of those people whose computer has no AVX instructions, only SSE instructions.

So the token generation times are very bad.

To fix that, I implemented ggml_vec_dot_q4_0_q8_0 with SSE instructions to gain some performance. I started by taking the AVX parts and replacing all 256-bit vector instructions with 128-bit vector instructions. After that I made some additional improvements to squeeze more performance out of the routine.

Run command was

./main -s 1 -t 4 -n 128 ../models/7B/ggml-model-q4_0.bin -p "What is the meaning of life?"

What I started with, without any changes to the code:

llama_print_timings: load time = 14251.20 ms
llama_print_timings: sample time = 129.15 ms / 128 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms / 8 tokens ( 1756.32 ms per token)
llama_print_timings: eval time = 238504.60 ms / 127 runs ( 1877.99 ms per token)
llama_print_timings: total time = 252916.56 ms

What I got after just replacing the AVX 256-bit instructions with SSE 128-bit instructions:

llama_print_timings: load time = 3349.09 ms
llama_print_timings: sample time = 53.06 ms / 52 runs ( 1.02 ms per token)
llama_print_timings: prompt eval time = 3154.19 ms / 8 tokens ( 394.27 ms per token)
llama_print_timings: eval time = 23759.20 ms / 51 runs ( 465.87 ms per token)
llama_print_timings: total time = 27174.93 ms

What I got after squeezing:

llama_print_timings: load time = 2899.92 ms
llama_print_timings: sample time = 127.62 ms / 128 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 2705.68 ms / 8 tokens ( 338.21 ms per token)
llama_print_timings: eval time = 52500.58 ms / 127 runs ( 413.39 ms per token)
llama_print_timings: total time = 55559.90 ms

Especially the squeezing part is just a feeling, as I would expect the compiler to do much of this on its own, for example the prefetching. But all in all, it seems I get more performance when I tell it more specifically what I want.

3ooabkhxtn added 5 commits May 12, 2023 07:54

SSE3 instructions

…r to optimise

- Accumulate two acc instead of one

llama_print_timings:        load time =  3137.95 ms
llama_print_timings:      sample time =   132.54 ms /   128 runs   (    1.04 ms per token)
llama_print_timings: prompt eval time =  2943.22 ms /     8 tokens (  367.90 ms per token)
llama_print_timings:        eval time = 59539.50 ms /   127 runs   (  468.81 ms per token)
llama_print_timings:       total time = 62843.23 ms
- Removed first accumulation

ideas taken from here https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/

llama_print_timings:        load time =  3087.59 ms
llama_print_timings:      sample time =   132.04 ms /   128 runs   (    1.03 ms per token)
llama_print_timings: prompt eval time =  2894.28 ms /     8 tokens (  361.78 ms per token)
llama_print_timings:        eval time = 58529.67 ms /   127 runs   (  460.86 ms per token)
llama_print_timings:       total time = 61780.98 ms
llama_print_timings:        load time =  3021.72 ms
llama_print_timings:      sample time =   128.90 ms /   128 runs   (    1.01 ms per token)
llama_print_timings: prompt eval time =  2826.35 ms /     8 tokens (  353.29 ms per token)
llama_print_timings:        eval time = 53198.13 ms /   127 runs   (  418.88 ms per token)
llama_print_timings:       total time = 56380.69 ms
@3ooabkhxtn 3ooabkhxtn changed the title Adding SSES instructions to ggml_vec_dot_q4_0_q8_0 Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 May 12, 2023
@sw
Contributor

sw commented May 12, 2023

The Windows builds are failing; I think you need to use defined():

#if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__) || defined(__SSE3__)

@3ooabkhxtn
Contributor Author

defined(__SSE3__)

I only put the SSE3 check into defined(), as I didn't want to touch the other parts of the code since I don't know what is intended there. Let's see if it fixes the problem.

@sw
Contributor

sw commented May 12, 2023

It's fine to use defined() for all the variants, but it should actually be __SSSE3__ (note: three S), because _mm_maddubs_epi16 is only supported by Supplemental SSE3.

- Use __SSSE3__ instead of __SSE__
Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

@sw sw merged commit ac0cd25 into ggml-org:master May 13, 2023
@rankaiyx
Contributor

rankaiyx commented May 30, 2023

How can I compile the program so that it uses only SSE3 instructions for inference, on a Linux machine whose CPU has AVX2?

Is it okay to modify the Makefile like this?

# Use all CPU extensions that are available:
#CFLAGS   += -march=native -mtune=native
#CXXFLAGS += -march=native -mtune=native

# Usage AVX-only
#CFLAGS   += -mfma -mf16c -mavx
#CXXFLAGS += -mfma -mf16c -mavx

CFLAGS   += -msse3
CXXFLAGS += -msse3

@3ooabkhxtn
Contributor Author

That should do it. But I think you need -mssse3, as we check for __SSSE3__.
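Putting the two comments together, the Makefile edit would presumably look like this (a sketch; flag names per GCC/Clang):

```makefile
# Use all CPU extensions that are available:
#CFLAGS   += -march=native -mtune=native
#CXXFLAGS += -march=native -mtune=native

# Force the SSE path: the code checks __SSSE3__, so use -mssse3, not -msse3.
CFLAGS   += -mssse3
CXXFLAGS += -mssse3
```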

@rankaiyx
Contributor

rankaiyx commented May 31, 2023

It's indeed ssse3, and that gives the speedup. Plain sse3 also makes the SSE3 flag in system_info show as 1, even though there is no speedup.
SSE3:
system_info: n_threads = 30 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
SSSE3:
system_info: n_threads = 30 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
