
Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 #1413

Merged
merged 9 commits into ggml-org:master on May 13, 2023

Conversation

3ooabkhxtn
Contributor

@3ooabkhxtn 3ooabkhxtn commented May 12, 2023

Hi, I am unfortunately one of those people whose computer has no AVX instructions, only SSE instructions.

So the token generation times are very bad.

To fix that, I implemented ggml_vec_dot_q4_0_q8_0 with SSE instructions to gain some performance. I started by taking the AVX parts and replacing all 256-bit vector instructions with 128-bit vector instructions. After that I made some additional improvements to squeeze more performance out of the routine.

Run command was

./main -s 1 -t 4 -n 128 ../models/7B/ggml-model-q4_0.bin -p "What is the meaning of life?"

What I started with, without any changes to the code:

llama_print_timings: load time = 14251.20 ms
llama_print_timings: sample time = 129.15 ms / 128 runs ( 1.01 ms per token)
llama_print_timings: prompt eval time = 14050.58 ms / 8 tokens ( 1756.32 ms per token)
llama_print_timings: eval time = 238504.60 ms / 127 runs ( 1877.99 ms per token)
llama_print_timings: total time = 252916.56 ms

What I got after just replacing the AVX 256-bit instructions with SSE 128-bit instructions:

llama_print_timings: load time = 3349.09 ms
llama_print_timings: sample time = 53.06 ms / 52 runs ( 1.02 ms per token)
llama_print_timings: prompt eval time = 3154.19 ms / 8 tokens ( 394.27 ms per token)
llama_print_timings: eval time = 23759.20 ms / 51 runs ( 465.87 ms per token)
llama_print_timings: total time = 27174.93 ms

What I got after squeezing:

llama_print_timings: load time = 2899.92 ms
llama_print_timings: sample time = 127.62 ms / 128 runs ( 1.00 ms per token)
llama_print_timings: prompt eval time = 2705.68 ms / 8 tokens ( 338.21 ms per token)
llama_print_timings: eval time = 52500.58 ms / 127 runs ( 413.39 ms per token)
llama_print_timings: total time = 55559.90 ms

Especially the squeezing part is just a feeling, as I would expect the compiler to do much of this on its own, for example the prefetching. But all in all, it seems I get more performance when I tell it more specifically what I want.

3ooabkhxtn added 5 commits May 12, 2023 07:54

SSE3 instructions

…r to optimise

- Accumulate two acc instead of one

llama_print_timings:        load time =  3137.95 ms
llama_print_timings:      sample time =   132.54 ms /   128 runs   (    1.04 ms per token)
llama_print_timings: prompt eval time =  2943.22 ms /     8 tokens (  367.90 ms per token)
llama_print_timings:        eval time = 59539.50 ms /   127 runs   (  468.81 ms per token)
llama_print_timings:       total time = 62843.23 ms
- Removed first accumulation

ideas taken from here https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/

llama_print_timings:        load time =  3087.59 ms
llama_print_timings:      sample time =   132.04 ms /   128 runs   (    1.03 ms per token)
llama_print_timings: prompt eval time =  2894.28 ms /     8 tokens (  361.78 ms per token)
llama_print_timings:        eval time = 58529.67 ms /   127 runs   (  460.86 ms per token)
llama_print_timings:       total time = 61780.98 ms
llama_print_timings:        load time =  3021.72 ms
llama_print_timings:      sample time =   128.90 ms /   128 runs   (    1.01 ms per token)
llama_print_timings: prompt eval time =  2826.35 ms /     8 tokens (  353.29 ms per token)
llama_print_timings:        eval time = 53198.13 ms /   127 runs   (  418.88 ms per token)
llama_print_timings:       total time = 56380.69 ms
@3ooabkhxtn 3ooabkhxtn changed the title Adding SSES instructions to ggml_vec_dot_q4_0_q8_0 Adding SSE instructions to ggml_vec_dot_q4_0_q8_0 May 12, 2023
@sw
Contributor

sw commented May 12, 2023

The Windows builds are failing; I think you need to use defined():

#if defined(__AVX__) || defined(__AVX2__) || defined(__AVX512F__) || defined(__SSE3__)

@3ooabkhxtn
Contributor Author

defined(__SSE3__)

I only put the SSE3 check into defined(), as I didn't want to touch the other parts of the code since I don't know what is intended there. Let's see if it fixes the problem.

@sw
Contributor

sw commented May 12, 2023

It's fine to use defined() for all the variants, but it should actually be __SSSE3__ (note: three S), because _mm_maddubs_epi16 is only supported by Supplemental SSE3.

- Use __SSSE3__ instead of __SSE__
Contributor

@github-actions github-actions bot left a comment


clang-tidy made some suggestions

@sw sw merged commit ac0cd25 into ggml-org:master May 13, 2023
@rankaiyx
Contributor

rankaiyx commented May 30, 2023

How can I compile the program so that it uses only SSE3 instructions for inference, on a Linux machine whose CPU has AVX2?

Is it okay to modify the Makefile like this?

# Use all CPU extensions that are available:
#CFLAGS   += -march=native -mtune=native
#CXXFLAGS += -march=native -mtune=native

# Usage AVX-only
#CFLAGS   += -mfma -mf16c -mavx
#CXXFLAGS += -mfma -mf16c -mavx

CFLAGS   += -msse3
CXXFLAGS += -msse3

@3ooabkhxtn
Contributor Author

That should do it. But I think you need -mssse3, as we check for __SSSE3__.
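Putting the two comments together, the Makefile edit would presumably look like this (a sketch; flag names per GCC/Clang):

```makefile
# Use all CPU extensions that are available:
#CFLAGS   += -march=native -mtune=native
#CXXFLAGS += -march=native -mtune=native

# Force the SSE path: the code checks __SSSE3__, so use -mssse3, not -msse3.
CFLAGS   += -mssse3
CXXFLAGS += -mssse3
```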

@rankaiyx
Contributor

rankaiyx commented May 31, 2023

It's indeed ssse3, and that gives the speedup. Plain sse3 also makes the SSE3 flag in system_info show as 1, even though there is no speedup.
SSE3:
system_info: n_threads = 30 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
SSSE3:
system_info: n_threads = 30 / 32 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
