Add AVX acceleration #617
Conversation
That is a very significant increase!
For the vector dot-product, have you looked at ...? In the inner loop, after subtracting 8:
```c
    // Get absolute values of x vectors
    const __m128i ax = _mm_sign_epi8(bx, bx);
    // Sign the values of the y vectors
    const __m128i sy = _mm_sign_epi8(by, bx);
    // Perform multiplication and create 16-bit values
    const __m128i dot = _mm_maddubs_epi16(ax, sy);
    const __m128i ones = _mm_set1_epi16(1);
    i32[j] = _mm_madd_epi16(ones, dot);
}
```

Beware: this is completely untested.
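For readers who want to see the trick end to end, here is a minimal self-contained sketch (not from the PR; the data and variable names are made up) that computes a signed 8-bit dot product with the same `_mm_sign_epi8` / `_mm_maddubs_epi16` / `_mm_madd_epi16` sequence and checks it against a scalar loop:

```c
// Build with e.g. `gcc -mssse3 demo.c` (illustrative sketch, not ggml.c code).
#include <stdio.h>
#include <stdint.h>
#include <tmmintrin.h> // SSSE3 intrinsics (_mm_sign_epi8, _mm_maddubs_epi16)

int main(void) {
    int8_t x[16], y[16];
    for (int i = 0; i < 16; ++i) { x[i] = (int8_t)(i - 8); y[i] = (int8_t)(3 - i); }

    const __m128i bx = _mm_loadu_si128((const __m128i *)x);
    const __m128i by = _mm_loadu_si128((const __m128i *)y);

    const __m128i ax  = _mm_sign_epi8(bx, bx);      // |x|: a valid unsigned operand
    const __m128i sy  = _mm_sign_epi8(by, bx);      // move x's sign onto y, so ax*sy == x*y
    const __m128i dot = _mm_maddubs_epi16(ax, sy);  // 8 pairwise sums of products (16-bit)
    const __m128i one = _mm_set1_epi16(1);
    const __m128i s32 = _mm_madd_epi16(one, dot);   // widen to 4 x 32-bit partial sums

    int32_t p[4];
    _mm_storeu_si128((__m128i *)p, s32);
    int simd = p[0] + p[1] + p[2] + p[3];

    int ref = 0;
    for (int i = 0; i < 16; ++i) ref += x[i] * y[i];

    printf("simd = %d, scalar = %d\n", simd, ref); // the two results should match
    return 0;
}
```

The reason for the sign shuffle is that `_mm_maddubs_epi16` treats its first operand as unsigned bytes; moving x's sign onto y keeps every product equal to x*y while the first operand stays non-negative (the values after subtracting 8 stay well away from -128, so taking the absolute value is safe).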
Great job! In my case I am using alpaca.cpp, so I just modified ggml.c according to @perserk's modifications; I don't know what difference that might make. I saw an improvement, although not as dramatic as the one the OP had. I'm on an i7-2630QM with 16 GB RAM. I hope I implemented it correctly. When I have time I'll try this on llama.cpp.
@RiccaDS why would you use alpaca.cpp? AFAIK llama.cpp has all of alpaca.cpp's features and more.
@Green-Sky Merely a matter of first approach. I heard about Alpaca and LLaMA a few days ago. From my few readings I understood that Alpaca was more ChatGPT-like, whilst LLaMA was more like the auto-completion core on which Alpaca is based. But I might have understood it incorrectly; my knowledge of these models is pretty scarce, let alone of AI in general. I will try out llama.cpp too, thanks.
@sw Thanks for the link and the patch. I changed the code and it got a bit faster, at least not slower.
It's looking good, almost too good ;-) because your AVX acceleration is almost as fast as the AVX2 we currently have. What I mean by that: if I have an AVX2-capable CPU and disable the explicit AVX2 code path like this:

```c
#elif 0 //defined(__AVX2__)
... // don't use AVX2 explicitly
#elif defined(__AVX__)
... // but use AVX intrinsics and let the compiler do its magic
```

it runs about as fast. This means we could just have a plain AVX implementation that would work on your Xeon but also be just as fast on AVX2 processors as before. Can anyone else confirm or disprove this? This would be preferable because the code is getting quite unwieldy with all the different processor optimizations.
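For context, here is a hedged sketch of that kind of per-ISA dispatch (illustrative only, not the actual layout of ggml.c, which has more branches); flipping the first condition to 0 makes an AVX2-capable build compile the plain AVX branch, so the two paths can be timed on the same machine:

```c
#if 0 // defined(__AVX2__)   // temporarily disabled to benchmark the AVX path
    // hand-written AVX2 implementation
#elif defined(__AVX__)
    // plain AVX implementation (the one added in this PR)
#else
    // scalar fallback
#endif
```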
I think this is very possible, and it may help explain the observations in #603. In short, I believe the AVX2 code has room for improvement.
I tried adapting the same technique used here to AVX2 (src) and it performs a little better:
While it is not exactly twice as fast, there is still a significant performance advantage in using AVX2. This change reduces eval times for me just a bit, but in the perplexity calculation the difference is bigger, from 43 seconds/pass to 35 seconds/pass.
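For illustration, here is a sketch of the same sign/maddubs idea expressed with 256-bit AVX2 intrinsics (an assumed analogue of the linked change, not a copy of it; the helper name `dot_i8x32` is invented):

```c
#include <immintrin.h>

// Multiply 32 signed 8-bit pairs and reduce to 8 x 32-bit partial sums.
static inline __m256i dot_i8x32(__m256i bx, __m256i by) {
    const __m256i ax  = _mm256_sign_epi8(bx, bx);     // absolute values of x
    const __m256i sy  = _mm256_sign_epi8(by, bx);     // y with x's sign applied
    const __m256i dot = _mm256_maddubs_epi16(ax, sy); // 16 x 16-bit pairwise sums
    const __m256i one = _mm256_set1_epi16(1);
    return _mm256_madd_epi16(one, dot);               // 8 x 32-bit partial sums
}
```

The 32-bit partial sums can be accumulated across blocks and horizontally added once outside the inner loop.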
If you observe good results with this and #642, go ahead and merge both PRs. I won't have much time to look today. Btw, here is also something worth looking into: #205 (comment)
* ggml : add AVX quantize_row_q4_0()
* ggml : add AVX ggml_vec_dot_q4_0()
* ggml : refactor AVX part of ggml_vec_dot_q4_0() ggml-org#617 (comment)
My old Xeon E5-2670 doesn't have AVX2 support. So I have added AVX acceleration to quantize_row_q4_0() and ggml_vec_dot_q4_0().
Here is the result before the change:
and after: