AVX2 ggml_vec_dot_q4_0 performance improvement ~5% #768

Conversation

@x02Sylvie commented Apr 4, 2023

Prefetches some data before its usage (based on the original work in #295).

Before (usually 245-260 ms): [screenshot: prefetch_wo]

After (usually 226-234 ms): [screenshot: prefetch_w]

Tested on Windows 10 + i7-10700K; needs further testing on different OSes and CPUs to make sure there are no unintentional issues, hence the draft status. Naming the constants appropriately is also still to do!
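In outline, the technique looks like this (a minimal, self-contained sketch rather than the exact diff; the UNROLL_COUNT value, the PREFETCH_BLOCKS_AHEAD name, and the T0 hint are illustrative assumptions, not taken from the PR):

#include <immintrin.h>
#include <stdint.h>

#define QK 32

// q4_0 block layout at the time of this PR: one fp32 scale plus 16 bytes of
// packed 4-bit quants, i.e. sizeof(block_q4_0) == 20.
typedef struct {
    float   d;          // scale
    uint8_t qs[QK / 2]; // 4-bit quants, two per byte
} block_q4_0;

#define UNROLL_COUNT          8  // illustrative; the real value lives in ggml.c
#define PREFETCH_BLOCKS_AHEAD 32 // illustrative name for the hard-coded 32 in 32*20

void dot_sketch(int nb, const block_q4_0 * restrict x) {
    for (int i = 0; i < nb; i += UNROLL_COUNT) {
        for (int u = 0; u < UNROLL_COUNT; u++) {
            // Hint the CPU to start pulling the block we'll touch ~32 iterations
            // from now into cache while the current block is being unpacked.
            // Prefetch is only a hint, so running past the end of x is harmless.
            _mm_prefetch((const char *)(x + i + u + PREFETCH_BLOCKS_AHEAD), _MM_HINT_T0);
            // ... unpack x[i+u].qs with AVX2, scale by x[i+u].d, accumulate ...
        }
    }
}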

Commit: Prefetch data used later in the loop
@rabidcopy (Contributor) commented Apr 4, 2023

Well, running master and this PR about 5 times in a row each, here's what I got. Ryzen 2600 on Linux.
./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1
Latest master:

llama_print_timings:        load time =  1526.48 ms
llama_print_timings:      sample time =    66.08 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  1784.33 ms /    14 tokens (  127.45 ms per token)
llama_print_timings:        eval time = 20384.87 ms /    99 runs   (  205.91 ms per run)
llama_print_timings:       total time = 22740.42 ms

This PR:

llama_print_timings:        load time =  2450.54 ms
llama_print_timings:      sample time =    78.16 ms /   100 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =  1924.37 ms /    14 tokens (  137.46 ms per token)
llama_print_timings:        eval time = 19275.24 ms /    99 runs   (  194.70 ms per run)
llama_print_timings:       total time = 21805.14 ms

Edit: On a 13B model, the average is about 325 ms versus 315 ms.

@ivanstepanovftw (Collaborator) commented Apr 4, 2023

Same for me, 7B q4_0.
Fedora Linux, AMD Ryzen 7 5800U.

Original:
llama_print_timings:        load time =  1045.07 ms
llama_print_timings:      sample time =    21.53 ms /    40 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =   902.05 ms /    16 tokens (   56.38 ms per token)
llama_print_timings:        eval time =  7273.96 ms /    39 runs   (  186.51 ms per run)
llama_print_timings:       total time =  8789.19 ms

This PR:

llama_print_timings:        load time =  1051.30 ms
llama_print_timings:      sample time =    19.13 ms /    40 runs   (    0.48 ms per run)
llama_print_timings: prompt eval time =   912.66 ms /    16 tokens (   57.04 ms per token)
llama_print_timings:        eval time =  7778.33 ms /    39 runs   (  199.44 ms per run)
llama_print_timings:       total time =  9298.49 ms

@ivanstepanovftw (Collaborator)

13B q4_0, same comparison (original first, then this PR):

llama_print_timings:        load time =  1535.26 ms
llama_print_timings:      sample time =    21.43 ms /    40 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =  1780.11 ms /    16 tokens (  111.26 ms per token)
llama_print_timings:        eval time = 12984.40 ms /    39 runs   (  332.93 ms per run)
llama_print_timings:       total time = 15470.78 ms

llama_print_timings:        load time =  1563.62 ms
llama_print_timings:      sample time =    19.79 ms /    40 runs   (    0.49 ms per run)
llama_print_timings: prompt eval time =  1844.18 ms /    16 tokens (  115.26 ms per token)
llama_print_timings:        eval time = 14039.39 ms /    39 runs   (  359.98 ms per run)
llama_print_timings:       total time = 16580.04 ms

@diimdeep commented Apr 5, 2023

7B q4_0

1 core, --mlock -s 1 -n 32
Before:
llama_print_timings:      sample time =    44.60 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  2184.79 ms /     4 tokens (  546.20 ms per token)
llama_print_timings:        eval time = 22591.07 ms /    31 runs   (  728.74 ms per run)

After:
llama_print_timings:      sample time =    42.91 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  2153.29 ms /     4 tokens (  538.32 ms per token)
llama_print_timings:        eval time = 20680.65 ms /    31 runs   (  667.12 ms per run)

2 cores, --mlock -s 1 -n 32
Before:
llama_print_timings:      sample time =    42.92 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  1106.27 ms /     4 tokens (  276.57 ms per token)
llama_print_timings:        eval time = 11378.82 ms /    31 runs   (  367.06 ms per run)

After:
llama_print_timings:      sample time =    42.78 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  1084.36 ms /     4 tokens (  271.09 ms per token)
llama_print_timings:        eval time = 10726.84 ms /    31 runs   (  346.03 ms per run)

@@ -1975,6 +1975,10 @@ static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * rest

// This loop will be unrolled by the compiler
for (int u=0;u<UNROLL_COUNT;u++) {
    // Prefetch data used later in the loop
    // TODO: these numbers are device-dependent and shouldn't be hard-coded; derive them
    _mm_prefetch ( x[i+u].qs + 32*20, 1); // to-do: document what 32*20 even is
@sw (Contributor) commented Apr 6, 2023

Yes, what even is it? ;-) Is 20 intended to be == sizeof(block_q4_0)?

Or could you use CACHE_LINE_SIZE here somehow?

An interesting experiment would be to pad block_q4_0 by e.g. 4 bytes and see if the optimum prefetch offset moves. That would give a clue that the offset should be a certain multiple of sizeof(block_q4_0), or rather independent of it.
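For reference, the arithmetic behind both readings can be pinned down as a compile-time check (a sketch assuming the block_q4_0 layout of the time, a 4-byte scale plus 16 packed-nibble bytes, and a 64-byte cache line):

#include <stdint.h>

typedef struct { float d; uint8_t qs[16]; } block_q4_0; // layout at the time of this PR

// 32*20 bytes is simultaneously 32 whole blocks and 10 whole 64-byte cache
// lines, so timing alone can't tell which quantity the optimum tracks;
// the padding experiment above is what would disentangle the two.
_Static_assert(sizeof(block_q4_0) == 20, "4-byte scale + 16 packed nibbles");
_Static_assert(32 * 20 == 32 * sizeof(block_q4_0), "reading 1: 32 blocks ahead");
_Static_assert(32 * 20 == 10 * 64, "reading 2: 10 cache lines ahead");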

@ggerganov added the performance (Speed related topics) label Apr 10, 2023
@ggerganov (Member)

Fix hardcoded constants and reopen if there is confirmed improvement
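A hedged sketch of what that cleanup could look like, as a drop-in for the reviewed line above (the constant name below is made up, not from the PR):

// Name the magic numbers instead of hard-coding 32*20 and the raw hint value 1.
// The raw 1 is also compiler-dependent: it reportedly selects _MM_HINT_T2 under
// GCC/Clang but _MM_HINT_T0 under MSVC, one more reason to spell the hint out.
#define PREFETCH_BLOCKS_AHEAD 32 // TODO: derive per device instead of hard-coding

_mm_prefetch((const char *)(x + i + u + PREFETCH_BLOCKS_AHEAD), _MM_HINT_T0);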

@ggerganov closed this Apr 13, 2023