AVX2 ggml_vec_dot_q4_0 performance improvement ~5% #768

Conversation

@x02Sylvie commented Apr 4, 2023

Prefetches some data before its usage (based on the original work in #295).

Before (usually 245-260 ms): [screenshot: prefetch_wo]

After (usually 226-234 ms): [screenshot: prefetch_w]

Tested on Windows 10 + i7-10700K; needs further testing on different OSes and CPUs to make sure there are no unintentional issues, hence the draft status. Naming the constants appropriately is also still to do!
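In outline, the technique looks like this (a minimal, self-contained sketch rather than the exact diff; the UNROLL_COUNT value, the PREFETCH_BLOCKS_AHEAD name, and the T0 hint are illustrative assumptions, not taken from the PR):

#include <immintrin.h>
#include <stdint.h>

#define QK 32

// q4_0 block layout at the time of this PR: one fp32 scale plus 16 bytes of
// packed 4-bit quants, i.e. sizeof(block_q4_0) == 20.
typedef struct {
    float   d;          // scale
    uint8_t qs[QK / 2]; // 4-bit quants, two per byte
} block_q4_0;

#define UNROLL_COUNT          8  // illustrative; the real value lives in ggml.c
#define PREFETCH_BLOCKS_AHEAD 32 // illustrative name for the hard-coded 32 in 32*20

void dot_sketch(int nb, const block_q4_0 * restrict x) {
    for (int i = 0; i < nb; i += UNROLL_COUNT) {
        for (int u = 0; u < UNROLL_COUNT; u++) {
            // Hint the CPU to start pulling the block we'll touch ~32 iterations
            // from now into cache while the current block is being unpacked.
            // Prefetch is only a hint, so running past the end of x is harmless.
            _mm_prefetch((const char *)(x + i + u + PREFETCH_BLOCKS_AHEAD), _MM_HINT_T0);
            // ... unpack x[i+u].qs with AVX2, scale by x[i+u].d, accumulate ...
        }
    }
}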

Commit: Prefetch data used later in the loop
@rabidcopy (Contributor) commented Apr 4, 2023

Well, running master and this PR about 5 times in a row each, here's what I got. Ryzen 2600 on Linux.
./main -m ../alpaca-7b-native.bin -p "Building a website can be done in 10 simple steps:" -n 100 -t 6 --seed 1
Latest master:

llama_print_timings:        load time =  1526.48 ms
llama_print_timings:      sample time =    66.08 ms /   100 runs   (    0.66 ms per run)
llama_print_timings: prompt eval time =  1784.33 ms /    14 tokens (  127.45 ms per token)
llama_print_timings:        eval time = 20384.87 ms /    99 runs   (  205.91 ms per run)
llama_print_timings:       total time = 22740.42 ms

This PR:

llama_print_timings:        load time =  2450.54 ms
llama_print_timings:      sample time =    78.16 ms /   100 runs   (    0.78 ms per run)
llama_print_timings: prompt eval time =  1924.37 ms /    14 tokens (  137.46 ms per token)
llama_print_timings:        eval time = 19275.24 ms /    99 runs   (  194.70 ms per run)
llama_print_timings:       total time = 21805.14 ms

Edit: On a 13B model, the average is about 325 ms versus 315 ms.

@ivanstepanovftw (Collaborator) commented Apr 4, 2023

Same for me, 7B q4_0.
Fedora Linux, AMD Ryzen 7 5800U.

Original:
llama_print_timings:        load time =  1045.07 ms
llama_print_timings:      sample time =    21.53 ms /    40 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =   902.05 ms /    16 tokens (   56.38 ms per token)
llama_print_timings:        eval time =  7273.96 ms /    39 runs   (  186.51 ms per run)
llama_print_timings:       total time =  8789.19 ms

This PR:

llama_print_timings:        load time =  1051.30 ms
llama_print_timings:      sample time =    19.13 ms /    40 runs   (    0.48 ms per run)
llama_print_timings: prompt eval time =   912.66 ms /    16 tokens (   57.04 ms per token)
llama_print_timings:        eval time =  7778.33 ms /    39 runs   (  199.44 ms per run)
llama_print_timings:       total time =  9298.49 ms

@ivanstepanovftw (Collaborator)

13B q4_0, same comparison (original first, then this PR):

llama_print_timings:        load time =  1535.26 ms
llama_print_timings:      sample time =    21.43 ms /    40 runs   (    0.54 ms per run)
llama_print_timings: prompt eval time =  1780.11 ms /    16 tokens (  111.26 ms per token)
llama_print_timings:        eval time = 12984.40 ms /    39 runs   (  332.93 ms per run)
llama_print_timings:       total time = 15470.78 ms

llama_print_timings:        load time =  1563.62 ms
llama_print_timings:      sample time =    19.79 ms /    40 runs   (    0.49 ms per run)
llama_print_timings: prompt eval time =  1844.18 ms /    16 tokens (  115.26 ms per token)
llama_print_timings:        eval time = 14039.39 ms /    39 runs   (  359.98 ms per run)
llama_print_timings:       total time = 16580.04 ms

@diimdeep commented Apr 5, 2023

7B q4_0

1 core, --mlock -s 1 -n 32
Before:
llama_print_timings:      sample time =    44.60 ms /    32 runs   (    1.39 ms per run)
llama_print_timings: prompt eval time =  2184.79 ms /     4 tokens (  546.20 ms per token)
llama_print_timings:        eval time = 22591.07 ms /    31 runs   (  728.74 ms per run)

After:
llama_print_timings:      sample time =    42.91 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  2153.29 ms /     4 tokens (  538.32 ms per token)
llama_print_timings:        eval time = 20680.65 ms /    31 runs   (  667.12 ms per run)

2 cores, --mlock -s 1 -n 32
Before:
llama_print_timings:      sample time =    42.92 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  1106.27 ms /     4 tokens (  276.57 ms per token)
llama_print_timings:        eval time = 11378.82 ms /    31 runs   (  367.06 ms per run)

After:
llama_print_timings:      sample time =    42.78 ms /    32 runs   (    1.34 ms per run)
llama_print_timings: prompt eval time =  1084.36 ms /     4 tokens (  271.09 ms per token)
llama_print_timings:        eval time = 10726.84 ms /    31 runs   (  346.03 ms per run)

@@ -1975,6 +1975,10 @@ static void ggml_vec_dot_q4_0(const int n, float * restrict s, const void * rest

// This loop will be unrolled by the compiler
for (int u=0;u<UNROLL_COUNT;u++) {
    // Prefetch data used later in the loop
    // TODO: these numbers are device-dependent and shouldn't be hard-coded; derive them
    _mm_prefetch ( x[i+u].qs + 32*20, 1); // to-do: document what 32*20 even is
@sw (Contributor) commented Apr 6, 2023

Yes, what even is it? ;-) Is 20 intended to be == sizeof(block_q4_0)?

Or could you use CACHE_LINE_SIZE here somehow?

An interesting experiment would be to pad block_q4_0 by e.g. 4 bytes and see if the optimum prefetch offset moves. That would give a clue that the offset should be a certain multiple of sizeof(block_q4_0), or rather independent of it.
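For reference, the arithmetic behind both readings can be pinned down as a compile-time check (a sketch assuming the block_q4_0 layout of the time, a 4-byte scale plus 16 packed-nibble bytes, and a 64-byte cache line):

#include <stdint.h>

typedef struct { float d; uint8_t qs[16]; } block_q4_0; // layout at the time of this PR

// 32*20 bytes is simultaneously 32 whole blocks and 10 whole 64-byte cache
// lines, so timing alone can't tell which quantity the optimum tracks;
// the padding experiment above is what would disentangle the two.
_Static_assert(sizeof(block_q4_0) == 20, "4-byte scale + 16 packed nibbles");
_Static_assert(32 * 20 == 32 * sizeof(block_q4_0), "reading 1: 32 blocks ahead");
_Static_assert(32 * 20 == 10 * 64, "reading 2: 10 cache lines ahead");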

@ggerganov added the performance (Speed related topics) label Apr 10, 2023
@ggerganov (Member)

Fix hardcoded constants and reopen if there is confirmed improvement
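A hedged sketch of what that cleanup could look like, as a drop-in for the reviewed line above (the constant name below is made up, not from the PR):

// Name the magic numbers instead of hard-coding 32*20 and the raw hint value 1.
// The raw 1 is also compiler-dependent: it reportedly selects _MM_HINT_T2 under
// GCC/Clang but _MM_HINT_T0 under MSVC, one more reason to spell the hint out.
#define PREFETCH_BLOCKS_AHEAD 32 // TODO: derive per device instead of hard-coding

_mm_prefetch((const char *)(x + i + u + PREFETCH_BLOCKS_AHEAD), _MM_HINT_T0);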

@ggerganov closed this Apr 13, 2023