
optimize rope function to avoid calling powf in the loop #807

Merged · 1 commit · Apr 14, 2023

Conversation

howard0su
Collaborator

This slightly improves performance by avoiding the powf call inside the rope loop.

Before:
llama_print_timings:        load time =  6973.64 ms
llama_print_timings:      sample time =   345.88 ms /   128 runs   (    2.70 ms per run)
llama_print_timings: prompt eval time =  6227.73 ms /    29 tokens (  214.75 ms per token)
llama_print_timings:        eval time = 34661.36 ms /   127 runs   (  272.92 ms per run)
llama_print_timings:       total time = 41992.88 ms

After:
llama_print_timings:        load time =  7668.98 ms
llama_print_timings:      sample time =   345.03 ms /   128 runs   (    2.70 ms per run)
llama_print_timings: prompt eval time =  6906.97 ms /    29 tokens (  238.17 ms per token)
llama_print_timings:        eval time = 33445.33 ms /   127 runs   (  263.35 ms per run)
llama_print_timings:       total time = 41471.40 ms
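
For context, the change avoids calling powf on every iteration of the rope loop by computing the angle incrementally instead. A minimal sketch of the idea (function and variable names here are illustrative, not the exact ggml.c code):

```c
#include <math.h>

// Original pattern: one powf call per rotary pair.
void rope_angles_before(float *theta, int n_pairs, float p, float base) {
    for (int i = 0; i < n_pairs; ++i) {
        theta[i] = p * powf(base, -(float)i / n_pairs);  // powf on every iteration
    }
}

// Optimized pattern: a single powf up front, then an incremental update,
// since base^(-(i+1)/n) == base^(-i/n) * base^(-1/n).
void rope_angles_after(float *theta, int n_pairs, float p, float base) {
    const float theta_scale = powf(base, -1.0f / n_pairs);  // hoisted out of the loop
    float t = p;
    for (int i = 0; i < n_pairs; ++i) {
        theta[i] = t;      // same sequence of values as the powf version
        t *= theta_scale;  // a single multiply replaces powf inside the loop
    }
}
```

A multiply is much cheaper than a transcendental powf call, which is where the small eval-time win comes from.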

@j-f1
Collaborator

j-f1 commented Apr 6, 2023

Can you re-run the tests a few times (say, a total of 3–5 runs for each) and confirm that the improvement is not just a result of random noise in the amount of time the process takes to complete?

@howard0su
Collaborator Author

Yes, I actually did that already; the above is one sample:

Before:

llama_print_timings:        load time =  6508.98 ms
llama_print_timings:      sample time =    48.19 ms /    18 runs   (    2.68 ms per run)
llama_print_timings: prompt eval time =  5758.73 ms /    26 tokens (  221.49 ms per token)
llama_print_timings:        eval time =  4487.70 ms /    17 runs   (  263.98 ms per run)
llama_print_timings:       total time = 11046.59 ms

llama_print_timings:        load time =  6430.18 ms
llama_print_timings:      sample time =   306.51 ms /   115 runs   (    2.67 ms per run)
llama_print_timings: prompt eval time =  5708.87 ms /    26 tokens (  219.57 ms per token)
llama_print_timings:        eval time = 30756.65 ms /   114 runs   (  269.80 ms per run)
llama_print_timings:       total time = 37504.31 ms

llama_print_timings:        load time =  6422.55 ms
llama_print_timings:      sample time =   346.25 ms /   128 runs   (    2.71 ms per run)
llama_print_timings: prompt eval time =  5664.71 ms /    26 tokens (  217.87 ms per token)
llama_print_timings:        eval time = 33817.97 ms /   127 runs   (  266.28 ms per run)
llama_print_timings:       total time = 40599.00 ms

After:

llama_print_timings:        load time =  6399.90 ms
llama_print_timings:      sample time =   342.14 ms /   128 runs   (    2.67 ms per run)
llama_print_timings: prompt eval time =  5686.61 ms /    26 tokens (  218.72 ms per token)
llama_print_timings:        eval time = 33954.48 ms /   127 runs   (  267.36 ms per run)
llama_print_timings:       total time = 40708.52 ms

llama_print_timings:        load time =  6352.47 ms
llama_print_timings:      sample time =   342.94 ms /   128 runs   (    2.68 ms per run)
llama_print_timings: prompt eval time =  5649.50 ms /    26 tokens (  217.29 ms per token)
llama_print_timings:        eval time = 33675.62 ms /   127 runs   (  265.16 ms per run)
llama_print_timings:       total time = 40383.40 ms

llama_print_timings:        load time =  6454.62 ms
llama_print_timings:      sample time =   106.15 ms /    40 runs   (    2.65 ms per run)
llama_print_timings: prompt eval time =  5668.35 ms /    26 tokens (  218.01 ms per token)
llama_print_timings:        eval time = 10139.37 ms /    39 runs   (  259.98 ms per run)
llama_print_timings:       total time = 16703.99 ms

@KASR
Contributor

KASR commented Apr 6, 2023

I've also given it a try. I used Python to loop over the configurations, using 8, 16, and 24 threads. Each test was evaluated 6 times with n=64. However, there were some things running in the background, so the timings (see attachment) have a bit of variation.

I can set the loop to a larger n and run each configuration more times (e.g. 10); let me know if this would be interesting.

[plot 1: benchmark timings for the before/after builds across thread counts]

rope_test.txt

@howard0su
Collaborator Author

If you can share your Python script, that would be great. It would be a good tool to help with performance experiments.

@KASR
Contributor

KASR commented Apr 6, 2023

> If you can share your Python script, that would be great. It would be a good tool to help with performance experiments.

Sure; as a warning, it's just a simple loop 😅. You can find the code here --> benchmark_pr_llama.py

@ggerganov
Member

Are we confident that this iterative multiplication is numerically the same as powf, or at least close enough to not cause trouble?

@howard0su
Collaborator Author

Yes, this is not an approximation; it is just a math identity.
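
To spell out the identity (a sketch; $b$ stands for the frequency base and $d$ for the number of rotary dimensions, symbols chosen here only for illustration):

$$\theta_i = p \cdot b^{-i/d} = \theta_{i-1} \cdot b^{-1/d}, \qquad \theta_0 = p,$$

so repeatedly multiplying by theta_scale = $b^{-1/d}$ generates exactly the same sequence of angles that powf would compute for each i; any difference could only come from floating-point rounding.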
