
optimize rope function to avoid calling powf in the loop #807

Merged · 1 commit · Apr 14, 2023

Conversation

howard0su
Collaborator

This slightly improves performance by avoiding the powf call inside the rope loop.

Before:
llama_print_timings:        load time =  6973.64 ms
llama_print_timings:      sample time =   345.88 ms /   128 runs   (    2.70 ms per run)
llama_print_timings: prompt eval time =  6227.73 ms /    29 tokens (  214.75 ms per token)
llama_print_timings:        eval time = 34661.36 ms /   127 runs   (  272.92 ms per run)
llama_print_timings:       total time = 41992.88 ms

After:
llama_print_timings:        load time =  7668.98 ms
llama_print_timings:      sample time =   345.03 ms /   128 runs   (    2.70 ms per run)
llama_print_timings: prompt eval time =  6906.97 ms /    29 tokens (  238.17 ms per token)
llama_print_timings:        eval time = 33445.33 ms /   127 runs   (  263.35 ms per run)
llama_print_timings:       total time = 41471.40 ms
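
For context, the change avoids calling powf on every iteration of the rope loop by computing the angle incrementally instead. A minimal sketch of the idea (function and variable names here are illustrative, not the exact ggml.c code):

```c
#include <math.h>

// Original pattern: one powf call per rotary pair.
void rope_angles_before(float *theta, int n_pairs, float p, float base) {
    for (int i = 0; i < n_pairs; ++i) {
        theta[i] = p * powf(base, -(float)i / n_pairs);  // powf on every iteration
    }
}

// Optimized pattern: a single powf up front, then an incremental update,
// since base^(-(i+1)/n) == base^(-i/n) * base^(-1/n).
void rope_angles_after(float *theta, int n_pairs, float p, float base) {
    const float theta_scale = powf(base, -1.0f / n_pairs);  // hoisted out of the loop
    float t = p;
    for (int i = 0; i < n_pairs; ++i) {
        theta[i] = t;      // same sequence of values as the powf version
        t *= theta_scale;  // a single multiply replaces powf inside the loop
    }
}
```

A multiply is much cheaper than a transcendental powf call, which is where the small eval-time win comes from.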

@j-f1
Collaborator

j-f1 commented Apr 6, 2023

Can you re-run the tests a few times (say, a total of 3–5 runs for each) and confirm that the improvement is not just a result of random noise in the amount of time the process takes to complete?

@howard0su
Collaborator Author

Yes, I actually did that already; the above is one sample:

Before:

llama_print_timings:        load time =  6508.98 ms
llama_print_timings:      sample time =    48.19 ms /    18 runs   (    2.68 ms per run)
llama_print_timings: prompt eval time =  5758.73 ms /    26 tokens (  221.49 ms per token)
llama_print_timings:        eval time =  4487.70 ms /    17 runs   (  263.98 ms per run)
llama_print_timings:       total time = 11046.59 ms

llama_print_timings:        load time =  6430.18 ms
llama_print_timings:      sample time =   306.51 ms /   115 runs   (    2.67 ms per run)
llama_print_timings: prompt eval time =  5708.87 ms /    26 tokens (  219.57 ms per token)
llama_print_timings:        eval time = 30756.65 ms /   114 runs   (  269.80 ms per run)
llama_print_timings:       total time = 37504.31 ms

llama_print_timings:        load time =  6422.55 ms
llama_print_timings:      sample time =   346.25 ms /   128 runs   (    2.71 ms per run)
llama_print_timings: prompt eval time =  5664.71 ms /    26 tokens (  217.87 ms per token)
llama_print_timings:        eval time = 33817.97 ms /   127 runs   (  266.28 ms per run)
llama_print_timings:       total time = 40599.00 ms

After:

llama_print_timings:        load time =  6399.90 ms
llama_print_timings:      sample time =   342.14 ms /   128 runs   (    2.67 ms per run)
llama_print_timings: prompt eval time =  5686.61 ms /    26 tokens (  218.72 ms per token)
llama_print_timings:        eval time = 33954.48 ms /   127 runs   (  267.36 ms per run)
llama_print_timings:       total time = 40708.52 ms

llama_print_timings:        load time =  6352.47 ms
llama_print_timings:      sample time =   342.94 ms /   128 runs   (    2.68 ms per run)
llama_print_timings: prompt eval time =  5649.50 ms /    26 tokens (  217.29 ms per token)
llama_print_timings:        eval time = 33675.62 ms /   127 runs   (  265.16 ms per run)
llama_print_timings:       total time = 40383.40 ms

llama_print_timings:        load time =  6454.62 ms
llama_print_timings:      sample time =   106.15 ms /    40 runs   (    2.65 ms per run)
llama_print_timings: prompt eval time =  5668.35 ms /    26 tokens (  218.01 ms per token)
llama_print_timings:        eval time = 10139.37 ms /    39 runs   (  259.98 ms per run)
llama_print_timings:       total time = 16703.99 ms

@KASR
Contributor

KASR commented Apr 6, 2023

I've also given it a try. I used Python to loop over the configurations, using 8, 16, and 24 threads. Each test was evaluated 6 times with n=64. However, there were some things running in the background, so the timings (see attachment) have a bit of variation.

I can set the loop to a larger n and run each configuration more times (e.g. 10); let me know if this would be interesting.

[plot 1: benchmark timings for the before/after builds across thread counts]

rope_test.txt

@howard0su
Collaborator Author

If you can share your Python script, that would be great. It would be a good tool to help with performance experiments.

@KASR
Contributor

KASR commented Apr 6, 2023

> If you can share your Python script, that would be great. It would be a good tool to help with performance experiments.

Sure; as a warning, it's just a simple loop 😅. You can find the code here --> benchmark_pr_llama.py

@ggerganov
Member

Are we confident that this iterative multiplication is numerically the same as powf, or at least close enough to not cause trouble?

@howard0su
Collaborator Author

Yes, this is not an approximation; it is just a math identity.
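
To spell out the identity (a sketch; $b$ stands for the frequency base and $d$ for the number of rotary dimensions, symbols chosen here only for illustration):

$$\theta_i = p \cdot b^{-i/d} = \theta_{i-1} \cdot b^{-1/d}, \qquad \theta_0 = p,$$

so repeatedly multiplying by theta_scale = $b^{-1/d}$ generates exactly the same sequence of angles that powf would compute for each i; any difference could only come from floating-point rounding.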
