Generation using GPU offloading is much slower than without #1786

Closed
Barafu opened this issue Jun 10, 2023 · 4 comments

Barafu commented Jun 10, 2023

Recently, generating text with a large pre-existing context has become very slow when GPU offloading is used. I tried llama-cpp-python versions 0.1.61 and 0.1.57 and get the same behavior with both.

I am running an Oobabooga installation on Windows 11. The machine is an AMD 3950X with 32 GB RAM and a 3070 Ti (8 GB VRAM). When I try to extend an already long text with GPU offloading enabled, I get these numbers:

llama_print_timings:        load time = 36434.52 ms
llama_print_timings:      sample time =    40.21 ms /    65 runs   (    0.62 ms per token)
llama_print_timings: prompt eval time = 146712.89 ms /  1294 tokens (  113.38 ms per token)
llama_print_timings:        eval time = 15896.38 ms /    64 runs   (  248.38 ms per token)
llama_print_timings:       total time = 163419.18 ms
Output generated in 163.82 seconds (0.39 tokens/s, 64 tokens, context 1623, seed 1809972029)
Llama.generate: prefix-match hit

Note the prompt eval time: over two minutes. Without offloading I get:

llama_print_timings:        load time =  5647.99 ms
llama_print_timings:      sample time =    66.99 ms /   111 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time = 10762.28 ms /  1245 tokens (    8.64 ms per token)
llama_print_timings:        eval time = 34006.30 ms /   110 runs   (  309.15 ms per token)
llama_print_timings:       total time = 46578.66 ms
Output generated in 46.96 seconds (2.34 tokens/s, 110 tokens, context 1576, seed 874103700)
Llama.generate: prefix-match hit

Both runs use the same model (13B, q5_1), the same prompt, and a similar context. The only difference is the argument --n-gpu-layers 26 in the first case and none in the second.

I did check the loading logs for lines like

llama_model_load_internal: offloading 16 layers to GPU
llama_model_load_internal: total VRAM used: 4143 MB

to confirm that it loads as I intended.
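
For reference, the same two configurations can be reproduced directly through llama-cpp-python, outside the web UI. The following is a minimal sketch, not taken from the issue: the model path and prompt are placeholders, and it assumes oobabooga's --n-gpu-layers flag maps straight onto the library's n_gpu_layers parameter.

# Minimal sketch: model path and prompt file are placeholders.
from llama_cpp import Llama

MODEL_PATH = "models/13B/model-q5_1.bin"    # placeholder path
PROMPT = open("long_context.txt").read()    # placeholder ~1300-token prompt

# Offloaded run: the equivalent of passing --n-gpu-layers 26.
llm_gpu = Llama(model_path=MODEL_PATH, n_gpu_layers=26, verbose=True)
llm_gpu(PROMPT, max_tokens=64)

# CPU-only run: no layers offloaded.
llm_cpu = Llama(model_path=MODEL_PATH, n_gpu_layers=0, verbose=True)
llm_cpu(PROMPT, max_tokens=64)

With verbose=True the llama_model_load_internal and llama_print_timings lines are written to stderr, so the offloaded layer count, VRAM usage, and prompt eval time can be compared between the two runs.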

@RiyanParvez

Same here, it's much, much slower with GPU offloading: in my case it's close to 80 ms per token without offloading, but with offloading it's around 700 ms per token...

@RiyanParvez

And it takes more time to load the model too.

TonyWeimmer40 commented Jun 10, 2023

I'm experiencing the exact same issue with the Debug cuBLAS build. For context, the Release build gives buggy output (#1735).

The buggy Release build with the gibberish output is far faster for me, but the properly working Debug build is extremely slow, as observed here.

UPDATE: Using the latest build made both of these issues seemingly go away in the Release configuration; Debug is still slow, and I don't know why. In any case, it is far faster now. If anyone else is having trouble, I recommend building Release x64 from source at the latest version (as of now, 74a69d2).
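
For anyone following that advice, the configure-and-build steps can be scripted roughly as below. This is only a sketch under stated assumptions: it presumes CMake and the CUDA toolkit are installed and that the checked-out revision still uses the LLAMA_CUBLAS option for cuBLAS support; none of it comes from this thread.

# Sketch only: assumes CMake plus the CUDA toolkit are installed and that the
# checkout still exposes the LLAMA_CUBLAS option for cuBLAS support.
import subprocess

# Configure an out-of-tree Release build with cuBLAS enabled.
subprocess.run(
    ["cmake", "-B", "build", "-DLLAMA_CUBLAS=ON", "-DCMAKE_BUILD_TYPE=Release"],
    check=True,
)

# Build the Release configuration; --config matters for multi-config generators
# such as Visual Studio on Windows, which default to x64 on a 64-bit toolchain.
subprocess.run(["cmake", "--build", "build", "--config", "Release"], check=True)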

github-actions bot added the stale label on Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
