Generation using GPU offloading is much slower than without #1786

Closed
Barafu opened this issue Jun 10, 2023 · 4 comments

Barafu commented Jun 10, 2023

Recently, generating text with a large pre-existing context has become very slow when GPU offloading is used. I tried llama-cpp-python versions 0.1.61 and 0.1.57 and get the same behavior with both.

I am running an Oobabooga installation on Windows 11. The machine is an AMD 3950X with 32 GB RAM and a 3070 Ti (8 GB VRAM). When I try to extend an already long text with GPU offloading enabled, I get these numbers:

llama_print_timings:        load time = 36434.52 ms
llama_print_timings:      sample time =    40.21 ms /    65 runs   (    0.62 ms per token)
llama_print_timings: prompt eval time = 146712.89 ms /  1294 tokens (  113.38 ms per token)
llama_print_timings:        eval time = 15896.38 ms /    64 runs   (  248.38 ms per token)
llama_print_timings:       total time = 163419.18 ms
Output generated in 163.82 seconds (0.39 tokens/s, 64 tokens, context 1623, seed 1809972029)
Llama.generate: prefix-match hit

Note the prompt eval time: over two minutes. Without offloading I get:

llama_print_timings:        load time =  5647.99 ms
llama_print_timings:      sample time =    66.99 ms /   111 runs   (    0.60 ms per token)
llama_print_timings: prompt eval time = 10762.28 ms /  1245 tokens (    8.64 ms per token)
llama_print_timings:        eval time = 34006.30 ms /   110 runs   (  309.15 ms per token)
llama_print_timings:       total time = 46578.66 ms
Output generated in 46.96 seconds (2.34 tokens/s, 110 tokens, context 1576, seed 874103700)
Llama.generate: prefix-match hit

Both runs use the same model (13B, q5_1), the same prompt, and a similar context. The only difference is the argument --n-gpu-layers 26 in the first case and none in the second.

I did check the loading logs for lines like

llama_model_load_internal: offloading 16 layers to GPU
llama_model_load_internal: total VRAM used: 4143 MB

to confirm that it loads as I intended.
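
For reference, the same two configurations can be reproduced directly through llama-cpp-python, outside the web UI. The following is a minimal sketch, not taken from the issue: the model path and prompt are placeholders, and it assumes oobabooga's --n-gpu-layers flag maps straight onto the library's n_gpu_layers parameter.

# Minimal sketch: model path and prompt file are placeholders.
from llama_cpp import Llama

MODEL_PATH = "models/13B/model-q5_1.bin"    # placeholder path
PROMPT = open("long_context.txt").read()    # placeholder ~1300-token prompt

# Offloaded run: the equivalent of passing --n-gpu-layers 26.
llm_gpu = Llama(model_path=MODEL_PATH, n_gpu_layers=26, verbose=True)
llm_gpu(PROMPT, max_tokens=64)

# CPU-only run: no layers offloaded.
llm_cpu = Llama(model_path=MODEL_PATH, n_gpu_layers=0, verbose=True)
llm_cpu(PROMPT, max_tokens=64)

With verbose=True the llama_model_load_internal and llama_print_timings lines are written to stderr, so the offloaded layer count, VRAM usage, and prompt eval time can be compared between the two runs.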

@RiyanParvez

Same here, it's much, much slower with GPU offloading: in my case it's close to 80 ms per token without offloading, but with offloading it's around 700 ms per token...

@RiyanParvez

And it takes more time to load the model too.

TonyWeimmer40 commented Jun 10, 2023

I'm experiencing the exact same issue with the Debug cuBLAS build. For context, the Release build gives buggy output (#1735).

The buggy Release build with the gibberish output is far faster for me, but the properly working Debug build is extremely slow, as observed here.

UPDATE: Using the latest build made both of these issues seemingly go away in the Release configuration; Debug is still slow, and I don't know why. In any case, it is far faster now. If anyone else is having trouble, I recommend building Release x64 from source at the latest version (as of now, 74a69d2).
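
For anyone following that advice, the configure-and-build steps can be scripted roughly as below. This is only a sketch under stated assumptions: it presumes CMake and the CUDA toolkit are installed and that the checked-out revision still uses the LLAMA_CUBLAS option for cuBLAS support; none of it comes from this thread.

# Sketch only: assumes CMake plus the CUDA toolkit are installed and that the
# checkout still exposes the LLAMA_CUBLAS option for cuBLAS support.
import subprocess

# Configure an out-of-tree Release build with cuBLAS enabled.
subprocess.run(
    ["cmake", "-B", "build", "-DLLAMA_CUBLAS=ON", "-DCMAKE_BUILD_TYPE=Release"],
    check=True,
)

# Build the Release configuration; --config matters for multi-config generators
# such as Visual Studio on Windows, which default to x64 on a 64-bit toolchain.
subprocess.run(["cmake", "--build", "build", "--config", "Release"], check=True)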

github-actions bot added the stale label on Mar 25, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
