Recently, generating text with a large preexisting context has become very slow when using GPU offloading. I tried llama-cpp-python versions 0.1.61 and 0.1.57 and get the same behavior with both.
I am running an Oobabooga installation on Windows 11. The machine is an AMD 3950X with 32 GB RAM and a 3070 Ti (8 GB VRAM). When I try to extend an already long text with GPU offloading enabled, I get these numbers:
llama_print_timings: load time = 36434.52 ms
llama_print_timings: sample time = 40.21 ms / 65 runs ( 0.62 ms per token)
llama_print_timings: prompt eval time = 146712.89 ms / 1294 tokens ( 113.38 ms per token)
llama_print_timings: eval time = 15896.38 ms / 64 runs ( 248.38 ms per token)
llama_print_timings: total time = 163419.18 ms
Output generated in 163.82 seconds (0.39 tokens/s, 64 tokens, context 1623, seed 1809972029)
Llama.generate: prefix-match hit
Note the prompt eval time: nearly two and a half minutes. Without the offloading I get:
llama_print_timings: load time = 5647.99 ms
llama_print_timings: sample time = 66.99 ms / 111 runs ( 0.60 ms per token)
llama_print_timings: prompt eval time = 10762.28 ms / 1245 tokens ( 8.64 ms per token)
llama_print_timings: eval time = 34006.30 ms / 110 runs ( 309.15 ms per token)
llama_print_timings: total time = 46578.66 ms
Output generated in 46.96 seconds (2.34 tokens/s, 110 tokens, context 1576, seed 874103700)
Llama.generate: prefix-match hit
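To make the comparison concrete, here are the per-stage throughputs implied by the two sets of timings above (a quick back-of-the-envelope check, using only the numbers from the logs):

```python
# Throughputs derived from the two llama_print_timings blocks above.
gpu_prompt_tps = 1294 / 146.71   # ~8.8 prompt tokens/s with --n-gpu-layers 26
cpu_prompt_tps = 1245 / 10.76    # ~115.7 prompt tokens/s without offloading
gpu_gen_tps    = 64 / 15.90      # ~4.0 generated tokens/s with --n-gpu-layers 26
cpu_gen_tps    = 110 / 34.01     # ~3.2 generated tokens/s without offloading
print(gpu_prompt_tps, cpu_prompt_tps, gpu_gen_tps, cpu_gen_tps)
```

So the offloaded run is roughly 13x slower at prompt evaluation, even though its per-token generation is slightly faster.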
Those runs use the same model (13B q5_1), the same prompt, and a similar context. The only difference is the argument --n-gpu-layers 26 in the first case and none in the second.
I did check the loading logs for lines like
llama_model_load_internal: offloading 16 layers to GPU
llama_model_load_internal: total VRAM used: 4143 MB
to confirm that it loads as I intended.
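For reference, this is roughly what the two configurations look like when the model is loaded through llama-cpp-python directly (a minimal sketch only; the model path and n_ctx value here are placeholders, not my actual settings):

```python
from llama_cpp import Llama

# With offloading: 26 layers on the GPU (the slow prompt eval case above)
llm_offload = Llama(model_path="./models/13b-q5_1.bin", n_ctx=2048, n_gpu_layers=26)

# Without offloading: everything stays on the CPU (the fast prompt eval case)
llm_cpu = Llama(model_path="./models/13b-q5_1.bin", n_ctx=2048)
```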
I'm experiencing the exact same issue with the Debug CUBLAS build. For context, the Release build gives buggy output (#1735).
The buggy Release build with the gibberish output is far faster for me, but the properly working Debug build is extremely slow, as observed here.
UPDATE: Using the latest build made both of these issues seemingly go away in the Release version; the Debug build is still slow, and I don't know why. In any case it is far faster now. If anyone else is having trouble, I recommend building Release x64 from source on the latest version (as of now, 74a69d2).