Description
CUDA GPU inference is slower with the latest build (1449) than with build 1336:
Build 1449:
llama_print_timings: load time = 3111.53 ms
llama_print_timings: sample time = 99.26 ms / 617 runs ( 0.16 ms per token, 6215.81 tokens per second)
llama_print_timings: prompt eval time = 73.73 ms / 22 tokens ( 3.35 ms per token, 298.37 tokens per second)
llama_print_timings: eval time = 11428.62 ms / 616 runs ( 18.55 ms per token, 53.90 tokens per second)
llama_print_timings: total time = 11679.26 ms
Build 1336:
llama_print_timings: load time = 3150.73 ms
llama_print_timings: sample time = 149.46 ms / 623 runs ( 0.24 ms per token, 4168.31 tokens per second)
llama_print_timings: prompt eval time = 115.86 ms / 23 tokens ( 5.04 ms per token, 198.52 tokens per second)
llama_print_timings: eval time = 9558.68 ms / 622 runs ( 15.37 ms per token, 65.07 tokens per second)
llama_print_timings: total time = 10518.21 ms
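To make the regression concrete, here is a minimal sketch that recomputes the eval throughput from the numbers reported above (total eval time in ms and number of runs, as printed by llama_print_timings) and derives the relative slowdown. The helper name `tokens_per_second` is just for illustration, not part of llama.cpp.

```python
def tokens_per_second(eval_ms: float, runs: int) -> float:
    """Throughput in tokens/s from total eval time (ms) and token count."""
    return runs / (eval_ms / 1000.0)

# Numbers taken directly from the eval lines in the logs above.
tps_1449 = tokens_per_second(11428.62, 616)  # build 1449
tps_1336 = tokens_per_second(9558.68, 622)   # build 1336

print(f"1449: {tps_1449:.2f} t/s")
print(f"1336: {tps_1336:.2f} t/s")
print(f"slowdown: {100 * (1 - tps_1449 / tps_1336):.1f}%")
```

This reproduces the reported 53.90 vs 65.07 tokens per second, i.e. roughly a 17% eval-throughput drop between the two builds.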