
Speed differences between vicuna models and the others. #968

Closed
wro52 opened this issue Apr 14, 2023 · 5 comments

wro52 commented Apr 14, 2023

When running various 7B models (Win10, Core i5, GCC 64-bit, 8 GB RAM, 4 threads) with the same program (with little difference between the recent revisions), I found ggml-vicuna-7b-4bit-rev1.bin and ggml-vicuna-7b-4bit.bin much faster.
The other models take around 5 seconds per token, whereas the vicunas generate 3 tokens per second.
Llama and alpaca models are not very different from each other (5 to 6 sec/token).
I wonder what the reason might be.

@prusnak changed the title from "[wro52] Speed differences between vicuna models and the others." to "Speed differences between vicuna models and the others." on Apr 14, 2023
@MillionthOdin16

Can you include some of the timing info output?

wro52 (Author) commented Apr 15, 2023

The mentioned speed refers to the system's answers, measured simply by counting the responded tokens per second. Do you refer to a different measurement? If so, could you please supply a little more information?
One more point: the answers are comparable between the various models.
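
A minimal sketch of this kind of outside-the-program count, assuming a 2023-era llama.cpp `main` binary; the model path, prompt, and generation length are illustrative, and whitespace-split words only approximate real tokens:

```python
# Minimal sketch: time a llama.cpp run externally and estimate speed.
# Model path and prompt are hypothetical; word count is only a rough
# proxy for the model's token count.
import subprocess
import time

cmd = [
    "./main",                                # llama.cpp binary (2023-era)
    "-m", "models/ggml-vicuna-7b-4bit.bin",  # hypothetical model path
    "-p", "Tell me about alpacas.",          # prompt
    "-n", "64",                              # tokens to generate
    "-t", "4",                               # 4 threads, as in the report
]

start = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start

n_words = len(result.stdout.split())  # crude stand-in for tokens
print(f"{n_words} words in {elapsed:.1f} s -> {n_words / elapsed:.2f} words/s")
```

Note that a wall-clock measurement like this includes model load time; the timing summary llama.cpp itself prints at the end of a run separates load, prompt evaluation, and generation, which is presumably the "timing info output" asked about above.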

@jon-chuang (Contributor)

I also notice this. It's ~270 ms/token for vicuna-13B and ~240 ms/token for llama-13B on my system. I'll look into it; I suspect the model graph itself has changed, perhaps due to the way they fine-tune - there may be more layers somewhere.
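
One way to test the extra-layers hypothesis directly is to read the hyperparameters out of the two model files. A sketch, assuming the 2023-era GGML/GGMF/GGJT file layout llama.cpp used at the time (a uint32 magic, an optional version word, then seven little-endian int32 hyperparameters); the path is hypothetical:

```python
# Sketch: read hyperparameters from an old-style llama.cpp model file.
# Assumes the 2023-era layout: uint32 magic, optional uint32 version
# (ggmf/ggjt only), then seven little-endian int32 hyperparameters.
import struct

MAGIC_GGML = 0x67676D6C  # 'ggml', unversioned
MAGIC_GGMF = 0x67676D66  # 'ggmf', versioned
MAGIC_GGJT = 0x67676A74  # 'ggjt', versioned, mmap-friendly

def read_hparams(path):
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic in (MAGIC_GGMF, MAGIC_GGJT):
            f.read(4)  # skip the file-format version word
        elif magic != MAGIC_GGML:
            raise ValueError(f"{path}: unrecognized magic {magic:#010x}")
        names = ("n_vocab", "n_embd", "n_mult", "n_head",
                 "n_layer", "n_rot", "ftype")
        return dict(zip(names, struct.unpack("<7i", f.read(28))))

print(read_hparams("models/ggml-vicuna-13b-4bit.bin"))  # hypothetical path
```

If n_layer is the same for both files (40 is the standard value for 13B LLaMA models), extra layers are ruled out, leaving the ftype field as a likely suspect.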

wro52 (Author) commented Apr 20, 2023

It seems that it depends on the quantization used. Every model using Q4_0 is faster than the models not using Q4_0.
I tested this with vicuna, alpaca, llama, and various other models, and it holds.
It is indicated by ftype = 2 (mostly Q4_0) in the latest program versions.
So I guess Q4_0 is ideal for running models on average office computers without sophisticated hardware.
Which also might be of importance for schools and the average poor student...
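
For reference, a hedged mapping of the low ftype values as they stood in spring-2023 llama.cpp (the llama_ftype enum; later releases appended more quantization types). This reuses read_hparams from the sketch above, with a hypothetical path:

```python
# Hedged mapping of ftype values as of spring-2023 llama.cpp;
# later releases added further quantization types beyond these.
FTYPE_NAMES = {
    0: "all F32",
    1: "mostly F16",
    2: "mostly Q4_0",
    3: "mostly Q4_1",
}

hp = read_hparams("models/ggml-alpaca-7b-q4.bin")  # hypothetical path
label = FTYPE_NAMES.get(hp["ftype"], f"unknown ({hp['ftype']})")
print(f"n_layer={hp['n_layer']}  ftype={hp['ftype']} ({label})")
```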

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
