Speed differences between vicuna models and the others. #968
Comments
Can you include some of the timing info output?
The speed I mentioned is simply the number of generated tokens per second in the system's answers. Are you referring to a different measure of speed? If so, could you please provide a little more information?
I also notice this. It's ~270 ms/token for vicuna-13B and ~240 ms/token for llama-13B on my system. I'll look into it; I suspect the model graph itself has changed, perhaps due to the way they fine-tune - there may be more layers somewhere.
It seems that it depends on the quantisation used: every model using Q4_0 is faster than the models that do not use Q4_0.
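One plausible reason is memory bandwidth: token generation streams essentially all of the weights once per token, so a format that stores fewer bytes per weight tends to run proportionally faster. The sketch below is only a rough estimate and assumes the early ggml block layouts (Q4_0: 32 weights sharing one float32 scale; Q4_1: one additional float32 minimum per block); the exact structs may differ between llama.cpp revisions.

```cpp
// Rough bytes-per-weight comparison for different weight formats.
// Assumed block layouts (may not match every llama.cpp revision):
//   Q4_0: 32 weights -> 1 float32 scale + 16 bytes of 4-bit nibbles = 20 bytes
//   Q4_1: 32 weights -> 2 float32 (scale + min) + 16 bytes of nibbles = 24 bytes
//   F16 : 2 bytes per weight
#include <cstdio>

int main() {
    const double q4_0 = (4.0 + 16.0) / 32.0;       // 0.625 bytes/weight
    const double q4_1 = (4.0 + 4.0 + 16.0) / 32.0; // 0.75  bytes/weight
    const double f16  = 2.0;                       // 2.0   bytes/weight

    // If generation is memory-bandwidth bound, the time per token scales
    // roughly with the number of bytes streamed per weight.
    std::printf("Q4_0: %.3f bytes/weight (baseline)\n", q4_0);
    std::printf("Q4_1: %.3f bytes/weight (~%.2fx the traffic of Q4_0)\n", q4_1, q4_1 / q4_0);
    std::printf("F16 : %.3f bytes/weight (~%.2fx the traffic of Q4_0)\n", f16, f16 / q4_0);
    return 0;
}
```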
This issue was closed because it has been inactive for 14 days since being marked as stale.
When running various 7B models (Win10, Core i5, GCC 64-bit, 8 GB RAM, 4 threads) with the same program (which behaves much the same across the recent revisions), I found ggml-vicuna-7b-4bit-rev1.bin and ggml-vicuna-7b-4bit.bin to be much faster.
The other models take around 5 seconds per token, whereas the vicuna models generate about 3 tokens per second.
The llama and alpaca models are not very different from each other (5 to 6 seconds/token).
I wonder what the reason might be.
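To put the reported figures on a common scale, here is a minimal conversion using only the numbers quoted in this issue:

```cpp
// Convert the speeds reported above to a common unit (tokens/second).
#include <cstdio>

int main() {
    const double vicuna_tok_per_s = 3.0;  // reported: ~3 tokens per second
    const double other_s_per_tok  = 5.0;  // reported: ~5 seconds per token

    std::printf("vicuna 7B: %.2f s/token  (%.1f tokens/s)\n",
                1.0 / vicuna_tok_per_s, vicuna_tok_per_s);
    std::printf("other 7B : %.2f s/token  (%.1f tokens/s)\n",
                other_s_per_tok, 1.0 / other_s_per_tok);
    // Roughly a 15x gap on this machine, much larger than the
    // ~270 ms vs ~240 ms per token difference reported above for the 13B models.
    return 0;
}
```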