
Speed differences between vicuna models and the others. #968

Closed
wro52 opened this issue Apr 14, 2023 · 5 comments

wro52 commented Apr 14, 2023

When running various 7B models (Win10, Core i5, GCC 64-bit, 8 GB RAM, 4 threads) with the same program (with little difference between the recent revisions), I found ggml-vicuna-7b-4bit-rev1.bin and ggml-vicuna-7b-4bit.bin much faster.
The other models take around 5 seconds per token, whereas the vicunas generate 3 tokens per second.
Llama and alpaca models are not very different from each other (5 to 6 sec/token).
I wonder what the reason might be.

@prusnak changed the title from "[wro52] Speed differences between vicuna models and the others." to "Speed differences between vicuna models and the others." on Apr 14, 2023
@MillionthOdin16

Can you include some of the timing info output?

wro52 (Author) commented Apr 15, 2023

The mentioned speed refers to the system's answers, measured simply by counting the responded tokens per second. Do you refer to a different measurement? If so, could you please supply a little more information?
One more point: the answers are comparable between the various models.
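
A minimal sketch of this kind of outside-the-program count, assuming a 2023-era llama.cpp `main` binary; the model path, prompt, and generation length are illustrative, and whitespace-split words only approximate real tokens:

```python
# Minimal sketch: time a llama.cpp run externally and estimate speed.
# Model path and prompt are hypothetical; word count is only a rough
# proxy for the model's token count.
import subprocess
import time

cmd = [
    "./main",                                # llama.cpp binary (2023-era)
    "-m", "models/ggml-vicuna-7b-4bit.bin",  # hypothetical model path
    "-p", "Tell me about alpacas.",          # prompt
    "-n", "64",                              # tokens to generate
    "-t", "4",                               # 4 threads, as in the report
]

start = time.time()
result = subprocess.run(cmd, capture_output=True, text=True)
elapsed = time.time() - start

n_words = len(result.stdout.split())  # crude stand-in for tokens
print(f"{n_words} words in {elapsed:.1f} s -> {n_words / elapsed:.2f} words/s")
```

Note that a wall-clock measurement like this includes model load time; the timing summary llama.cpp itself prints at the end of a run separates load, prompt evaluation, and generation, which is presumably the "timing info output" asked about above.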

@jon-chuang (Contributor)

I also notice this. It's ~270 ms/token for vicuna-13B and ~240 ms/token for llama-13B on my system. I'll look into it; I suspect the model graph itself has changed, perhaps due to the way they fine-tune - there may be more layers somewhere.
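
One way to test the extra-layers hypothesis directly is to read the hyperparameters out of the two model files. A sketch, assuming the 2023-era GGML/GGMF/GGJT file layout llama.cpp used at the time (a uint32 magic, an optional version word, then seven little-endian int32 hyperparameters); the path is hypothetical:

```python
# Sketch: read hyperparameters from an old-style llama.cpp model file.
# Assumes the 2023-era layout: uint32 magic, optional uint32 version
# (ggmf/ggjt only), then seven little-endian int32 hyperparameters.
import struct

MAGIC_GGML = 0x67676D6C  # 'ggml', unversioned
MAGIC_GGMF = 0x67676D66  # 'ggmf', versioned
MAGIC_GGJT = 0x67676A74  # 'ggjt', versioned, mmap-friendly

def read_hparams(path):
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
        if magic in (MAGIC_GGMF, MAGIC_GGJT):
            f.read(4)  # skip the file-format version word
        elif magic != MAGIC_GGML:
            raise ValueError(f"{path}: unrecognized magic {magic:#010x}")
        names = ("n_vocab", "n_embd", "n_mult", "n_head",
                 "n_layer", "n_rot", "ftype")
        return dict(zip(names, struct.unpack("<7i", f.read(28))))

print(read_hparams("models/ggml-vicuna-13b-4bit.bin"))  # hypothetical path
```

If n_layer is the same for both files (40 is the standard value for 13B LLaMA models), extra layers are ruled out, leaving the ftype field as a likely suspect.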

wro52 (Author) commented Apr 20, 2023

It seems that it depends on the quantization used. Every model using Q4_0 is faster than the models not using Q4_0.
I tested this with vicuna, alpaca, llama, and various other models, and it holds.
It is indicated by ftype = 2 (mostly Q4_0) in the latest program versions.
So I guess Q4_0 is ideal for running models on average office computers without sophisticated hardware.
Which also might be of importance for schools and the average poor student...
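
For reference, a hedged mapping of the low ftype values as they stood in spring-2023 llama.cpp (the llama_ftype enum; later releases appended more quantization types). This reuses read_hparams from the sketch above, with a hypothetical path:

```python
# Hedged mapping of ftype values as of spring-2023 llama.cpp;
# later releases added further quantization types beyond these.
FTYPE_NAMES = {
    0: "all F32",
    1: "mostly F16",
    2: "mostly Q4_0",
    3: "mostly Q4_1",
}

hp = read_hparams("models/ggml-alpaca-7b-q4.bin")  # hypothetical path
label = FTYPE_NAMES.get(hp["ftype"], f"unknown ({hp['ftype']})")
print(f"n_layer={hp['n_layer']}  ftype={hp['ftype']} ({label})")
```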

github-actions bot added the stale label Mar 25, 2024

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.
