perf: parallelize quantization #906

Closed
jon-chuang opened this issue Apr 12, 2023 · 3 comments
Labels
performance Speed related topics

Comments

@jon-chuang
Contributor

jon-chuang commented Apr 12, 2023

https://github.com/ggerganov/llama.cpp/blob/8b679987cdce292ff36bd741f6715e4927e26f9b/llama.cpp#L1558

This is currently single-threaded, and quantization is quite slow (vicuna 7B: 65156.31 ms, vicuna 13B: 129902.48 ms).
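The work is embarrassingly parallel: each row (or block) of a weight tensor can be quantized independently. Below is a minimal sketch of that idea, assuming a toy 8-bit absmax per-row quantizer; it is not the llama.cpp code, and `quantize_row` / `quantize_parallel` are illustrative names only.

```cpp
// Sketch only: rows of a weight matrix are independent, so the quantization
// loop can be split across worker threads. quantize_row() is a toy 8-bit
// absmax quantizer standing in for ggml's real per-block routines.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

static void quantize_row(const float * src, int8_t * dst, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(src[i]));
    const float scale = amax > 0.0f ? 127.0f / amax : 0.0f;
    // A real quantizer would also store the scale; omitted to keep the sketch short.
    for (int i = 0; i < n; ++i) dst[i] = (int8_t) std::lround(src[i] * scale);
}

static void quantize_parallel(const float * src, int8_t * dst,
                              int n_rows, int n_per_row, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=]() {
            // Interleaved assignment: thread t handles rows t, t + n_threads, ...
            for (int r = t; r < n_rows; r += n_threads) {
                quantize_row(src + (std::size_t) r * n_per_row,
                             dst + (std::size_t) r * n_per_row, n_per_row);
            }
        });
    }
    for (auto & w : workers) w.join();
}
```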

sw added the performance (Speed related topics) label on Apr 12, 2023
@sw
Contributor

sw commented Apr 12, 2023

@ikawrakow did that in #896, see kQuantizeQ4 in ggml_extra.cpp, but that's for a new quantization scheme. https://github.com/ggerganov/llama.cpp/blob/6bfb00a53b1a06e209f1b814356dd79ee96b89af/ggml_extra.cpp#L287-L291

It did indeed speed things up. This could probably be integrated into llama_model_quantize_internal so that a new cpp module isn't necessary.

@jon-chuang
Contributor Author

jon-chuang commented Apr 12, 2023

Is the new quantization scheme the one that minimizes MSE against the original weights?
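(For anyone following along, my reading of "minimizes MSE" is roughly the idea sketched below: rather than deriving the block scale directly from the absolute maximum, try several candidate scales and keep the one with the lowest squared reconstruction error. This is a sketch of the concept only, not the kQuantizeQ4 code from #896.)

```cpp
// Sketch of an MSE-minimizing scale search for one block of signed 4-bit
// values in [-7, 7]; illustrative only, not the actual #896 implementation.
#include <algorithm>
#include <cmath>
#include <vector>

static float best_scale_mse(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) return 0.0f;

    float best_scale = amax / 7.0f;  // naive absmax scale
    float best_err   = INFINITY;
    for (int step = 0; step <= 20; ++step) {
        // Candidate scales from 0.5x to 1.5x of the naive scale.
        const float scale = (0.5f + 0.05f * step) * amax / 7.0f;
        float err = 0.0f;
        for (float v : x) {
            int q = std::max(-7, std::min(7, (int) std::lround(v / scale)));
            const float d = v - q * scale;
            err += d * d;
        }
        if (err < best_err) { best_err = err; best_scale = scale; }
    }
    return best_scale;
}
```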

@sw
Contributor

sw commented Apr 22, 2023

Resolved by #1075
