perf: parallelize quantization #906

Closed
jon-chuang opened this issue Apr 12, 2023 · 3 comments
Labels
performance Speed related topics

Comments

@jon-chuang
Contributor

jon-chuang commented Apr 12, 2023

https://github.com/ggerganov/llama.cpp/blob/8b679987cdce292ff36bd741f6715e4927e26f9b/llama.cpp#L1558

This is currently single-threaded, and quantization is quite slow (vicuna 7B: 65156.31 ms, vicuna 13B: 129902.48 ms).
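The work is embarrassingly parallel: each row (or block) of a weight tensor can be quantized independently. Below is a minimal sketch of that idea, assuming a toy 8-bit absmax per-row quantizer; it is not the llama.cpp code, and `quantize_row` / `quantize_parallel` are illustrative names only.

```cpp
// Sketch only: rows of a weight matrix are independent, so the quantization
// loop can be split across worker threads. quantize_row() is a toy 8-bit
// absmax quantizer standing in for ggml's real per-block routines.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

static void quantize_row(const float * src, int8_t * dst, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(src[i]));
    const float scale = amax > 0.0f ? 127.0f / amax : 0.0f;
    // A real quantizer would also store the scale; omitted to keep the sketch short.
    for (int i = 0; i < n; ++i) dst[i] = (int8_t) std::lround(src[i] * scale);
}

static void quantize_parallel(const float * src, int8_t * dst,
                              int n_rows, int n_per_row, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=]() {
            // Interleaved assignment: thread t handles rows t, t + n_threads, ...
            for (int r = t; r < n_rows; r += n_threads) {
                quantize_row(src + (std::size_t) r * n_per_row,
                             dst + (std::size_t) r * n_per_row, n_per_row);
            }
        });
    }
    for (auto & w : workers) w.join();
}
```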

sw added the performance (Speed related topics) label on Apr 12, 2023
@sw
Contributor

sw commented Apr 12, 2023

@ikawrakow did that in #896, see kQuantizeQ4 in ggml_extra.cpp, but that's for a new quantization scheme. https://github.com/ggerganov/llama.cpp/blob/6bfb00a53b1a06e209f1b814356dd79ee96b89af/ggml_extra.cpp#L287-L291

It did indeed speed things up. This could probably be integrated into llama_model_quantize_internal so that a new cpp module isn't necessary.

@jon-chuang
Contributor Author

jon-chuang commented Apr 12, 2023

Is the new quantization scheme the one that minimizes MSE against the original weights?
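(For anyone following along, my reading of "minimizes MSE" is roughly the idea sketched below: rather than deriving the block scale directly from the absolute maximum, try several candidate scales and keep the one with the lowest squared reconstruction error. This is a sketch of the concept only, not the kQuantizeQ4 code from #896.)

```cpp
// Sketch of an MSE-minimizing scale search for one block of signed 4-bit
// values in [-7, 7]; illustrative only, not the actual #896 implementation.
#include <algorithm>
#include <cmath>
#include <vector>

static float best_scale_mse(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));
    if (amax == 0.0f) return 0.0f;

    float best_scale = amax / 7.0f;  // naive absmax scale
    float best_err   = INFINITY;
    for (int step = 0; step <= 20; ++step) {
        // Candidate scales from 0.5x to 1.5x of the naive scale.
        const float scale = (0.5f + 0.05f * step) * amax / 7.0f;
        float err = 0.0f;
        for (float v : x) {
            int q = std::max(-7, std::min(7, (int) std::lround(v / scale)));
            const float d = v - q * scale;
            err += d * d;
        }
        if (err < best_err) { best_err = err; best_scale = scale; }
    }
    return best_scale;
}
```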

@sw
Contributor

sw commented Apr 22, 2023

Resolved by #1075
