
Simplify the quantization process #463


Closed

nullhook opened this issue Mar 24, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@nullhook

The current quantization call stack is long and difficult to debug, which makes extending it or adding new quantization methods a major undertaking: changes have to be made in several places at once.

Additionally, we should aim to add drivers that help with benchmarking various quantization methods.

The current stack:

  1. quantize.py invokes the quantize binary
  2. quantize.cpp reads the model and logs metrics
  3. llama.cpp loads the model weights, checks the quantization type, and dispatches to the quantization function
  4. ggml.c performs the actual quantization (sketched below)
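For context on step 4: the quantization in ggml.c is block-wise. Below is a minimal, self-contained sketch of the q4_0-style scheme (32 floats per block, one f32 scale plus 4-bit values packed two per byte). The constants follow my recollection of the original kernel; the real code is SIMD-optimized, so treat this as illustrative only:

```cpp
// Toy reimplementation of q4_0-style block quantization (illustrative;
// not the ggml code, which is vectorized with AVX/NEON).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int QK = 32; // block size used by q4_0

struct block_q4_0 {
    float   d;          // per-block scale
    uint8_t qs[QK / 2]; // QK 4-bit quants, packed two per byte
};

static block_q4_0 quantize_block_q4_0(const float * x) {
    block_q4_0 out{};

    float amax = 0.0f; // absolute max of the block
    for (int i = 0; i < QK; i++) {
        amax = std::max(amax, std::fabs(x[i]));
    }

    const float d  = amax / 7.0f;          // map [-amax, amax] onto [-7, 7]
    const float id = d ? 1.0f / d : 0.0f;
    out.d = d;

    for (int i = 0; i < QK; i += 2) {
        // round to a signed 4-bit value, bias by 8 so it stores as unsigned
        const uint8_t v0 = (uint8_t)(std::lround(x[i + 0] * id) + 8);
        const uint8_t v1 = (uint8_t)(std::lround(x[i + 1] * id) + 8);
        out.qs[i / 2] = v0 | (v1 << 4);
    }
    return out;
}

int main() {
    std::vector<float> x(QK);
    for (int i = 0; i < QK; i++) x[i] = std::sin(0.3f * i); // toy weights

    const block_q4_0 b = quantize_block_q4_0(x.data());

    // dequantize and report the round-trip error
    float err = 0.0f;
    for (int i = 0; i < QK; i++) {
        const int   q = (i % 2 == 0) ? (b.qs[i / 2] & 0x0F) : (b.qs[i / 2] >> 4);
        const float y = (q - 8) * b.d;
        err += (x[i] - y) * (x[i] - y);
    }
    std::printf("q4_0 round-trip rmse: %f\n", std::sqrt(err / QK));
    return 0;
}
```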

I'm open to suggestions here and would like to hear whether this is worth investing our time and effort in.
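On the benchmarking point above: one cheap shape for such a driver is a registry of schemes, each scored on round-trip error and throughput over synthetic weights. A hedged sketch, with a toy 8-bit scheme standing in for the real ggml kernels (which an actual driver would register instead):

```cpp
// Sketch of a quantization benchmark driver: a registry of schemes, each
// measured for round-trip RMSE and speed. The q8 scheme below is a toy
// placeholder, not a ggml kernel.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <random>
#include <string>
#include <vector>

struct QuantScheme {
    std::string name;
    // quantize n floats and return the dequantized reconstruction
    std::function<std::vector<float>(const float *, int)> round_trip;
};

// toy 8-bit symmetric quantization over 32-element blocks
static std::vector<float> round_trip_q8(const float * x, int n) {
    std::vector<float> y(n);
    for (int b = 0; b < n; b += 32) {
        float amax = 0.0f;
        for (int i = b; i < b + 32; i++) amax = std::max(amax, std::fabs(x[i]));
        const float d  = amax / 127.0f;
        const float id = d ? 1.0f / d : 0.0f;
        for (int i = b; i < b + 32; i++) {
            const int8_t q = (int8_t) std::lround(x[i] * id);
            y[i] = q * d;
        }
    }
    return y;
}

int main() {
    const int n = 1 << 20; // 1M synthetic weights, a multiple of the block size
    std::vector<float> x(n);
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    for (auto & v : x) v = dist(rng);

    const std::vector<QuantScheme> schemes = { {"q8_toy", round_trip_q8} };

    for (const auto & s : schemes) {
        const auto t0 = std::chrono::steady_clock::now();
        const std::vector<float> y = s.round_trip(x.data(), n);
        const auto t1 = std::chrono::steady_clock::now();

        double err = 0.0;
        for (int i = 0; i < n; i++) err += (x[i] - y[i]) * (double)(x[i] - y[i]);
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%-8s rmse=%.5f  %.1f ms\n", s.name.c_str(), std::sqrt(err / n), ms);
    }
    return 0;
}
```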

@nullhook nullhook changed the title Simplify the Quantization Process Simplify the quantization process Mar 24, 2023
@gjmulder gjmulder added the enhancement New feature or request label Mar 24, 2023
@sw
Contributor

sw commented Mar 25, 2023

Agreed; I was recently confused by the various type ids (ggml_type vs. model.hparams.f16, which doesn't have an enum).
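For concreteness, a sketch of the mismatch (the f16 values below match the quantize usage string of the time; the enum only mirrors the relevant ggml_type members and is not the real ggml.h):

```cpp
#include <cstdlib>

// Subset of ggml_type -- the real enum lives in ggml.h; member order and
// values here are illustrative, not the actual ones.
enum ggml_type {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    GGML_TYPE_Q4_0,
    GGML_TYPE_Q4_1,
};

// model.hparams.f16 is a bare int read from the model file:
// 0 = f32, 1 = f16, 2 = q4_0, 3 = q4_1 -- no enum anywhere.
static enum ggml_type wtype_from_hparams_f16(int f16) {
    switch (f16) {
        case 0: return GGML_TYPE_F32;
        case 1: return GGML_TYPE_F16;
        case 2: return GGML_TYPE_Q4_0;
        case 3: return GGML_TYPE_Q4_1;
    }
    std::abort(); // unknown value: fail loudly rather than mis-quantize
}
```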

Though I think that for performance reasons you can't really put too much abstraction into the vector dot product in ggml.c.
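That matches the pattern ggml already uses: per-type function pointers are resolved once per operation, while each dot-product body stays hand-specialized with no indirection inside the hot loop. A minimal sketch of that pattern (names invented, not ggml's):

```cpp
#include <cstdio>

typedef void (*vec_dot_fn)(int n, float * s, const void * vx, const void * vy);

// Specialized inner loop for one (toy) type -- in ggml this is where the
// AVX/NEON code lives, with no indirect calls per element.
static void vec_dot_f32_sketch(int n, float * s, const void * vx, const void * vy) {
    const float * x = (const float *) vx;
    const float * y = (const float *) vy;
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += x[i] * y[i];
    *s = sum;
}

struct type_traits_sketch {
    const char * name;
    vec_dot_fn   vec_dot;
};

static const type_traits_sketch traits[] = {
    { "f32", vec_dot_f32_sketch },
    // { "q4_0", vec_dot_q4_0_sketch }, // each quantized type gets its own kernel
};

int main() {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, s = 0;
    traits[0].vec_dot(4, &s, a, b); // dispatch cost paid once per row, not per element
    std::printf("dot = %f\n", s);   // 20.0
    return 0;
}
```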

quantize.py could probably be removed if we manage to make quantize.cpp just a bit more user-friendly. Or make llama.cpp an executable and get rid of quantize.cpp too.

@anzz1
Contributor

anzz1 commented Mar 26, 2023

> quantize.py could probably be removed if we manage to make quantize.cpp just a bit more user-friendly.

That is true; quantize.py is a wholly unnecessary step.

> Or make llama.cpp an executable and get rid of quantize.cpp too.

That would be going backwards; the reason for llama.cpp to exist is to provide a common C API that can be used by 'apps' like main, quantize, perplexity, etc.

ggml is shared with whisper.cpp, so it needs to exist.

When you look at quantize.cpp, there is really no logic there; it's just a wrapper for calling the API. So after removing quantize.py, the number of steps is really two: llama.cpp (the llama API) and ggml.c (the ggml API).

The way I see it, it's completely the opposite: having these two APIs shared by everything makes changing things easier. When in the future there are inevitably many more apps than the current main, quantize, and perplexity, imagine having to change every single one of them instead of just the API. I just can't see how that would be a better option.
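anzz1's "no logic there" point, in code: at the time quantize.cpp was essentially an argv shim around a single llama.h call. The commented-out signature is my recollection of the March 2023 header, so verify it against llama.h before relying on it:

```cpp
#include <cstdio>
#include <cstdlib>

// from llama.h (approximate):
// int llama_model_quantize(const char * fname_inp, const char * fname_out, int itype);

int main(int argc, char ** argv) {
    if (argc != 4) {
        std::fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
        std::fprintf(stderr, "  type = 2 - q4_0, type = 3 - q4_1\n");
        return 1;
    }
    const int itype = std::atoi(argv[3]);
    // return llama_model_quantize(argv[1], argv[2], itype); // the one real call
    (void) itype;
    return 0;
}
```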

@prusnak
Collaborator

prusnak commented Mar 30, 2023

quantize.py is not needed anymore (it was even dropped from the repo), so we have already removed one step from the stack.
