
Simplify the quantization process #463


Closed

nullhook opened this issue Mar 24, 2023 · 3 comments
Labels
enhancement New feature or request

Comments

@nullhook

The current quantization call stack is long and difficult to debug, which makes extending it or adding new quantization methods a major undertaking: changes have to be made in several places at once.

Additionally, we should aim to add drivers that help with benchmarking various quantization methods.

The current stack:

  1. quantize.py invokes the quantize binary
  2. quantize.cpp reads the model and logs metrics
  3. llama.cpp loads the model weights, checks the quantization type, and dispatches to the quantization function
  4. ggml.c performs the actual quantization (sketched below)
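For context on step 4: the quantization in ggml.c is block-wise. Below is a minimal, self-contained sketch of the q4_0-style scheme (32 floats per block, one f32 scale plus 4-bit values packed two per byte). The constants follow my recollection of the original kernel; the real code is SIMD-optimized, so treat this as illustrative only:

```cpp
// Toy reimplementation of q4_0-style block quantization (illustrative;
// not the ggml code, which is vectorized with AVX/NEON).
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

constexpr int QK = 32; // block size used by q4_0

struct block_q4_0 {
    float   d;          // per-block scale
    uint8_t qs[QK / 2]; // QK 4-bit quants, packed two per byte
};

static block_q4_0 quantize_block_q4_0(const float * x) {
    block_q4_0 out{};

    float amax = 0.0f; // absolute max of the block
    for (int i = 0; i < QK; i++) {
        amax = std::max(amax, std::fabs(x[i]));
    }

    const float d  = amax / 7.0f;          // map [-amax, amax] onto [-7, 7]
    const float id = d ? 1.0f / d : 0.0f;
    out.d = d;

    for (int i = 0; i < QK; i += 2) {
        // round to a signed 4-bit value, bias by 8 so it stores as unsigned
        const uint8_t v0 = (uint8_t)(std::lround(x[i + 0] * id) + 8);
        const uint8_t v1 = (uint8_t)(std::lround(x[i + 1] * id) + 8);
        out.qs[i / 2] = v0 | (v1 << 4);
    }
    return out;
}

int main() {
    std::vector<float> x(QK);
    for (int i = 0; i < QK; i++) x[i] = std::sin(0.3f * i); // toy weights

    const block_q4_0 b = quantize_block_q4_0(x.data());

    // dequantize and report the round-trip error
    float err = 0.0f;
    for (int i = 0; i < QK; i++) {
        const int   q = (i % 2 == 0) ? (b.qs[i / 2] & 0x0F) : (b.qs[i / 2] >> 4);
        const float y = (q - 8) * b.d;
        err += (x[i] - y) * (x[i] - y);
    }
    std::printf("q4_0 round-trip rmse: %f\n", std::sqrt(err / QK));
    return 0;
}
```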

I'm open to suggestions here and would like to hear whether this is worth investing our time and effort in.
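On the benchmarking point above: one cheap shape for such a driver is a registry of schemes, each scored on round-trip error and throughput over synthetic weights. A hedged sketch, with a toy 8-bit scheme standing in for the real ggml kernels (which an actual driver would register instead):

```cpp
// Sketch of a quantization benchmark driver: a registry of schemes, each
// measured for round-trip RMSE and speed. The q8 scheme below is a toy
// placeholder, not a ggml kernel.
#include <algorithm>
#include <chrono>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <functional>
#include <random>
#include <string>
#include <vector>

struct QuantScheme {
    std::string name;
    // quantize n floats and return the dequantized reconstruction
    std::function<std::vector<float>(const float *, int)> round_trip;
};

// toy 8-bit symmetric quantization over 32-element blocks
static std::vector<float> round_trip_q8(const float * x, int n) {
    std::vector<float> y(n);
    for (int b = 0; b < n; b += 32) {
        float amax = 0.0f;
        for (int i = b; i < b + 32; i++) amax = std::max(amax, std::fabs(x[i]));
        const float d  = amax / 127.0f;
        const float id = d ? 1.0f / d : 0.0f;
        for (int i = b; i < b + 32; i++) {
            const int8_t q = (int8_t) std::lround(x[i] * id);
            y[i] = q * d;
        }
    }
    return y;
}

int main() {
    const int n = 1 << 20; // 1M synthetic weights, a multiple of the block size
    std::vector<float> x(n);
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);
    for (auto & v : x) v = dist(rng);

    const std::vector<QuantScheme> schemes = { {"q8_toy", round_trip_q8} };

    for (const auto & s : schemes) {
        const auto t0 = std::chrono::steady_clock::now();
        const std::vector<float> y = s.round_trip(x.data(), n);
        const auto t1 = std::chrono::steady_clock::now();

        double err = 0.0;
        for (int i = 0; i < n; i++) err += (x[i] - y[i]) * (double)(x[i] - y[i]);
        const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("%-8s rmse=%.5f  %.1f ms\n", s.name.c_str(), std::sqrt(err / n), ms);
    }
    return 0;
}
```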

@nullhook nullhook changed the title Simplify the Quantization Process Simplify the quantization process Mar 24, 2023
@gjmulder gjmulder added the enhancement New feature or request label Mar 24, 2023
@sw
Contributor

sw commented Mar 25, 2023

Agreed; I was recently confused by the various type ids (ggml_type vs. model.hparams.f16, which doesn't have an enum).
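For concreteness, a sketch of the mismatch (the f16 values below match the quantize usage string of the time; the enum only mirrors the relevant ggml_type members and is not the real ggml.h):

```cpp
#include <cstdlib>

// Subset of ggml_type -- the real enum lives in ggml.h; member order and
// values here are illustrative, not the actual ones.
enum ggml_type {
    GGML_TYPE_F32,
    GGML_TYPE_F16,
    GGML_TYPE_Q4_0,
    GGML_TYPE_Q4_1,
};

// model.hparams.f16 is a bare int read from the model file:
// 0 = f32, 1 = f16, 2 = q4_0, 3 = q4_1 -- no enum anywhere.
static enum ggml_type wtype_from_hparams_f16(int f16) {
    switch (f16) {
        case 0: return GGML_TYPE_F32;
        case 1: return GGML_TYPE_F16;
        case 2: return GGML_TYPE_Q4_0;
        case 3: return GGML_TYPE_Q4_1;
    }
    std::abort(); // unknown value: fail loudly rather than mis-quantize
}
```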

Though I think that for performance reasons you can't really put too much abstraction into the vector dot product in ggml.c.
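That matches the pattern ggml already uses: per-type function pointers are resolved once per operation, while each dot-product body stays hand-specialized with no indirection inside the hot loop. A minimal sketch of that pattern (names invented, not ggml's):

```cpp
#include <cstdio>

typedef void (*vec_dot_fn)(int n, float * s, const void * vx, const void * vy);

// Specialized inner loop for one (toy) type -- in ggml this is where the
// AVX/NEON code lives, with no indirect calls per element.
static void vec_dot_f32_sketch(int n, float * s, const void * vx, const void * vy) {
    const float * x = (const float *) vx;
    const float * y = (const float *) vy;
    float sum = 0.0f;
    for (int i = 0; i < n; i++) sum += x[i] * y[i];
    *s = sum;
}

struct type_traits_sketch {
    const char * name;
    vec_dot_fn   vec_dot;
};

static const type_traits_sketch traits[] = {
    { "f32", vec_dot_f32_sketch },
    // { "q4_0", vec_dot_q4_0_sketch }, // each quantized type gets its own kernel
};

int main() {
    float a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1}, s = 0;
    traits[0].vec_dot(4, &s, a, b); // dispatch cost paid once per row, not per element
    std::printf("dot = %f\n", s);   // 20.0
    return 0;
}
```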

quantize.py could probably be removed if we manage to make quantize.cpp just a bit more user-friendly. Or make llama.cpp an executable and get rid of quantize.cpp too.

@anzz1
Contributor

anzz1 commented Mar 26, 2023

> quantize.py could probably be removed if we manage to make quantize.cpp just a bit more user-friendly.

That is true; quantize.py is a wholly unnecessary step.

> Or make llama.cpp an executable and get rid of quantize.cpp too.

That would be going backwards; the reason for llama.cpp to exist is to provide a common C API that can be used by 'apps' like main, quantize, perplexity, etc.

ggml is shared with whisper.cpp, so it needs to exist.

When you look at quantize.cpp, there is really no logic there; it's just a wrapper for calling the API. So after removing quantize.py, the number of steps is really two: llama.cpp (the llama API) and ggml.c (the ggml API).

The way I see it, it's completely the opposite: having these two APIs shared by everything makes changing things easier. When in the future there are inevitably many more apps than the current main, quantize, and perplexity, imagine having to change every single one of them instead of just the API. I just can't see how that would be a better option.
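anzz1's "no logic there" point, in code: at the time quantize.cpp was essentially an argv shim around a single llama.h call. The commented-out signature is my recollection of the March 2023 header, so verify it against llama.h before relying on it:

```cpp
#include <cstdio>
#include <cstdlib>

// from llama.h (approximate):
// int llama_model_quantize(const char * fname_inp, const char * fname_out, int itype);

int main(int argc, char ** argv) {
    if (argc != 4) {
        std::fprintf(stderr, "usage: %s model-f32.bin model-quant.bin type\n", argv[0]);
        std::fprintf(stderr, "  type = 2 - q4_0, type = 3 - q4_1\n");
        return 1;
    }
    const int itype = std::atoi(argv[3]);
    // return llama_model_quantize(argv[1], argv[2], itype); // the one real call
    (void) itype;
    return 0;
}
```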

@prusnak
Collaborator

prusnak commented Mar 30, 2023

quantize.py is not needed anymore (it was even dropped from the repo), so we have already removed one step from the stack.
