Q4 quantization support #197
Converted to draft. I will only merge this after it has been showcased in a real model example. |
I've been thinking about this some more from We'd need to store the hyperparameters and vocabulary ( |
Ideally the vocabulary would be in What do you mean by hyperparameters? Supporting Q4 will require a bit more work. I've taken a deep dive into it, and it's not exactly n bits per parameter. It's more like n bits per group of 32 q4 values. And the number is not the same for q4_0 and q4_1, which I think would be more correctly named q4_0_32 and q4_1_32, since the packing size is quite critical (ggerganov/llama.cpp#1004). |
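To make the "n bits per group of 32" point concrete, here is a small arithmetic sketch. The field widths are assumptions for illustration (a 4-byte float scale, optionally a 4-byte float offset); the thread does not state the exact ggml layouts, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(bits, group_size, scale_bytes, offset_bytes=0):
    """Effective storage cost per weight for group-wise quantization.

    Each group stores `group_size` codes of `bits` bits, plus per-group
    metadata (a scale, and optionally an offset / zero point).
    """
    payload_bits = bits * group_size
    metadata_bits = 8 * (scale_bytes + offset_bytes)
    return (payload_bits + metadata_bits) / group_size


# Assumed layouts: a float32 scale only, then a float32 scale plus float32 offset.
scale_only = bits_per_weight(4, 32, scale_bytes=4)
scale_and_offset = bits_per_weight(4, 32, scale_bytes=4, offset_bytes=4)
larger_group = bits_per_weight(4, 128, scale_bytes=4, offset_bytes=4)

print(scale_only, scale_and_offset, larger_group)  # 5.0 6.0 4.5
```

Under these assumptions a nominally 4-bit format costs 5 or 6 bits per weight at group size 32, which is why the group size arguably belongs in the type name.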
Hmm, fair enough. We prefer single-file deployments for the convenience, but it makes sense to have a standard here.
Vocabulary size, dimensions, heads, layers. The usual. I imagine that's part of HF's
Yeah, that makes sense. No rush on this, we'll support it when it's ready :) |
Currently yes.
Afaik, you always need the graph of computation too, which is neither included in It's currently not a goal to support single-file deployments for that reason (and because you can write a different/better program with the same weights, which we happen to do quite regularly). Please let me know when you have q4 support in whatever format, and I'll take a look at how to enable it here. |
ggml has added 2 packing formats. Those are better. q4_0, q4_2: q4_0: 32 ints packed Maybe a more descriptive name is better? |
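As a rough illustration of what "32 ints packed" can look like, the sketch below quantizes 32 floats to 4-bit codes that share one scale and packs two codes per byte. The block layout (float32 scale first, low nibble first) and the function names are assumptions for illustration, not the actual ggml layout.

```python
import struct

GROUP = 32  # weights per block


def quantize_group(values):
    """Quantize 32 floats to 4-bit codes sharing one float32 scale."""
    assert len(values) == GROUP
    amax = max(abs(v) for v in values)
    scale = amax / 7.0 if amax else 1.0
    # Signed range -8..7, shifted to 0..15 so each code fits in a nibble.
    codes = [max(-8, min(7, round(v / scale))) + 8 for v in values]
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, GROUP, 2))
    return struct.pack("<f", scale) + packed  # 4 + 16 = 20 bytes


def dequantize_group(block):
    """Inverse of quantize_group: 20 bytes back to 32 approximate floats."""
    (scale,) = struct.unpack_from("<f", block, 0)
    out = []
    for byte in block[4:]:
        out.append(((byte & 0x0F) - 8) * scale)
        out.append(((byte >> 4) - 8) * scale)
    return out


values = [i / 31 - 0.5 for i in range(GROUP)]
block = quantize_group(values)
restored = dequantize_group(block)
```

Because the scale lives inside the block, a loader has to know the block size to interpret the bytes, which is the argument for a more descriptive type name.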
I had in mind The thing is that this format packs the scale and zero point together. GPTQ splits those into different tensors: https://github.com/qwopqwop200/GPTQ-for-LLaMa I'm not sure how much the locality helps performance there. Also, adding new formats adds a lot of complexity to the types, especially a matrix of them: there are currently 3-, 4-, and 5-bit quantization schemes, with packing sizes of 16 and 32 (128 in GPTQ, and full row), each with either a scale or a scale plus zero point. None of them would be loadable in It's not at all a problem to add specific types, but since we have to maintain them until the end of time, I think it would be nice to do it once the community settles on common ground. My current understanding is that ggml is recommending q5_1_32, while GPTQ recommends q4_1_128 (GPTQ uses a different packing scheme that works better than the naive ggml one, hence the reduced bit size, iiuc). |
safetensors currently supports bfloat16, which is only supported by torch/tf, not numpy. The problem is that the official loader is too restrictive: when it meets an unknown type, it just gives up. The upper-case dtype naming due to serde is weird too. ( Maybe we should have a place to document custom types, something like an IANA registry for types. Features:
The loader code is simple enough that applications can write their own. I already made a tool to quantize safetensors models to every quantized format ggml supports: https://github.com/iacore/model-conversions/tree/main/quantize-wizard |
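To show how small an application-side loader can be, here is a minimal sketch that reads the safetensors container (an 8-byte little-endian header length, a JSON header, then the raw byte buffer) and returns the bytes of every tensor without interpreting the dtype. The dtype string `"Q4_0"` and the shape used for it are hypothetical, and this sketch skips the size validation that the official loader performs.

```python
import json
import struct


def build_file(tensors):
    """Serialize name -> (dtype string, shape list, raw bytes) into one blob."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header, separators=(",", ":")).encode("utf-8")
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_raw(blob):
    """Return name -> (dtype, shape, raw bytes), leaving unknown dtypes opaque."""
    (header_len,) = struct.unpack_from("<Q", blob, 0)
    header = json.loads(blob[8:8 + header_len])
    data = blob[8 + header_len:]
    out = {}
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form string metadata
            continue
        begin, end = info["data_offsets"]
        out[name] = (info["dtype"], info["shape"], data[begin:end])
    return out


blob = build_file({
    "w": ("Q4_0", [32], bytes(20)),           # hypothetical custom dtype
    "b": ("F32", [1], b"\x00\x00\x80\x3f"),   # float32 1.0, little-endian
})
tensors = read_raw(blob)
```

The caller decides what to do with a dtype it does not recognize, instead of the loader rejecting the whole file.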
I know, which is why I said it's not a blocker to add custom types (it's just still something to think about when adding things). Here
I like the idea, but it wouldn't work for GPTQ for instance, since GPTQ splits the packed quantized unit and the scales and zeros into different tensors. Because the "unit" is not even in a single tensor there. Maybe this splitting is just a bad idea, I haven't formally checked this yet. (Meaning we could stop thinking about GPTQ there and the idea you suggest works.) |
No. Quantization is useful for RWKV (not a transformer). Maybe it's useful for other ANNs as well.
How does it work? What's the quantized struct in C? |
Complete story: https://github.com/qwopqwop200/GPTQ-for-LLaMa/blob/triton/quant/quant_linear.py#L73 Simple story: they store zeros and scales in a different tensor altogether from the 4-bit packed "weights" tensor. It allows for non-linear mapping of packings, which is an important aspect of the method, where they pack quantization with respect to activations, which supposedly handles outliers better (and hence gives less variance in degradation when doing the quantization). |
That seems easier. Just store them as different tensors inside .safetensors. |
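A minimal sketch of the "different tensors" idea: group-wise 4-bit quantization where the packed codes, the scales, and the zero points are kept as three separate arrays, each of which could be stored as its own ordinary tensor. The names `qweight`, `scales`, and `qzeros` are illustrative, the group size is a parameter, and this is plain min/max quantization, not the GPTQ algorithm or its exact layout.

```python
def quantize_split(weights, group_size, bits=4):
    """Quantize per group; return codes, scales and zero points separately."""
    assert bits == 4 and len(weights) % group_size == 0 and len(weights) % 2 == 0
    levels = (1 << bits) - 1  # 15
    codes, scales, zeros = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # avoid a zero scale for flat groups
        zero = round(-lo / scale)          # integer zero point
        scales.append(scale)
        zeros.append(zero)
        codes.extend(max(0, min(levels, round(w / scale + zero))) for w in group)
    # Two 4-bit codes per byte, low nibble first.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return {"qweight": packed, "scales": scales, "qzeros": zeros}


def dequantize_split(tensors, group_size):
    """Rebuild approximate weights from the three separate arrays."""
    out = []
    for i, byte in enumerate(tensors["qweight"]):
        for j, code in enumerate((byte & 0x0F, byte >> 4)):
            g = (2 * i + j) // group_size
            out.append(tensors["scales"][g] * (code - tensors["qzeros"][g]))
    return out


weights = [0.1 * k - 0.3 for k in range(8)]
split = quantize_split(weights, group_size=4)
restored = dequantize_split(split, group_size=4)
```

With this split, the packed codes are just a byte tensor and the scales are a float tensor, so no new dtype is needed, at the cost of the loader having to know which tensors belong together.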
FYI GGML just got the ability to load/export graphs. It's not exactly what was discussed here but it might be usable for inference. |
Temporary PR, need to figure out a way to make sure this is usable in practice.
Either make the format work for llama.cpp &co (but the models over there include tokenization so...)
Or make something like smelt work with quantized data.