Quantization does not write the quantization version to ftype
#1590
Comments
Looking at llm, it seems that the intended format for the ftype is to also encode the quantization version (…).

As far as I can see, this also requires updating the loader in llama.cpp (…).

So, I think this would need to be treated as a breaking change, as the code currently neither writes nor expects to read the quantization version.
What we did was to treat GGJTv2 as QNT1 and GGJTv3 as QNT2; the llama.cpp loader can make the same assumption and behave correctly from now on (including raising an error if the versions don't match). The only point of concern is that other llama.cpp-based loaders aren't going to modulo the ftype until they update, so they'll read an invalid ftype. In practice, I'm not sure if this would be an issue; IIRC the ftype isn't actually used much, as the tensors carry their own type and the ftype isn't used to decide anything important. (I could be wrong! I haven't looked into this recently.) @LostRuins, would koboldcpp have any issues with this? Does it attach any existing semantic meaning to the ftype?
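For illustration, here is a minimal sketch of the version mapping described above (this is not code from llm or llama.cpp, and the helper name is made up; the GGJTv1 case is an assumption):

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper: derive the quantization version from the GGJT file
// version, treating GGJTv2 as QNT1 and GGJTv3 as QNT2 as described above.
static uint32_t qnt_version_from_ggjt_version(uint32_t ggjt_version) {
    switch (ggjt_version) {
        case 1: return 0; // GGJTv1: assumed to predate the quantization breaks
        case 2: return 1; // GGJTv2 -> QNT1
        case 3: return 2; // GGJTv3 -> QNT2
        default:
            throw std::runtime_error("unsupported GGJT file version");
    }
}
```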
@philpax No, there are no issues with determining the ftype in the file for me so far. Modulo for the ftype is only required for ggml magic files (…); this is because … So you should not be using …

My implementation: https://github.com/LostRuins/koboldcpp/blob/concedo/model_adapter.cpp#L220
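A minimal sketch of the modulo split being described, assuming the GGML_QNT_VERSION_FACTOR = 1000 scheme from ggml.h (the struct and helper here are hypothetical, not existing API):

```cpp
#include <cstdint>

#define GGML_QNT_VERSION_FACTOR 1000 // mirrors the constant in ggml.h

struct decoded_ftype {
    uint32_t ftype;       // base ftype, e.g. 2 for q4_0
    uint32_t qnt_version; // quantization format version
};

// Split a raw ftype read from a ggml-magic file into its two components.
static decoded_ftype decode_ftype(uint32_t raw_ftype) {
    return {
        raw_ftype % GGML_QNT_VERSION_FACTOR,
        raw_ftype / GGML_QNT_VERSION_FACTOR,
    };
}
```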
I suspect that wasn't an intentional design choice, as … (If this leads to a format break - which I hope it doesn't, or if it does, it's seamless - I'd really strongly suggest splitting up the ftype/qnt version; there's no reason to multiplex if strict compatibility isn't necessary.)
The explanation by @LostRuins is correct - for … We can do the same encoding for the … But adding a …
Ah okay, that makes sense. Well, given that, it'd be nice to write the quantization version in the ftype, but it doesn't matter too much as we can determine it from the format version.

With that being said, if #1575 (comment) / ggerganov/ggml#147 gets implemented, I'd suggest moving the file format loader/saver to … This would have a few benefits: …
@philpax I'm personally more in the camp of "if it's not broken, don't fix it" - so given that the version+ftype multiplex was already added to existing ggml and is currently working, I'm not really in favor of changing it. I'm wondering what you mean by "tell user it's unsupported instead of guessing"; if everything uses the same ggjt magic, you still won't be able to tell a NeoX from a ggjt model except by maybe binning the vocab ranges.

Edit: That said, switching from a packed struct to the key-value metadata you suggested in #1575 seems like a good idea if it can be implemented robustly. A few fields should be enforced for a model to be considered compliant. It would certainly be good to definitively tell apart the different architectures, such as GPT2 and GPTJ. A final magic change should be considered if this ends up being implemented.
Yup, you got it - if we have kv metadata, we can store the architecture in there too, and that should finally solve the issues we've been having around identifying models. |
But it has to be consistent. Leaving the standard freely extensible but undefined can rapidly lead to fracturing formats as each implementation adds its own keys. That's how you get monstrosities in modern JavaScript such as …

The last thing we need is for some models to use "…"

Random ideas: should the keys be stored as an ASCII string (fixed length? NUL-terminated?) Will it handle Unicode? (JSON keys permit that.) What's your opinion on case-insensitive keys?
Hm yeah, good points. Okay, how about snake_case ASCII keys with no null termination (the length comes before the string, and I think that's consistent with the tensors)? If there's demand for Unicode keys, that can be added later, but I've not seen much of that in the HF configs. The values can store as much UTF-8 as they want, but the keys are constrained to snake case. I don't want to be too prescriptive on key length, but 256 characters max should be more than sufficient. You could maaaybe exceed that if you were specifying parameters for individual tensors and needed to include their full name, but I think that's a problem we'll deal with when we get to it.

By the way, we're having this discussion across several issues. @ggerganov do you mind if I create one issue on …?
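As a rough sketch of what the proposed key convention could look like on disk (purely illustrative; none of these helpers exist in llama.cpp): a length-prefixed, snake_case ASCII key with no NUL terminator, mirroring how tensor names are already stored:

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Illustrative only: validate a snake_case ASCII key (max 256 chars).
static bool is_valid_key(const std::string & key) {
    if (key.empty() || key.size() > 256) return false;
    for (const char c : key) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
        if (!ok) return false;
    }
    return true;
}

// Illustrative only: write the key as a u32 length prefix followed by the raw
// ASCII bytes, with no NUL terminator - the same shape as tensor names.
static void write_key(FILE * f, const std::string & key) {
    const uint32_t len = (uint32_t) key.size();
    fwrite(&len, sizeof(len), 1, f);
    fwrite(key.data(), 1, key.size(), f);
}
```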
@philpax …
I would like to solve this. Can anyone please guide me?
Expected Behavior
When quantizing with llama.cpp, the quantization version should be written to the `ftype` in the hyperparameters.

Current Behavior
An `ftype` is produced by `llama_model_quantize_internal` and is passed through as-is to `llama_file_saver`, which writes it to disk without encoding it using `GGML_QNT_VERSION`:

llama.cpp/llama.cpp, lines 2052 to 2068 in ac7876a
llama.cpp/llama.cpp, line 557 in ac7876a
Loaders which are expecting the quantization version, like llm, detect a quantization version of 0.
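A minimal sketch of the kind of fix being asked for (not an actual llama.cpp patch; the constants mirror ggml.h and the helper name is made up): fold `GGML_QNT_VERSION` into the `ftype` before it is written, so loaders can recover both halves with divide/modulo:

```cpp
#include <cstdint>

#define GGML_QNT_VERSION        2    // mirrors ggml.h: current quantization format version
#define GGML_QNT_VERSION_FACTOR 1000 // mirrors ggml.h: do not change

// Hypothetical helper: e.g. a base ftype of 2 (q4_0) would be written as 2002.
static uint32_t encode_ftype(uint32_t base_ftype) {
    return base_ftype + GGML_QNT_VERSION * GGML_QNT_VERSION_FACTOR;
}
```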
Environment and Context
This was reproduced on ac7876a. I initially detected this when testing with one of the models on HuggingFace, then re-quantized a model locally to test it for myself.
Steps to Reproduce
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
1. `make`
2. `./quantize ggml-model-f16.bin ggml-model-f16-q4_0.bin q4_0`
3. Observe that the quantization version is not encoded in the `ftype` in the written hyperparameters.