IQ1_XS FTYPE quant strategy #6310
Conversation
LLAMA_FTYPE should be GGML_TYPE there.
From 31 to 32, because IQ1_M will come with 31.
Q2_K embed for GQA4 because it helps Mistral 7b. I didn't test a model with an attn.qkv weight, so it's better to be conservative with a K-quant.
I confirmed your other values, but this one is wrong: I get a different result.
I didn't change the rms_norm_epsilon value when testing. I will download a fresh Llama 2 7B and remake an fp16 to retest; I need to test the Q4_K output anyway. Note: I closed/reopened the PR by mistake. :X
Edit: My results for IQ1_S are now in line with yours for Llama 2 7b. I'm retesting IQ1_XS now.
Edit 2: Llama 2 scores corrected. Now I'm moving on to the output tensor.
Please update the Python constants too (used e.g. by gguf-dump.py).
I might be wrong, but looking at the code, I think it applies only to GGML_TYPE (tensor quantization), not to LLAMA_FTYPE (quantization mix strategy).
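For readers following this distinction, here is a minimal, purely illustrative C++ sketch of the idea; the enums and the helper below are hypothetical stand-ins, not the actual definitions from llama.h or ggml.h:

```cpp
#include <string>

// Hypothetical stand-ins: an "ftype" names a whole per-model quantization mix,
// while a "ggml type" names the storage format of one individual tensor.
enum class example_ftype     { IQ1_XS, IQ1_S, IQ2_XXS };               // mix strategy
enum class example_ggml_type { IQ1_S, IQ2_XXS, IQ2_XS, IQ4_XS, Q4_K }; // tensor format

// At quantization time the chosen ftype is expanded into a ggml type for each
// tensor, based on the tensor's name/role (rules here are purely illustrative).
example_ggml_type pick_type(example_ftype ftype, const std::string & name) {
    if (ftype == example_ftype::IQ1_XS) {
        if (name.find("ffn_") != std::string::npos) return example_ggml_type::IQ1_S;
        if (name == "output.weight")                return example_ggml_type::IQ4_XS;
        return example_ggml_type::IQ2_XS;
    }
    return example_ggml_type::IQ2_XXS; // other ftypes omitted
}
```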
- There's indeed a slight bonus with Q4_K compared to IQ4_XS that is worth taking for such a cheap cost, especially on the K & V attention tensors.
- Obsessing over size doesn't matter much for the smallest models, which are small anyway and deserve an offset toward quality, while the bigger models, which are actually usable, will barely grow in size but will appreciate the slight quality bump offered by Q4_K vs IQ4_XS.
It looks like you're right, and since there's no change to ggml.h, no change to the gguf constants.py is necessary either; sorry for the noise.
The quantization failure that occurred when the token embeddings tensor is quantized in IQ2_S, due to the iMatrix-absence warning issue, is now solved by adding an exception (see the sketch at the end of this comment).
As for the IQ4_XS vs Q4_K question for some tensors/cases, Q4_K is chosen, in line again with @ikawrakow's remarks and with my own concurring recollection of past testing done while preparing this PR, results I had initially dismissed in an approach too focused on shrinking size.
On my side, I think this PR is ready for pre-merge review.
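For illustration only, here is a minimal C++ sketch of the kind of exception described above, assuming a hypothetical check that requires an importance matrix for very low-bit tensor types; the function and its structure are my own, not the actual llama.cpp code:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical: require an imatrix for very low-bit tensor types, except for
// the token embeddings tensor, which is allowed to proceed without one.
void check_imatrix_requirement(const std::string & tensor_name,
                               bool has_imatrix,
                               bool is_very_low_bit_type) {
    const bool is_token_embd = tensor_name == "token_embd.weight";
    if (is_very_low_bit_type && !has_imatrix && !is_token_embd) {
        throw std::runtime_error(
            "importance matrix required to quantize " + tensor_name + " at this bit width");
    }
    // token_embd.weight falls through: quantization proceeds without an imatrix.
}
```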
After further testing, this PR can still be improved (Arc and Winogrande results, maybe using IQ1_M for some tensors), so I'm converting it to a draft; time to dig a bit more.
I'm sorry, I'm quite new to quantization technology and I'm curious whether these techniques offer improvements for other bit widths as well. I'm aware that the low-bit quantization performance of larger models is generally better than the high-precision quantization results of smaller models. However, I'm uncertain which approach has more potential from a cost-effectiveness perspective in the future: digging below 2 bits, or optimizing 2-3 bits. I've personally conducted a subjective test using a 72b model (qwen1.5-72b-chat), and I've found that the difference in model performance above iq3_s precision is imperceptible. However, how do we assess the loss in performance, from a human perspective, between iq1_xs and iq2_m?
I found during my tests that the sweet spot for the quality/size ratio was between 2.3 and 2.5 bpw. Quality/speed is something I didn't test, and it is of course very impactful. At 2.5 bpw, and aside from coding, which usually requires more precision according to what I read, we can start to really use a model (MoE, 70b, even 34b to some extent), even if the quality is of course lower than at 3 bpw+. All my tests are aimed at finding the best quality/size spots for quantization strategies with the GGML quants provided by @ikawrakow.
As for the future, it depends on the model and hardware you run, but 2.3-2.5 bpw in particular, and 2-3 bpw in general, is, in my uneducated opinion, the place to dig until new SOTA 1.58 bpw quants (@ikawrakow mentioned that recently) appear and maybe shift the best quality/size game toward lower quant strategies.
As for the testing, take a look at LocalLLaMA on Reddit; there are a bunch of people running tests there. And you can simply try the models with different quants and the same prompt, or use the benchmarks included in llama.cpp (Arc, Winogrande, Hellaswag, etc.) to get some measurements between different quants.
- IQ4_XS output for models with fewer than 8 experts or less than GQA 8
- granularity for the QKV tensor when it exists
- also, drop attn.k.weight for Mistral & Yi from IQ2_XS to IQ2_XXS
attn.v.weight in Q4_K for all MoEs & models with GQA4, for Mistral (perplexity at 4096 ctx benefits quite a lot), and incidentally for CodeLlama 34b (which is for coding anyway and isn't exploitable in IQ1 quants). Yi 34b gets IQ3_S for now; more tests are needed due to huge perplexity increases with IQ4_XS and Q4_K on attn.v.weight with my test model (Kyllene 1.1).
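To make the conditions above concrete, here is a hedged C++ sketch of how such per-architecture rules might be expressed; the struct, field names, enum, and thresholds are assumptions for illustration, not the actual selection code in llama.cpp:

```cpp
#include <cstdint>

// Hypothetical summary of the model properties the rules above key on.
struct model_info {
    uint32_t n_expert;   // number of MoE experts (0 or 1 for dense models)
    uint32_t n_head;     // attention heads
    uint32_t n_head_kv;  // KV heads; n_head / n_head_kv gives the GQA factor
};

enum class tensor_type { IQ2_XS, IQ3_S, IQ4_XS, Q4_K };

// attn.v.weight: bumped to Q4_K for MoE models and GQA-4 models such as Mistral.
tensor_type pick_attn_v(const model_info & m) {
    const uint32_t gqa = m.n_head_kv ? m.n_head / m.n_head_kv : 1;
    if (m.n_expert >= 2 || gqa == 4) return tensor_type::Q4_K;
    return tensor_type::IQ2_XS; // default, purely illustrative
}

// output.weight: IQ4_XS unless the model reaches 8 experts or a GQA factor of 8
// (the larger bump for those models is my reading of the rule, i.e. an assumption).
tensor_type pick_output(const model_info & m) {
    const uint32_t gqa = m.n_head_kv ? m.n_head / m.n_head_kv : 1;
    if (m.n_expert < 8 && gqa < 8) return tensor_type::IQ4_XS;
    return tensor_type::Q4_K;
}
```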
Considering that @ikawrakow's IQ quants brought us the SOTA quantization available as of now, and yet being disappointed by the lack of usability of the IQ1_S model quant below 70b (and even at 70b), I wondered if there couldn't be a better "mixed" strategy to further improve the quality/size ratio of the sub-2bpw model quants and bring them in line with the other IQ LLAMA_FTYPE strategies.
I tested a lot of combinations, drawing on the known quantization-mix patterns, an extremely basic understanding of which tensor/weight does what, and some sense of proportion, and here is a slightly different model quant strategy using the current IQ1_S GGML_TYPE, one that can easily be scaled upward from this LLAMA_FTYPE IQ1_XS model quantization. I have already done the scaling-up to an IQ1_S replacement candidate, to follow soon if the approach is approved, and it's very satisfactory, not to mention the incoming IQ1_M GGML_TYPE, which will improve the IQ1_S FTYPE further and allow a scaled IQ1_M FTYPE after that.
The IQ1_XS strategy is as follows:
IQ1_XS PR vs IQ1_S "Even Better" master:
Perplexity at 512 ctx:
Perplexity at 4096 ctx:
I didn't work much on Mistral Instruct 7b 0.2; there's a small quality regression on this model, roughly proportionate to the reduced size.
Llama 2 70b IQ1_XS is also likely to be very close to the current IQ1_S (I bumped attn.k.weight from IQ2_XS to IQ2_S since my last test, and I saw a 1.5% perplexity bump vs the current IQ1_S "Even Better"), at 1.65-1.66 bpw instead of 1.69 bpw.
Such a strategy already scales well in the interval between FTYPE IQ1_S and IQ2_XXS, and there's also a more elusive but nevertheless real margin of progress beyond that.
Then, the new IQ1_M proposed by @ikawrakow (thanks again!) will help a lot in the attempt to get a really usable 2.0/sub-2bpw quant strategy, and IQ 4.5+bpw and IQ 5+bpw quants, if made available, could help as well to further refine the small tensors and the output tensor!
At the end of the day, the IQ1_S GGML_TYPE is VERY useful to quantize the ffn tensors, especially up.ffn and gate.ffn (there are also some experiments to do on these two by varying their ratio around a down.ffn "pillar"), which represent most of the size of a model and are the least sensitive to low-bpw quantization, while the smaller tensors are much more sensitive and can be beefed up without much size increase.
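As a back-of-the-envelope illustration of why this split pays off, here is a small self-contained C++ sketch using Mistral-7B-like shapes (d_model 4096, d_ffn 14336, 32 layers, GQA with 1024-wide K/V projections) and approximate bits-per-weight figures; all numbers are assumptions for illustration, not measurements from this PR:

```cpp
#include <cstdio>

// Rough size accounting for a Mistral-7B-shaped model.
// Tensor shapes and bpw figures are approximate assumptions.
int main() {
    const double d_model = 4096, d_ffn = 14336, d_kv = 1024, n_layers = 32;

    const double ffn     = 3 * d_model * d_ffn   * n_layers; // up, gate, down
    const double attn_qo = 2 * d_model * d_model * n_layers; // attn.q + attn.output
    const double attn_kv = 2 * d_model * d_kv    * n_layers; // attn.k + attn.v (GQA)

    const double to_mib = 1.0 / (8.0 * 1024 * 1024);          // bits -> MiB

    // The ffn tensors dominate the weight count, so IQ1_S (~1.56 bpw) there sets the size...
    printf("ffn      @ ~1.56 bpw: %7.0f MiB\n", ffn     * 1.56 * to_mib);
    printf("attn.q+o @ ~2.31 bpw: %7.0f MiB\n", attn_qo * 2.31 * to_mib);
    // ...while bumping the small attn.k/attn.v tensors costs comparatively little.
    printf("attn.k+v @ ~2.31 bpw: %7.0f MiB\n", attn_kv * 2.31 * to_mib);
    printf("attn.k+v @ ~4.50 bpw: %7.0f MiB\n", attn_kv * 4.50 * to_mib);
    return 0;
}
```

With these assumed shapes, going from ~2.31 to ~4.5 bpw on attn.k/attn.v adds roughly 70 MiB to a model whose ffn tensors alone weigh around a gibibyte at IQ1_S.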
Tests and feedback will be appreciated!
Footnote: this is my first "real" PR. I don't know much about code, sorry for the bulky formatting. I had to choose between the two different approaches (per quant strategy / per tensor), and I chose the first one because a "small tree" is the most obvious logical shape for me!
Edit: Llama 2 7b scores are corrected.
Edit: IQ4_XS tensors pushed to Q4_K to focus on quality with a minor size increase.