Fix IQ1_S quantization #6287

Draft

CISC wants to merge 1 commit into master from iq1s-quantize-fix
Conversation

CISC (Contributor) commented Mar 24, 2024

This is a suggested fix for the latest IQ1_S quantization update, which fails catastrophically with certain imatrices.

Testing by @ikawrakow shows that PPL suffers a little, but there don't seem to be any real-world adverse effects on quantizations that were not showing the catastrophic results earlier. I would therefore like to open a discussion and hopefully organize wider testing by anyone willing to help. My initial thoughts go to @dranger003, @Nexesenex, @schmorp and @Artefact2.

See the linked issue for details.

Fixes #5996

Nexesenex (Contributor) commented Mar 26, 2024

I'm in the middle of my own testing of new sub-2bpw quant strategies (which quant type is applied to each tensor for an optimal result; the PR is here: #6310), and in my opinion a minor perplexity loss on the GGML type IQ1_S is not a problem if it fixes catastrophic issues on some models.

The smaller a weight is (and, incidentally, a model), the more sensitive it is to quantization. The perplexity loss noted in #5996 is probably mainly related to some sort of "loss amplification" on attn_k.weight, then on attn_q.weight, which are, in that order (due to their respective sizes), the tensors most harmed by any regression in the "efficiency" of a GGML type, IQ1_S in our case (which is used for attn_q and attn_k in the current IQ1_S LLAMA_FTYPE).

The FFN weights are much less sensitive because they are the biggest. On big models (the only ones usable with IQ1_S for now), it's pointless to leave attn_k.weight on IQ1_S, and attn_q.weight becomes big enough to endure a little loss in IQ1_S (ideally it would be quantized with IQ2 quants, or with the incoming IQ1_M).
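
To illustrate the kind of per-tensor strategy described above, here is a minimal, hypothetical sketch; the helper name and the concrete type choices are assumptions for illustration only, not the actual llama.cpp ftype-to-tensor mapping:

```cpp
// Hypothetical sketch only -- not the real llama.cpp quantization code.
// Small attention tensors, which amplify quantization error the most, are
// bumped to higher-precision types, while the large FFN tensors stay on IQ1_S.
#include <string>

enum class QuantType { IQ1_S, IQ1_M, IQ2_XXS };

// choose_type_for_tensor is a made-up helper name; the concrete type choices
// below are assumptions for illustration.
static QuantType choose_type_for_tensor(const std::string & name) {
    if (name.find("attn_k.weight") != std::string::npos) return QuantType::IQ2_XXS;
    if (name.find("attn_q.weight") != std::string::npos) return QuantType::IQ1_M;
    // e.g. ffn_down.weight, ffn_up.weight, ffn_gate.weight
    return QuantType::IQ1_S;
}
```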

CISC closed this on Mar 26, 2024
CISC force-pushed the iq1s-quantize-fix branch from fc0cae1 to 557410b on Mar 26, 2024 at 17:53
CISC reopened this on Mar 26, 2024
CISC (Contributor, Author) commented Mar 26, 2024

@Nexesenex Thank you for the feedback, much appreciated. Also looking forward to your new quant mixes. :)

I've updated the PR to include the new IQ1_S changes, and have also applied the same fix to IQ1_M. Although my imatrix does not trigger the same issue there, the change does not seem to have adverse effects on IQ1_M either, so this is basically just playing it safe.

Funny thing though: the gibberish has changed nature with the latest build. I'm not sure what is causing the change, but this PR still fixes it. However, even though it fixes the gibberish, the model no longer outputs the correct answer (even when using the old quantized file), so something really did change. I think this whole PR needs a lot more testing to figure out what is really going on here. :(
It was the new repeat penalty default (1.0/disabled) that caused it; setting it explicitly back to 1.1 restores the old behaviour.

mofosyne added the Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs) and bugfix (fixes an issue or bug) labels on May 10, 2024
Development

Successfully merging this pull request may close these issues:

New IQ1_S somehow much worse than previous version (#5996)
3 participants