Fix IQ1_S quantization #6287

Draft

CISC wants to merge 1 commit into master from iq1s-quantize-fix
Conversation

CISC (Contributor) commented Mar 24, 2024

This is a suggested fix for the latest IQ1_S quantization update, which fails catastrophically with certain imatrices.

Testing by @ikawrakow shows that PPL suffers a little, but there don't seem to be any real-world adverse effects on quantizations that were not showing the catastrophic results earlier. I would therefore like to open a discussion and hopefully organize wider testing by anyone willing to help. My initial thoughts go to @dranger003, @Nexesenex, @schmorp and @Artefact2.

See the linked issue for details.

Fixes #5996

Nexesenex (Contributor) commented Mar 26, 2024

I'm in the middle of my own testing of new sub-2bpw quant strategies (which quant type is applied to each tensor for an optimal result; the PR is here: #6310), and in my opinion a minor perplexity loss on the GGML type IQ1_S is not a problem if it fixes catastrophic issues on some models.

The smaller a weight is (and, incidentally, a model), the more sensitive it is to quantization. The perplexity loss noted in #5996 is probably mainly related to some sort of "loss amplification" on attn_k.weight, then on attn_q.weight, which are, in that order (due to their respective sizes), the tensors most harmed by any regression in the "efficiency" of a GGML type, IQ1_S in our case (which is used for attn_q and attn_k in the current IQ1_S LLAMA_FTYPE).

The FFN weights are much less sensitive because they are the biggest. On big models (the only ones usable with IQ1_S for now), it's pointless to leave attn_k.weight on IQ1_S, and attn_q.weight becomes big enough to endure a little loss in IQ1_S (ideally it would be quantized with IQ2 quants, or with the incoming IQ1_M).
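
To illustrate the kind of per-tensor strategy described above, here is a minimal, hypothetical sketch; the helper name and the concrete type choices are assumptions for illustration only, not the actual llama.cpp ftype-to-tensor mapping:

```cpp
// Hypothetical sketch only -- not the real llama.cpp quantization code.
// Small attention tensors, which amplify quantization error the most, are
// bumped to higher-precision types, while the large FFN tensors stay on IQ1_S.
#include <string>

enum class QuantType { IQ1_S, IQ1_M, IQ2_XXS };

// choose_type_for_tensor is a made-up helper name; the concrete type choices
// below are assumptions for illustration.
static QuantType choose_type_for_tensor(const std::string & name) {
    if (name.find("attn_k.weight") != std::string::npos) return QuantType::IQ2_XXS;
    if (name.find("attn_q.weight") != std::string::npos) return QuantType::IQ1_M;
    // e.g. ffn_down.weight, ffn_up.weight, ffn_gate.weight
    return QuantType::IQ1_S;
}
```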

CISC closed this on Mar 26, 2024
CISC force-pushed the iq1s-quantize-fix branch from fc0cae1 to 557410b on Mar 26, 2024 at 17:53
CISC reopened this on Mar 26, 2024
CISC (Contributor, Author) commented Mar 26, 2024

@Nexesenex Thank you for the feedback, much appreciated. Also looking forward to your new quant mixes. :)

I've updated the PR to include the new IQ1_S changes, and have also applied the same fix to IQ1_M. Although my imatrix does not trigger the same issue there, the change does not seem to have adverse effects on IQ1_M either, so this is basically just playing it safe.

Funny thing though: the gibberish has changed nature with the latest build. I'm not sure what is causing the change, but this PR still fixes it. However, even though it fixes the gibberish, the model no longer outputs the correct answer (even when using the old quantized file), so something really did change. I think this whole PR needs a lot more testing to figure out what is really going on here. :(
It was the new repeat penalty default (1.0/disabled) that caused it; setting it explicitly back to 1.1 restores the old behaviour.

mofosyne added the Review Complexity : High (generally requires in-depth knowledge of LLMs or GPUs) and bugfix (fixes an issue or bug) labels on May 10, 2024
Development

Successfully merging this pull request may close these issues:

New IQ1_S somehow much worse than previous version (#5996)
3 participants