IQ1_M: 1.75 bpw quantization #6302
Conversation
Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.
We get PPL(LLaMA-v2-7B) = 9.2810 and PPL(LLaMA-v2-13B) = 6.8105. Not bad, but slightly higher than sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)), which is the expected outcome given that IQ1_M is halfway between IQ1_S and IQ2_XXS in terms of bpw. From this, we would expect PPL = 9.14 for LLaMA-v2-7B and PPL = 6.63 for LLaMA-v2-13B.
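A rough numeric sketch of the interpolation argument above, assuming PPL degrades roughly geometrically with bpw. The IQ1_S and IQ2_XXS perplexities below are placeholders chosen for illustration, not values taken from this thread:

```python
import math

def expected_ppl_halfway(ppl_low_bpw: float, ppl_high_bpw: float) -> float:
    """Expected PPL for a quant halfway (in bpw) between two quants,
    assuming log(PPL) is roughly linear in bpw over this range."""
    return math.sqrt(ppl_low_bpw * ppl_high_bpw)

# Hypothetical reference values (not from this thread): if IQ1_S and IQ2_XXS
# gave these perplexities for LLaMA-v2-7B, the halfway expectation would land
# close to the 9.14 quoted above.
ppl_iq1_s   = 10.3   # hypothetical
ppl_iq2_xxs = 8.1    # hypothetical
print(expected_ppl_halfway(ppl_iq1_s, ppl_iq2_xxs))  # ~9.13
```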
There is a slight increase in PPL, but the 0.0625 bpw reduction in size is totally worth it. We now have PPL(LLaMA-v2-7B) = 9.4469 at 1.96 bpw, PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw, and PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw.
Works, but very slow (10.5 t/s)
About the same performance as iq1_s.
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight with Q4_K.
10.5 t/s -> 11.65 t/s
11.65 t/s -> 14.9 t/s
14.9 -> 15.0 t/s
After quantizing the block scales, redo the super-block scale fit. PPL(LLaMA-v2-7B) = 9.3346, PPL(LLaMA-v2-13B) = 6.8419, PPL(LLaMA-v2-70B) = 4.8294, PPL(Mistral-7B) = 8.1624.
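The refit described above is, in spirit, a weighted least-squares problem: with the integer block scales and quantized values frozen, the best super-block scale has a closed-form solution. A minimal numpy sketch under that assumption; the function name and the importance weighting are illustrative, not the actual ggml implementation:

```python
import numpy as np

def refit_superblock_scale(x: np.ndarray, q: np.ndarray, w: np.ndarray) -> float:
    """Given original weights x, the frozen dequantized-but-unscaled values q
    (block scale * codebook value, before the super-block scale), and
    importance weights w, pick the super-block scale d minimizing
    sum(w * (x - d*q)^2). Setting the derivative to zero gives the ratio below."""
    denom = np.sum(w * q * q)
    return float(np.sum(w * q * x) / denom) if denom > 0 else 0.0

# Toy usage with random data
rng = np.random.default_rng(0)
x = rng.normal(size=256)
q = np.round(x * 4) / 4          # stand-in for the frozen quantized values
w = np.ones_like(x)              # stand-in importance weights
d = refit_superblock_scale(x, q, w)
```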
We have progressed to warnings being errors.
@ikawrakow Thank you so much, man! I was almost done with my IQ1_S strategy. Mixtral caused me trouble (it's heavy to requantize it endlessly), but I found my mistake and now it works as intended, with sizeable improvements in perplexity and often in the ARC benches. Tonight or tomorrow I will PR an IQ1_XS LLAMA_FTYPE, which offers almost comparable quality to your current IQ1_S LLAMA_FTYPE with a slight reduction in size, to act as a new "demo of the smallest quant" before being refactored on top of the IQ1_M GGML_TYPE in a later PR. The revamped IQ1_S LLAMA_FTYPE is almost ready as well and will follow shortly after in another PR, before also being refactored on top of the IQ1_M GGML_TYPE. Then I'll (and/or you, and/or anyone, lol) work on a derived IQ1_M LLAMA_FTYPE to make the best sub-2 bpw quant possible.
ggml.h (review comment on an outdated diff)

GGML_TYPE_I16 = 26,
GGML_TYPE_I32 = 27,
GGML_TYPE_I64 = 28,
GGML_TYPE_F64 = 29,
Need to also update the enum in gguf-py/gguf/constants.py (llama.cpp/gguf-py/gguf/constants.py, lines 681 to 708 at deb7240):
class GGMLQuantizationType(IntEnum):
    F32 = 0
    F16 = 1
    Q4_0 = 2
    Q4_1 = 3
    Q5_0 = 6
    Q5_1 = 7
    Q8_0 = 8
    Q8_1 = 9
    Q2_K = 10
    Q3_K = 11
    Q4_K = 12
    Q5_K = 13
    Q6_K = 14
    Q8_K = 15
    IQ2_XXS = 16
    IQ2_XS = 17
    IQ3_XXS = 18
    IQ1_S = 19
    IQ4_NL = 20
    IQ3_S = 21
    IQ2_S = 22
    IQ4_XS = 23
    I8 = 24
    I16 = 25
    I32 = 26
    I64 = 27
    F64 = 28
Also, move GGML_TYPE_IQ1_M to the end of the enum to keep backwards compatibility with any GGUF files that might have already started using the integer or 64-bit types.
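A sketch of what the suggested gguf-py change could look like: the new member is appended after the integer/F64 types rather than inserted before them, so existing numeric values stay stable. The concrete value 29 for IQ1_M is an assumption based on this review comment, not copied from the merged code:

```python
from enum import IntEnum

class GGMLQuantizationType(IntEnum):
    # ... F32 = 0 through IQ4_XS = 23, unchanged from the listing above ...
    I8 = 24
    I16 = 25
    I32 = 26
    I64 = 27
    F64 = 28
    # Appended at the end so values already written into GGUF files
    # (including the integer and 64-bit types) keep their meaning.
    IQ1_M = 29  # assumed value; check the merged constants.py
```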
@ikawrakow, the IQ1_M quant takes about twice as long to quantize as IQ1_S (on an i7-6700K with AVX and AVX2 enabled). Is there anything that can be done about that?
Sorry, I did not see a way to make it more efficient. It is doing 4X the work, so being 2X slower is not too bad. Both ...
I do have another version of IQ1_M [...]. The reason I'm reluctant to make a PR is that it uses an even larger codebook (4096 entries vs 2048 in IQ1_M).
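To make the codebook-size trade-off concrete, here is a hedged sketch of why quantization time scales with codebook size: a brute-force quantizer scores every codebook entry for every group of 8 weights, so going from 2048 to 4096 entries roughly doubles that inner loop. The actual ggml search is more elaborate than this; names and the toy codebook are illustrative only:

```python
import numpy as np

def nearest_codeword(group: np.ndarray, codebook: np.ndarray) -> int:
    """Return the index of the codebook row closest (L2) to `group`.
    Cost is O(len(codebook)) per group, which is why a 4096-entry
    codebook is roughly 2x the work of a 2048-entry one."""
    diffs = codebook - group                              # (n_entries, 8)
    return int(np.argmin(np.einsum("ij,ij->i", diffs, diffs)))

# Toy example: 2048 random ternary-ish codewords of length 8
rng = np.random.default_rng(1)
codebook = rng.choice([-1.0, 0.0, 1.0], size=(2048, 8))
group = rng.normal(size=8)
idx = nearest_codeword(group, codebook)
```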
@ikawrakow I understand that speed on all platforms has its relative importance in the final choices, as size does, but it's a pity to leave such jewels on a shelf! Could you eventually share the quant as a "CUDA-optimized quant" for those interested in using it? Ultimately, even if the "one quant for all archs" approach is pertinent for the sake of optimal compatibility, the differences between architectures should also be accounted for, so that we rely not only on the "common denominator" but also on the "best for each case", in order to have SOTA quants for the broad particular cases, such as CUDA. In my opinion, if llama.cpp doesn't integrate this approach, others eventually will.
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
  Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
  We get PPL(LLaMA-v2-7B) = 9.2810 and PPL(LLaMA-v2-13B) = 6.8105. Not bad, but slightly higher than sqrt(PPL(IQ1_S) * PPL(IQ2_XXS)), which is the expected outcome given that IQ1_M is halfway between IQ1_S and IQ2_XXS in terms of bpw. From this, we would expect PPL = 9.14 for LLaMA-v2-7B and PPL = 6.63 for LLaMA-v2-13B.
* iq1_m: go to 3-bit scales
  There is a slight increase in PPL, but the 0.0625 bpw reduction in size is totally worth it. We now have PPL(LLaMA-v2-7B) = 9.4469 at 1.96 bpw, PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw, PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw.
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
  Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
  About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
  It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight with Q4_K.
* iq1_m: slightly faster ARM_NEON dot product
  10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
  11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
  14.9 -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
  After quantizing the block scales, redo the super-block scale fit. PPL(LLaMA-v2-7B) = 9.3346, PPL(LLaMA-v2-13B) = 6.8419, PPL(LLaMA-v2-70B) = 4.8294, PPL(Mistral-7B) = 8.1624.
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
  We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Bring `GGMLQuantizationType` up to date; adds `I8`, `I16`, `I32`, `I64`, `F64`, `IQ1_M` and `BF16`. Added in:
* ggerganov/llama.cpp#6045
* ggerganov/llama.cpp#6062
* ggerganov/llama.cpp#6302
* ggerganov/llama.cpp#6412
While waiting for the 1.58 bit era...

Compared to IQ1_S:
* Scales are 3 bit, so 3/16 bpw. Along with the fp16 super-block scale this ends up being exactly 1.75 bpw.

The table shows a PPL comparison between IQ1_S and IQ1_M (this PR). Context is 2048 tokens for LLaMA-v1 and 4096 for all other models. The last column shows the rms_norm_epsilon used to generate the PR results.

[PPL comparison table (IQ1_S vs IQ1_M) not recoverable from this page.]

@Nexesenex Looking forward to your improved 2.0 / sub-2.0 bpw quantization mixes.
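For readers checking the arithmetic, a small sketch of how the 1.75 bpw figure can decompose for a 256-weight super-block, using the 3-bit block scales and fp16 super-block scale mentioned above. The split of the remaining bits between low/high index bits and per-group shifts is an assumption about the layout inferred from the commit notes (2048-entry codebook, separate shifts per group of 8), not taken verbatim from this page:

```python
QK_K = 256  # weights per super-block

bits_index_low  = QK_K * 1          # 1 bit per weight of codebook index (assumed)
bits_index_high = (QK_K // 8) * 3   # 3 extra index bits per group of 8 -> 2^11 = 2048 entries (assumed)
bits_shifts     = (QK_K // 8) * 1   # 1 shift bit per group of 8 (assumed)
bits_scales     = (QK_K // 16) * 3  # 3-bit block scales -> 3/16 bpw
bits_superscale = 16                # one fp16 super-block scale -> 1/16 bpw

total_bits = (bits_index_low + bits_index_high + bits_shifts
              + bits_scales + bits_superscale)
print(total_bits / QK_K)  # 1.75

# Going from 4-bit to 3-bit block scales saves (QK_K // 16) bits = 16 bits,
# i.e. 16 / 256 = 0.0625 bpw, matching the reduction quoted in the commit log.
```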