IQ3_S: a much better alternative to Q3_K #5676
Conversation
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
After all the experimentation, nothing was better than this.
Performance is very similar to Q3_K_S
I see the ROCm builds failing. It is claiming to not know about
The page you linked says it isn't supported (HIP column is empty).
Great work! Btw, wouldn't it be more pertinent to rename the new Q3_K_XS to IQ3_XS, to avoid confusion, considering that it's exactly where an IQ3_XS should sit while this naming convention is still available?
Impressive work! Thanks again @ikawrakow. I agree with the former poster that it might be a good idea to think about naming conventions and possibly even have “v1/v2/v3” etc. once small improvements are made to an existing format. Would also be cool if we could find a way to optimize for ARM / Apple silicon.
@@ -196,6 +196,17 @@ static __device__ __forceinline__ int __vsub4(const int a, const int b) {
    return __vsubss4(a, b);
}

static __device__ __forceinline__ unsigned int __vcmpeq4(unsigned int a, unsigned int b) {
    const uint8x4_t& va = reinterpret_cast<const uint8x4_t&>(a);
diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 92f9309b..9729ad73 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -172,6 +172,7 @@
#endif
typedef int8_t int8x4_t __attribute__((ext_vector_type(4)));
+typedef uint8_t uint8x4_t __attribute__((ext_vector_type(4)));
static __device__ __forceinline__ int __vsubss4(const int a, const int b) {
const int8x4_t va = reinterpret_cast<const int8x4_t&>(a);
const int8x4_t vb = reinterpret_cast<const int8x4_t&>(b);
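For context, here is a minimal sketch (my own, not necessarily the exact code that was merged) of how the byte-wise `__vcmpeq4` emulation started in the review excerpt above could be completed for the HIP build, using the `uint8x4_t` typedef added in the diff:

```cpp
// Sketch only: byte-wise emulation of CUDA's __vcmpeq4 for HIP/ROCm builds,
// continuing the fragment quoted above. Each byte of the result is 0xff where
// the corresponding bytes of a and b are equal, and 0x00 otherwise.
static __device__ __forceinline__ unsigned int __vcmpeq4(unsigned int a, unsigned int b) {
    const uint8x4_t & va = reinterpret_cast<const uint8x4_t &>(a);
    const uint8x4_t & vb = reinterpret_cast<const uint8x4_t &>(b);
    unsigned int c;
    uint8x4_t & vc = reinterpret_cast<uint8x4_t &>(c);
#pragma unroll
    for (int i = 0; i < 4; ++i) {
        vc[i] = va[i] == vb[i] ? 0xff : 0x00;
    }
    return c;
}
```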
The build fails with the smaller super blocks (make clean && LLAMA_QKK_64=1 make -j tests && ./tests/test-quantize-fns).
So, now it compiles and even works on ARM_NEON and AVX2 with
Pleasant surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards.
rocmforiq3.txt

@ikawrakow ROCm works now, good job. Some perplexity for reference:
For the NEON performance, have you considered making the table something you can compute (vectorized) instead of doing lookups? The bytes in the lookup table only have 8 distinct values, which makes them possibly a product of a shuffle instruction. If you start with x in [0, 512), you can do (x * 123605091) & 0x07070707 and that will give you 512 distinct sets of shuffle indices. (They don't match the exact shuffle indices required to make your table, but I bet you could get somewhat close by searching through values to replace 123605091. Is the exact structure of that table important? About 20% of the numbers you plug in there end up with 512 distinct results, so you'd have a lot of potential codebooks to choose from.) And you could do all 8 of these lookups at once in one big register, I think.
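For anyone who wants to try this, here is a tiny stand-alone check (my own sketch, not part of the PR) that simply counts how many distinct masked products a multiplier produces; the constant 123605091 from the comment above is used as the example input:

```cpp
// Sketch: count the distinct values of (x * mult) & 0x07070707 for x in [0, 512).
// If the count is 512, each x maps to a unique set of four 0..7 "shuffle indices".
#include <cstdint>
#include <cstdio>

int main() {
    const uint32_t mult = 123605091u;  // multiplier suggested in the comment above
    uint32_t vals[512];
    for (uint32_t x = 0; x < 512; ++x) {
        vals[x] = (x * mult) & 0x07070707u;  // keep the 3 low bits of each byte
    }
    int distinct = 512;
    for (int i = 0; i < 512; ++i) {
        for (int j = 0; j < i; ++j) {
            if (vals[i] == vals[j]) { --distinct; break; }
        }
    }
    std::printf("distinct masked products: %d\n", distinct);
    return 0;
}
```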
@PeterReid The order of the values in the lookup table is not important, but the specific values are. Each
So, these 512 values are not random at all, and the quantization error strongly depends on the selection. This is a 3-bit quantization, so a straightforward 3-bit quantization would allow only 8 distinct values. Instead, here we have 16 distinct values (taking the sign into account), but with the restriction that only a subset of the possible combinations of 16 values is allowed.
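To make that structure concrete, here is a purely illustrative sketch (my own; the names, layout, and placeholder values are not the actual IQ3_S format): a group of 4 weights picks one of 512 allowed magnitude patterns, and a sign per weight turns the 8 magnitudes into 16 effective values, while the codebook restricts which magnitude combinations may occur together:

```cpp
// Illustration only - not the real IQ3_S decoding code or data layout.
#include <cstdint>

// 512 allowed magnitude patterns for a group of 4 weights (values are placeholders).
static const uint8_t kcodebook[512][4] = { /* ... */ };

// index: 9-bit codebook index, signs: one bit per weight, scale: block scale.
static inline void decode_group_of_4(uint32_t index, uint32_t signs, float scale, float out[4]) {
    for (int i = 0; i < 4; ++i) {
        const float mag = kcodebook[index][i];                 // one of 8 distinct magnitudes
        out[i] = scale * (((signs >> i) & 1) ? -mag : mag);    // sign bit -> 16 distinct values
    }
}
```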
Good stuff, the new Q3_K_XS is better.
Do you happen to have those statistics in a way you can send? I would like to see how much worse the closest weighted distance would be if the codebook is constrained to be one of the ~800 million (some of those would be duplicates) that can be generated with this method.
Agreed - I still mostly use the legacy Q4_0 and Q5_0 quants because they run fast on my Tesla P40. It would be nice to have a summary of all of the new quantization types so I know which ones to explore. I got a little discouraged after Q2_K halved in token generation speed on my GPU, and since then I haven't experimented much with sub-4-bit quants.
@PeterReid Here is one such sample file with statistics. It is just binary data containing 4096 ints that are the counts for the 4096 possible combinations. In C++, simply:
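A minimal sketch of such a reader (my own; it assumes plain 32-bit little-endian ints and uses a hypothetical filename):

```cpp
// Sketch: read 4096 int32 counts, one per possible combination, and print the
// non-zero ones. The filename "stats.bin" is a placeholder.
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("stats.bin", std::ios::binary);
    std::vector<int32_t> counts(4096);
    in.read(reinterpret_cast<char *>(counts.data()), counts.size() * sizeof(int32_t));
    if (!in) { std::fprintf(stderr, "failed to read 4096 ints\n"); return 1; }
    for (int i = 0; i < 4096; ++i) {
        if (counts[i] > 0) std::printf("%4d  %d\n", i, counts[i]);
    }
    return 0;
}
```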
But I did not mention the last step in the process, which is the most tedious and lengthy: after generating a new set of points (codes), I go and change the lookup table and run a bunch of perplexity calculations to see how the new codebook performs. I always use the 7B and 13B models of LLaMA-v1 and LLaMA-v2 along with Mistral-7B. If the results look promising, I also run LLaMA-1-30B and Mixtral-8x7B. If that also looks promising, I run LLaMA-v2-70B and possibly LLaMA-v1-65B. Without this verification step it doesn't work: more often than not, I made a tweak to the codebook generation that looked better (lower mean squared distance for codes not in the codebook), only to get a worse PPL than with the codebook that looked worse on paper. If you can find a way to encode this exact set of 512 entries via some clever trick, that would be great!
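As an illustration of the "closest weighted distance" criterion being discussed here (my own sketch, not the actual quantization code; the function name and representation are assumptions), one way to score a candidate 512-entry codebook against the 4096 counts from the sample file:

```cpp
// Sketch: count-weighted squared distance from every possible 4-weight magnitude
// pattern (3 bits each, 4096 total) to its nearest codebook entry. Lower is
// better on paper, but as noted above only a PPL run is conclusive.
#include <climits>
#include <cstdint>
#include <vector>

double codebook_score(const std::vector<uint16_t> & codebook,   // 512 pattern indices
                      const std::vector<int32_t>  & counts) {   // 4096 observed counts
    auto val = [](uint16_t p, int i) { return int((p >> (3 * i)) & 7); };  // i-th 3-bit magnitude
    double num = 0.0, den = 0.0;
    for (int p = 0; p < 4096; ++p) {
        int best = INT_MAX;
        for (uint16_t c : codebook) {
            int d = 0;
            for (int i = 0; i < 4; ++i) { const int t = val(uint16_t(p), i) - val(c, i); d += t * t; }
            if (d < best) best = d;
        }
        num += double(counts[p]) * best;
        den += counts[p];
    }
    return den > 0 ? num / den : 0.0;
}
```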
That's strange that you got Q2_K that slow on a P40.
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
IQ4_NL might be a bit slower, but it's 1.2 GB less with quite a bit better perplexity.
KL-divergence data for Mistral-7B (over wikitext)
Thank you for the details on your process, and that file! I am quite certain that I will not be able to generate that exact codebook, but am mildly hopeful that I will be able to generate one as good.
Looking at the 3-bit quants in @Artefact2’s plot, it looks to me like the low 3-bit quants are holding up rather well in terms of top token compared with the high 3-bit quants. The difference is bigger when considering the median divergence, though. I would conclude that the difference between low and high quants in the 3-bit range matters less at lower temperatures, like one would use for coding and logic. On the other hand, maybe it’s just a flatter curve overall rather than a true difference in shape.
EDIT: It appears this bug occurs when using more than 2 threads.
I am getting an invalid memory read of kmap_q3xs (looks like Line 11140 in 1289408).
assert crash (8 threads)
segfault crash (3 threads)
@dranger003 I downloaded your model and imatrices but I cannot reproduce the problem. There is definitely no race in the code (else all quantization types would be affected, as the multi-threading mechanism is exactly the same for all quants). Not sure what the issue might be.
@ikawrakow Thanks for looking into it, appreciate it. For now, I only get the issue with that specific model and I can still get it to work if I run the quantize multiple times until it goes through, so I think this is fine. I'll report back if this becomes a larger issue.
* iq4_nl: squash commits for easier rebase
* Basics (quantize, dequantize)
* CUDA dequantize and dot product
* Slightly faster CUDA dot product (120 t/s)
* Switch to 6-bit scales
* Scalar dot product
* AVX2 dot product
* ARM_NEON dot product
* Works on metal, but still slow
* Slightly better Metal dot product
* Another small Metal improvement
* Metal dot product is getting there
* Faster CUDA dot product
* Add 1/8 ffn_down layers as Q5_K when no imatrix has been provided
* Report the actual bpw
* Add _xs mix that is 4.05 bpw for non-MoE models
* Remove IQ4_XS for now, slightly adjust kvalues_iq4nl
* AVX2 dot product uses Q8_0 instead of Q8_K
* Add to test-backend-ops
* Minor fix
* Also use Q5_K for attn_output in MoE models
* Fixes after merging latest master
* Switching to blocks of 32
* AVX2 for blocks of 32
* Scalar dot product for blocks of 32
* ARM_NEON dot product for blocks of 32
* Metal kernels for blocks of 32
* Slightly faster Metal kernels
* Resurrecting iq3_xs. After all the experimentation, nothing was better than this.
* Minor PPL improvement via a block scale fudge factor
* Minor improvement via 3 neighbours
* iq3_xs: working scalar and AVX2 dot products
* iq3_xs: ARM_NEON dot product - works but extremely slow (10 t/s)
* iq3_xs: working Metal implementation
* Adding IQ3_M - IQ3_XS mix with mostly Q4_K
* iq3_xs: a 3.4375 bpw variant
* iq3_xs: make CUDA work for new version
* iq3_xs: make scalar and AVX2 work for new version
* iq3_s: make ARM_NEON work with new version
* iq3_xs: make new version work on metal. Performance is very similar to Q3_K_S.
* iq3_xs: tiny Metal speed improvement
* iq3_xs: tiny Metal speed improvement
* Fix stupid warning
* Q3_K_XS now uses a mix of IQ3_XS and IQ3_XXS
* iq3_xs: rename to iq3_s
* iq3_s: make tests pass
* Move Q3_K_XS mix to 3.25 bpw
* Attempt to fix failing tests
* Another attempt to fix the Windows builds
* Attempt to fix ROCm
* ROCm again
* iq3_s: partial fix for QK_K = 64
* iq3_s: make it work on metal for QK_K = 64. Pleasant surprise: the coding was super-block size independent, so all it took was to delete some QK_K == 256 guards.
* Will this fix ROCm?

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This PR adds `IQ3_S`, a 3.4375 bpw quantization (i.e., the exact same size as `Q3_K`) that has a significantly lower PPL compared to `Q3_K_S` (see below).

In addition:
* The `Q3_K_XS` quantization mix (a mix of `Q3_K`, `Q2_K` and `Q4_K`) is replaced with a simpler and much better mix of `IQ3_XXS` and `IQ3_S` with an approximate bpw of 3.25.
* The PR adds `IQ3_M`, a mix between the new `IQ3_S` and `Q4_K`. It has basically the same PPL as the existing `Q3_K_M` at 0.15 bpw less.

The graph shows a summary of PPL results for LLaMA-v1 and LLaMA-v2 models, plus Mistral-7B. Each point represents the ratio of the quantized PPL to the PPL of the base (`fp16`) model. The x-axis is bpw - bits-per-weight - excluding the bpw added by the higher-bit quantization of the `output.weight` tensor. The magenta circles show the results for `Q3_K_S`, the orange ones for the new `IQ3_S`. The improvement in quantization error (defined as `PPL(Q)/PPL(fp16)-1`) is 40-70%, depending on the model. The cyan circles represent the existing `Q3_K_M` quantization mix. The dark green circles are for the new `IQ3_M`, showing the ~0.15 bpw saving for essentially the same quantization error. The new `Q3_K_XS` mix, shown in indigo, is designed to sit roughly in the middle between `IQ3_XXS` and `IQ3_S` in terms of bpw. The dashed line is for visual guidance (it connects the averages of the data points at each bpw).

Inference performance of the new `IQ3_S` quants is similar to `Q3_K` on CUDA (RTX-4080), AVX2 (Ryzen 7950X), and Metal (30-core M2 Max). Performance on the M2 Max CPU with ARM_NEON intrinsics is pathetic - only about 10 t/s for a 7B model compared to 22.5 t/s for `Q3_K_S`. The `IQ` series of quants use "codebooks" to encode groups of 4 or 8 weights. For `IQ3_S` this requires 4 memory loads from a lookup table of 2048 bytes to set up one 128-bit SIMD register. It seems Apple Silicon does not like this very much. Let's hope that someone more knowledgeable than me will be able to optimize it.

The extra 0.375 bits per weight spent compared to `IQ3_XXS` are due, among other format differences, to `IQ3_S` using a codebook with 512 entries instead of 256 for `IQ3_XXS`. This adds 1 bit per 4 weights, so 0.25 bpw.
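As a back-of-the-envelope check of that accounting (my own, not from the PR text): the larger codebook costs one extra index bit per group of 4 weights, the stated 0.375 bpw gap implies 3.0625 bpw for `IQ3_XXS`, and the remaining 0.125 bpw comes from the other differences between the two formats:

$$
\frac{\log_2 512 - \log_2 256}{4} = \frac{1}{4} = 0.25 \text{ bpw}, \qquad
3.4375 - 0.375 = 3.0625 \text{ bpw}, \qquad
0.375 - 0.25 = 0.125 \text{ bpw}.
$$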
If this PR is accepted, one could retire the `Q3_K` quants. I haven't done that mainly for two reasons:
* `Q3_K` quants
* `IQ3_S`