IQ1_XS FTYPE quant strategy #6310


Contributor

@Nexesenex Nexesenex commented Mar 26, 2024

@ikawrakow's IQ quants brought us the SOTA quantization available today, yet the IQ1_S model quant remains of limited use below 70b (and even at 70b). So I wondered whether a better "mixed" strategy could further improve the quality/size trade-off of the sub-2bpw model quants and bring them in line with the other IQ LLAMA_FTYPE strategies.

I tested many combinations, starting from the known quantization-mix patterns, a very basic understanding of which tensor does what, and some sense of proportions. The result is a slightly different model quant strategy built on the current IQ1_S GGML_TYPE, which can easily be scaled upward from this IQ1_XS LLAMA_FTYPE. I have already scaled it up into an IQ1_S replacement candidate, to follow soon if this approach is approved, and the results are very satisfactory; the incoming IQ1_M GGML_TYPE should further improve the IQ1_S FTYPE and allow a scaled IQ1_M FTYPE after that.

The IQ1_XS strategy is as follows (a code sketch of this mapping follows the list):

  • The token embedding weight goes from Q2_K to IQ2_S, except for MoEs like Mixtral, which do not seem to like 2-bit IQ quants for this tensor.
  • The quite influential attn.v.weight scales up with the GQA factor, and further with the number of experts, because the relative size of these small tensors shrinks quickly compared to the rest of a GQA/MoE model. The same idea applies to attn.k.weight, but at a lower quantization quality.
  • In this IQ1_XS FTYPE, which would be the smallest FTYPE available for now if accepted, I do not touch the FFN weights, which all remain in IQ1_S.
  • The output weight goes from Q5_K to IQ4_XS, except for the MoEs (for a somewhat similar reason as the embedding weight, although the difference is smaller).
  • For the MoEs, attn.q.weight is pushed to IQ2 quants (XXS, XS or S according to the number of experts): having Q4_K (now IQ4_XS/Q5_K) K/V tensors and only an IQ1_S Q tensor was not optimal on a MoE, as the integrity of the query/key/value triplet seems particularly important there. And since attn.q.weight is shared between the 8 experts, the size increase is minimal.
  • I numbered IQ1_XS as "32" in the llama.h enumeration, since "31" will likely be taken by the incoming IQ1_M FTYPE prepared by @ikawrakow.
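
For readers who want to see what this mapping could look like in code, here is a minimal sketch in the spirit of llama.cpp's per-tensor type selection. It is illustrative only and not this PR's actual diff: the helper name and the exact GQA/expert thresholds are my shorthand for the list above.

```cpp
#include <string>
#include "ggml.h" // for the GGML_TYPE_* enum values

// Illustrative sketch only, NOT this PR's actual diff: one possible shape of the
// IQ1_XS per-tensor mapping described above, in the style of llama.cpp's
// quantization type selection. Thresholds are my reading of the list above.
static ggml_type iq1_xs_tensor_type(const std::string & name, int n_expert, int n_gqa) {
    const bool is_moe = n_expert > 1;
    if (name.find("token_embd.weight") != std::string::npos) {
        return is_moe ? GGML_TYPE_Q2_K : GGML_TYPE_IQ2_S;       // MoEs keep Q2_K here
    }
    if (name.find("output.weight") != std::string::npos) {
        return is_moe ? GGML_TYPE_Q5_K : GGML_TYPE_IQ4_XS;      // later bumped toward Q4_K
    }
    if (name.find("attn_v.weight") != std::string::npos) {
        if (n_expert >= 8 || n_gqa >= 8) return GGML_TYPE_Q4_K; // scales with GQA/experts
        if (n_gqa >= 4)                  return GGML_TYPE_IQ3_S;
        return GGML_TYPE_IQ2_S;
    }
    if (name.find("attn_k.weight") != std::string::npos) {
        if (n_expert >= 8 || n_gqa >= 8) return GGML_TYPE_IQ3_S; // one notch below attn_v
        return GGML_TYPE_IQ2_XS;
    }
    if (is_moe && name.find("attn_q.weight") != std::string::npos) {
        return GGML_TYPE_IQ2_XS;  // IQ2_XXS/XS/S depending on expert count in the PR
    }
    // ffn_up / ffn_gate / ffn_down and everything else stay at the base type
    return GGML_TYPE_IQ1_S;
}
```

The real selection logic in llama.cpp also depends on layer index and on whether an imatrix is present; the point of the sketch is only the tensor-level hierarchy the list above describes.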

IQ1_XS (this PR) vs IQ1_S "Even Better" (master):

Perplexity at 512 ctx:

  • Llama 2 7b : 14.6916 vs 13.8991, at 1.76bpw instead of 1.81bpw
  • Llama 2 7b RMS 1.875e-5 : 14.7964 vs 13.8402, at 1.76bpw instead of 1.81bpw
  • Mistral Instruct 7b v0.2 : 12.1735 vs 11.8538, at 1.72bpw instead of 1.78bpw
  • Yi 34b (Kyllene 1.1) : 9.5278 vs 9.8761, at 1.69bpw instead of 1.74bpw
  • Mixtral Instruct 0.1 : 7.267 vs 7.3085, at 1.63bpw instead of 1.68bpw

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.6731 vs 12.6402 at 1.76bpw instead of 1.81bpw
  • Llama 2 7b RMS 1.875e-5 : 12.4292 vs 11.8383 at 1.76bpw instead of 1.81bpw
  • Mistral Instruct 7b v0.2 : 9.6538 vs 9.2874, at 1.72bpw instead of 1.78bpw
  • Yi 34b (Kyllene 1.1) : 7.6289 vs 7.8954, at 1.69bpw instead of 1.74bpw
  • Mixtral Instruct 0.1 : 6.0612 vs 6.0789, at 1.63bpw instead of 1.68bpw

I didn't work much on Mistral Instruct 7b v0.2; there's a small quality regression on this model, roughly in line with the reduced size.

Llama 2 70b IQ1_XS is also likely to be very close to the current IQ1_S (I bumped attn.k.weight from IQ2_XS to IQ2_S since my last test, and I had a 1.5% perplexity bump vs the current IQ1_S "Even Better"), at 1.65-1.66bpw instead of 1.69bpw.

This strategy already scales well in the interval between the IQ1_S and IQ2_XXS FTYPEs, and there's a more elusive but real margin of progress beyond that.

The new IQ1_M proposed by @ikawrakow (thanks again!) will help a lot in the quest for a truly usable 2.0/sub-2bpw quant strategy, and IQ quants at 4.5+bpw and 5+bpw, if made available, could help refine the small tensors and the output tensor further!

At the end of the day, the IQ1_S GGML_TYPE is VERY useful for quantizing the FFN tensors, especially ffn_up and ffn_gate (there are also experiments to run on these two by varying their ratio around an ffn_down "pillar"): they represent most of a model's size and are the least sensitive to low-bpw quantization, while the smaller tensors are much more sensitive and can be beefed up without much size increase.
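
To make those proportions concrete, here is a tiny back-of-the-envelope sketch using Llama-2-70B's published dimensions and the nominal per-weight sizes of the ggml types involved (IQ1_S ~1.5625 bpw, Q4_K = 4.5 bpw); it is an order-of-magnitude illustration, not a measurement from this PR:

```cpp
#include <cstdio>

int main() {
    // Llama-2-70B published shape: 80 layers, n_embd = 8192, GQA-8 (K/V projections
    // are 8192 x 1024), FFN hidden size 28672.
    const double n_layer = 80;
    const double attn_v  = 8192.0 * 1024;          // params in one attn_v.weight
    const double ffn     = 3.0 * 8192.0 * 28672;   // ffn_up + ffn_gate + ffn_down per layer

    // Cost of bumping only attn_v from IQ1_S (~1.5625 bpw) to Q4_K (4.5 bpw):
    const double extra_bits = (4.5 - 1.5625) * attn_v * n_layer;

    printf("attn_v is %.2f%% of the FFN block per layer\n", 100.0 * attn_v / ffn);
    printf("bumping attn_v from IQ1_S to Q4_K adds ~%.0f MiB over the whole model\n",
           extra_bits / 8.0 / (1024.0 * 1024.0));
    // Prints roughly 1.19% and ~235 MiB, i.e. about +0.03 bpw on a ~69B-parameter
    // model: this is why the small attention tensors can be beefed up cheaply.
    return 0;
}
```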

Tests and feedback will be appreciated!

Footnote: this is my first "real" PR. I don't know much about code, so sorry for the bulky formatting. I had to choose between the two possible approaches (per quant strategy / per tensor), and I chose the first because a "small tree" is the most obvious logical shape for me!

Edit: Llama 2 7b scores are corrected.

Edit: IQ4_XS tensors pushed to Q4_K to focus on quality, with a minor size increase.

@ikawrakow
Contributor

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.7689 vs 13.3220 at 1.76bpw instead of 1.81bpw

I confirmed your other values, but this one is wrong: I get PPL = 11.86 for LLaMA-v2-7B for IQ1_S on master.

@Nexesenex Nexesenex closed this Mar 26, 2024
@Nexesenex Nexesenex reopened this Mar 26, 2024
@Nexesenex
Contributor Author

Nexesenex commented Mar 26, 2024

Perplexity at 4096 ctx:

  • Llama 2 7b : 12.7689 vs 13.3220 at 1.76bpw instead of 1.81bpw

I confirmed your other values, but this one is wrong: I get PPL = 11.86 for LLaMA-v2-7B for IQ1_S on master.

I didn't change the rms_norm_epsilon value when testing. I will download a fresh Llama_2_7B and remake an fp16 to retest; I need to test the Q4_K output anyway.

Note: I closed/reopened the PR by mistake. :X

Edit: Now my results for IQ1_S are in line with yours for Llama 2 7b. I'm retesting IQ1_XS now.

Edit 2: Llama 2 scores corrected. Now I move on to the output tensor.

@Nindaleth
Contributor

Nindaleth commented Mar 26, 2024

Please update the Python constants too (used e.g. by gguf-dump.py).

@Nexesenex
Contributor Author

Please update the Python constants too (used e.g. by gguf-dump.py).

I might be wrong, but looking at the code, I think it applies only to GGML_TYPE (tensor quantization), not to LLAMA_FTYPE (quantization mix strategy); see the abridged enum excerpts at the end of this comment.

- There's indeed a slight bonus with Q4_K compared to IQ4_XS, worth taking for such a cheap cost, especially on the K & V attention tensors.
- Obsessing over size doesn't matter much for the smallest models, which are small anyway and logically deserve an offset toward quality, while the bigger models that are actually usable will barely grow in size yet will appreciate the slight quality bump of Q4_K over IQ4_XS.
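
To illustrate the GGML_TYPE vs LLAMA_FTYPE distinction mentioned above, here are abridged excerpts of the two enums. The numeric values are omitted except the "32" proposed in this PR, and the exact identifier for the new FTYPE is assumed to follow the usual LLAMA_FTYPE_MOSTLY_* pattern; ggml.h and llama.h remain the authoritative lists.

```cpp
// ggml.h: GGML_TYPE describes how a single tensor is encoded.
// gguf-py/gguf/constants.py mirrors this enum, so it only changes when a new
// tensor encoding is added (abridged excerpt, real values omitted).
enum ggml_type {
    GGML_TYPE_Q2_K,
    GGML_TYPE_IQ2_S,
    GGML_TYPE_IQ1_S,
    // ...
};

// llama.h: LLAMA_FTYPE names a whole-model mix of the tensor types above.
// IQ1_XS only adds an entry here, which is why constants.py needs no change.
enum llama_ftype {
    LLAMA_FTYPE_MOSTLY_IQ1_S,            // real value omitted
    LLAMA_FTYPE_MOSTLY_IQ1_XS = 32,      // value proposed in this PR, name assumed
    // ...
};
```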
@Nindaleth
Contributor

It looks like you're right: since there's no change to ggml.h, no change to gguf constants.py is necessary either. Sorry for the noise.

Contributor Author

@Nexesenex Nexesenex left a comment


The quantization failure that occurred when the token embeddings tensor is quantized to IQ2_S, caused by the missing-iMatrix check, is now solved by adding an exception (a sketch of that kind of guard follows below).

As for the IQ4_XS vs Q4_K question for some tensors/cases, Q4_K is chosen, in line with @ikawrakow's remarks and with my own concurring recollection of past testing done while preparing this PR, results I had initially dismissed in an overly size-focused approach.

On my side, I think this PR is ready for pre-merge review.
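
For reference, here is a sketch of the kind of guard described above. It is not this PR's literal change, just one plausible shape for such an exception; the helper name check_imatrix and its signature are mine.

```cpp
#include <stdexcept>
#include <string>
#include "ggml.h" // for ggml_type / GGML_TYPE_*

// Sketch only, not this PR's literal code: low-bit IQ types normally require an
// importance matrix, but token_embd.weight has no imatrix data, so it is
// exempted from the hard failure instead of aborting the whole quantization.
static void check_imatrix(const std::string & name, ggml_type type, bool has_imatrix) {
    const bool needs_imatrix =
        type == GGML_TYPE_IQ1_S  || type == GGML_TYPE_IQ2_XXS ||
        type == GGML_TYPE_IQ2_XS || type == GGML_TYPE_IQ2_S;
    const bool is_token_embd = name.find("token_embd.weight") != std::string::npos;

    if (needs_imatrix && !has_imatrix && !is_token_embd) {
        throw std::runtime_error("imatrix is required to quantize " + name + " to this type");
    }
    // token_embd.weight falls through and is quantized without imatrix data.
}
```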

@Nexesenex Nexesenex requested a review from ikawrakow March 26, 2024 17:33
@Nexesenex Nexesenex changed the title IQ1_XS FTYPE quant strategy attempt IQ1_XS FTYPE quant strategy Mar 26, 2024
@Nexesenex Nexesenex marked this pull request as draft March 27, 2024 12:48
@Nexesenex
Contributor Author

Nexesenex commented Mar 27, 2024

After further testing, this PR can still be improved (ARC and Winogrande results, maybe using IQ1_M for some tensors), so I'm converting it to a draft; time to dig a bit more.

@DesperateZero

I'm sorry, I'm quite new to quantization technology, and I'm curious whether these techniques offer improvements for other bit widths as well. I'm aware that low-bit quantization of larger models generally performs better than high-precision quantization of smaller models. However, I'm uncertain which approach has more potential from a cost-effectiveness perspective in the future: digging below 2 bits or optimizing 2-3 bits. I've personally run a subjective test with a 72b model (qwen1.5-72b-chat), and I found that differences in model performance above iq3_s precision are imperceptible. But how do we assess the performance loss, from a human perspective, between iq1_xs and iq2_m?

@Nexesenex
Contributor Author

Nexesenex commented Mar 28, 2024

I'm sorry, I'm quite new to quantization technology, and I'm curious whether these techniques offer improvements for other bit widths as well. I'm aware that low-bit quantization of larger models generally performs better than high-precision quantization of smaller models. However, I'm uncertain which approach has more potential from a cost-effectiveness perspective in the future: digging below 2 bits or optimizing 2-3 bits. I've personally run a subjective test with a 72b model (qwen1.5-72b-chat), and I found that differences in model performance above iq3_s precision are imperceptible. But how do we assess the performance loss, from a human perspective, between iq1_xs and iq2_m?

During my tests, I found the sweet spot for quality/size to be between 2.3 and 2.5bpw. Quality/speed is something I didn't test, and it is of course very impactful.

At 2.5bpw, and aside from coding, which usually requires more precision from what I read, we can start to really use a model (MoE, 70b, even 34b to some extent), even if the quality is of course lower than at 3bpw+. All my tests aim to find the best quality/size spots for quantization strategies with the GGML quants provided by @ikawrakow.

As for the future, it depends on the model and hardware you run, but 2.3-2.5bpw in particular, and 2-3bpw in general, is in my uneducated opinion the place to dig until new SOTA 1.58bpw quants (@ikawrakow mentioned that recently) appear and maybe shift the best quality/size game toward lower-bpw strategies.

As for testing, take a look at LocalLLaMA on Reddit; plenty of people run tests there. You can also simply try the models with different quants and the same prompt, or use the benchmarks included in llama.cpp (ARC, Winogrande, HellaSwag, etc.) to get some measurements across quants.

- IQ4_XS output for models with fewer than 8 experts or GQA 8
- granularity for the QKV tensor when it exists
- also, drop attn.k.weight for Mistral & Yi from IQ2_XS to IQ2_XXS
@Nexesenex Nexesenex marked this pull request as ready for review March 29, 2024 12:15
attn.v.weight in Q4_K for all MoEs & models with GQA-4, Mistral (PPL at 4096 ctx benefits quite a lot), and incidentally CodeLlama 34b (which is for coding anyway and isn't exploitable in IQ1 quants).
Yi 34b gets IQ3_S for now; more tests are needed due to huge perplexity increases with IQ4_XS and Q4_K for attn.v.weight on my test model (Kyllene 1.1).
@mofosyne mofosyne added generation quality Quality of model output Review Complexity : High Generally require indepth knowledge of LLMs or GPUs labels May 10, 2024
Contributor

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 539 iterations 🚀

Details (for performance-related PRs only):
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8633.01ms p(95)=21269.17ms fails=, finish reason: stop=475 truncated=64
  • Prompt processing (pp): avg=103.51tk/s p(95)=470.11tk/s
  • Token generation (tg): avg=32.18tk/s p(95)=45.58tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=Nexesenex-IQ1_XS-IQ1_S-quant-strategies commit=e4ac8ae720847b674c681e3b6218fc1c67683725

Charts omitted (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 539 iterations): prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing.

@Nexesenex Nexesenex marked this pull request as draft August 10, 2024 18:30
@Nexesenex Nexesenex closed this Aug 11, 2024