
Commit b77cdd8

Small changes for IQ2 quant strategies (notably IQ2_S and IQ2_M)
Here are a few edits that I consider useful to slightly improve the IQ2 model quant strategies for some models:

- The tensor attn_v.weight is bumped to Q4_K for models like Gemma (GQA 2) and for the various franken-MoEs with 2 experts, so as not to sabotage them with a too-small value head quant (Q2_K is weak for such an important head), while the size of that head stays low relative to the total size of the affected models.
- The tensor attn_k.weight is bumped to Q4_K for models with 8 experts or more, rather than for exactly 8 experts only.
- The tensor attn_output.weight is set to IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to keep a progression between the IQ2_XS quant strategy (which uses IQ2_XS for attn_output.weight) and the IQ3_XXS quant strategy (which uses IQ3_S for attn_output.weight). The benefit of IQ3_S over IQ3_XXS for that tensor is almost nonexistent at IQ2_S and IQ2_M, especially compared to the size increase it causes.

More broadly, I think the whole IQ2 bunch of quant strategies should be harmonized/refactored the way the rest of the quant strategies are laid out (tensor by tensor), rather than living under a separate decision tree that mixes these 5 quant strategies. I have been using these settings (and many more edits) for a long time, with benefit, and I think they could become standard.
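To see the net effect of these three edits outside the diff below, here is a minimal standalone C++ sketch of the resulting tensor-type choices for the IQ1/IQ2 family. The function name, the enums, and the collapsed ftype grouping are illustrative only (they are not the llama.cpp API), and tensors other than the three discussed above are omitted.

```cpp
#include <cstdint>
#include <string>

// Collapsed view of the relevant ftypes; the real code tests the individual
// LLAMA_FTYPE_MOSTLY_* constants.
enum class Ftype { IQ1_S_OR_M, IQ2_XXS_OR_XS, IQ2_S_OR_M };
enum class Qtype { Q2_K, Q4_K, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S };

// name: tensor name; n_gqa: grouped-query-attention ratio; n_expert: MoE expert count.
static Qtype pick_iq2_type(const std::string & name, Ftype ftype,
                           uint32_t n_gqa, uint32_t n_expert) {
    const bool iq2_sm = (ftype == Ftype::IQ2_S_OR_M);
    if (name.find("attn_v.weight") != std::string::npos) {
        // was n_gqa >= 4 / n_expert >= 4: GQA-2 models (e.g. Gemma) and
        // 2-expert franken-MoEs now also get the Q4_K value head
        if (n_gqa >= 2 || n_expert >= 2) return Qtype::Q4_K;
        return iq2_sm ? Qtype::IQ3_S : Qtype::Q2_K;
    }
    if (n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
        return Qtype::Q4_K;                         // was: n_expert == 8 exactly
    }
    if (name.find("attn_output.weight") != std::string::npos) {
        if (ftype == Ftype::IQ1_S_OR_M) return Qtype::IQ2_XXS;
        if (iq2_sm)                     return Qtype::IQ3_XXS;  // was IQ3_S
        return Qtype::IQ2_XS;                       // IQ2_XXS / IQ2_XS strategies
    }
    return Qtype::Q2_K;  // all other tensors: untouched by this commit, omitted here
}
```

For example, a GQA-2 model such as Gemma now gets Q4_K for attn_v.weight at every strategy in this family, where before it fell back to Q2_K (or IQ3_S at IQ2_S/IQ2_M).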
1 parent e09a800 commit b77cdd8

File tree

1 file changed: +3, -3 lines changed


src/llama.cpp

Lines changed: 3 additions & 3 deletions
@@ -15348,11 +15348,11 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
     } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
                ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
         if (name.find("attn_v.weight") != std::string::npos) {
-            if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
+            if (qs.model.hparams.n_gqa() >= 2 || qs.model.hparams.n_expert >= 2) new_type = GGML_TYPE_Q4_K;
             else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
             ++qs.i_attention_wv;
         }
-        else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
+        else if (qs.model.hparams.n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
             new_type = GGML_TYPE_Q4_K;
         }
         else if (name.find("ffn_down") != std::string::npos) {
@@ -15366,7 +15366,7 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
                 new_type = GGML_TYPE_Q5_K;
             } else {
                 if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
-                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
+                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_XXS;
             }
         }
     } else if (name.find("attn_v.weight") != std::string::npos) {

0 commit comments
