
Commit b77cdd8

Small changes for IQ2 quant strategies (notably IQ2_S and IQ2_M)
Here are a few edits that I consider useful to slightly improve the IQ2 model quant strategies for some models:

- The tensor attn_v.weight is bumped to Q4_K for models like Gemma (GQA 2) and for the various franken-MoEs with 2 experts, so as not to sabotage them with a too-small value head quant (Q2_K is weak for such an important head), while the size of that head stays low relative to the total size of the affected models.
- The tensor attn_k.weight is bumped to Q4_K for models with 8 experts or more, rather than for exactly 8 experts only.
- The tensor attn_output.weight is set to IQ3_XXS (instead of IQ3_S) for the IQ2_S and IQ2_M quant strategies, to keep a progression between the IQ2_XS quant strategy (which uses IQ2_XS for attn_output.weight) and the IQ3_XXS quant strategy (which uses IQ3_S for attn_output.weight). The benefit of IQ3_S over IQ3_XXS for that tensor is almost nonexistent at IQ2_S and IQ2_M, especially compared to the size increase it causes.

More broadly, I think the whole IQ2 bunch of quant strategies should be harmonized/refactored the way the rest of the quant strategies are laid out (tensor by tensor), rather than living under a separate decision tree that mixes these 5 quant strategies. I have been using these settings (and many more edits) for a long time, with benefit, and I think they could become standard.
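To see the net effect of these three edits outside the diff below, here is a minimal standalone C++ sketch of the resulting tensor-type choices for the IQ1/IQ2 family. The function name, the enums, and the collapsed ftype grouping are illustrative only (they are not the llama.cpp API), and tensors other than the three discussed above are omitted.

```cpp
#include <cstdint>
#include <string>

// Collapsed view of the relevant ftypes; the real code tests the individual
// LLAMA_FTYPE_MOSTLY_* constants.
enum class Ftype { IQ1_S_OR_M, IQ2_XXS_OR_XS, IQ2_S_OR_M };
enum class Qtype { Q2_K, Q4_K, IQ2_XXS, IQ2_XS, IQ3_XXS, IQ3_S };

// name: tensor name; n_gqa: grouped-query-attention ratio; n_expert: MoE expert count.
static Qtype pick_iq2_type(const std::string & name, Ftype ftype,
                           uint32_t n_gqa, uint32_t n_expert) {
    const bool iq2_sm = (ftype == Ftype::IQ2_S_OR_M);
    if (name.find("attn_v.weight") != std::string::npos) {
        // was n_gqa >= 4 / n_expert >= 4: GQA-2 models (e.g. Gemma) and
        // 2-expert franken-MoEs now also get the Q4_K value head
        if (n_gqa >= 2 || n_expert >= 2) return Qtype::Q4_K;
        return iq2_sm ? Qtype::IQ3_S : Qtype::Q2_K;
    }
    if (n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
        return Qtype::Q4_K;                         // was: n_expert == 8 exactly
    }
    if (name.find("attn_output.weight") != std::string::npos) {
        if (ftype == Ftype::IQ1_S_OR_M) return Qtype::IQ2_XXS;
        if (iq2_sm)                     return Qtype::IQ3_XXS;  // was IQ3_S
        return Qtype::IQ2_XS;                       // IQ2_XXS / IQ2_XS strategies
    }
    return Qtype::Q2_K;  // all other tensors: untouched by this commit, omitted here
}
```

For example, a GQA-2 model such as Gemma now gets Q4_K for attn_v.weight at every strategy in this family, where before it fell back to Q2_K (or IQ3_S at IQ2_S/IQ2_M).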
1 parent e09a800 commit b77cdd8

File tree

1 file changed: +3, -3 lines changed


src/llama.cpp

Lines changed: 3 additions & 3 deletions
@@ -15348,11 +15348,11 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
     } else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_XXS || ftype == LLAMA_FTYPE_MOSTLY_IQ2_XS || ftype == LLAMA_FTYPE_MOSTLY_IQ1_S ||
                ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) {
         if (name.find("attn_v.weight") != std::string::npos) {
-            if (qs.model.hparams.n_gqa() >= 4 || qs.model.hparams.n_expert >= 4) new_type = GGML_TYPE_Q4_K;
+            if (qs.model.hparams.n_gqa() >= 2 || qs.model.hparams.n_expert >= 2) new_type = GGML_TYPE_Q4_K;
             else new_type = ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M ? GGML_TYPE_IQ3_S : GGML_TYPE_Q2_K;
             ++qs.i_attention_wv;
         }
-        else if (qs.model.hparams.n_expert == 8 && name.find("attn_k.weight") != std::string::npos) {
+        else if (qs.model.hparams.n_expert >= 8 && name.find("attn_k.weight") != std::string::npos) {
             new_type = GGML_TYPE_Q4_K;
         }
         else if (name.find("ffn_down") != std::string::npos) {
@@ -15366,7 +15366,7 @@ static ggml_type llama_tensor_get_type(quantize_state_internal & qs, ggml_type n
                 new_type = GGML_TYPE_Q5_K;
             } else {
                 if (ftype == LLAMA_FTYPE_MOSTLY_IQ1_S || ftype == LLAMA_FTYPE_MOSTLY_IQ1_M) new_type = GGML_TYPE_IQ2_XXS;
-                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_S;
+                else if (ftype == LLAMA_FTYPE_MOSTLY_IQ2_S || ftype == LLAMA_FTYPE_MOSTLY_IQ2_M) new_type = GGML_TYPE_IQ3_XXS;
             }
         }
     } else if (name.find("attn_v.weight") != std::string::npos) {

0 commit comments
