k_quants tuning for Falcon-7b #2816
Conversation
Sorry for causing more work for you; I thought I had checked `QK_K = 64`, but it seems I forgot. I would have fixed it myself, but I haven't worked on llama.cpp in the last few days.
Using `LLAMA_CUDA_FORCE_DMMV = ON` and `-nommq` it runs and produces a meaningful result.
Force-pushed from f547c58 to 061f777
Keep in mind that mul_mat_q reduces VRAM usage and thus allows you to run a better quantization, though. So I would argue that with the same hardware you can still achieve better perplexity.
The overwhelming majority of users are running LLaMA-based models and I think the defaults should reflect that. So I think mul_mat_q should remain the default.
I just remembered: the …
This is highly likely to be causing problems. On Metal, shaders are built with fast math enabled by default, which lets the compiler substitute faster, less accurate math routines (see the Metal Shading Language Specification: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf). One can explicitly use "precise" math functions by calling the `precise::` variants. Simply changing the kernel to use the precise versions (lines 91 to 102 in 1591e2e) would be a quick way to check this.
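To illustrate the idea, here is a minimal Metal Shading Language sketch (MSL is C++-based). The kernel name and buffer layout are hypothetical and not taken from ggml-metal.metal; only the `precise::` call is the point:

```cpp
// Hypothetical Metal kernel, for illustration only.
#include <metal_stdlib>
using namespace metal;

kernel void exp_row(device const float * src [[buffer(0)]],
                    device       float * dst [[buffer(1)]],
                    uint tid [[thread_position_in_grid]]) {
    // With fast math enabled at shader-compile time (the Metal default), plain
    // exp() may be lowered to a faster, less accurate approximation:
    // dst[tid] = exp(src[tid]);

    // The precise:: namespace requests the accurate version explicitly,
    // independent of the fast-math setting:
    dst[tid] = precise::exp(src[tid]);
}
```

The trade-off is speed, so one would typically switch only the calls where accuracy actually matters.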
I'm not sure what …
* Make ggml-cuda.cu build with QK_K = 64. Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces a meaningful result.
* k_quants tuning for Falcon-7b

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
```diff
@@ -4762,7 +4762,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
             int nx = tensor->ne[0];
-            if (nx % QK_K == 0) {
+            if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
+                new_type = GGML_TYPE_Q8_0;
```
Why don't we use Q8_0 when GGML_USE_K_QUANTS is disabled?
Falcon-7b requires using k-quants super-blocks of `QK_K = 64` instead of the usual `QK_K = 256` (`LLAMA_QKK_64=ON` when building). This PR makes `ggml-cuda.cu` build with `QK_K = 64`; to get it running on CUDA one needs to build with `LLAMA_CUDA_FORCE_DMMV=ON` and run with `-nommq` (CUDA does not build when QK_K = 64 #2815). There are also many warnings when compiling `ggml-cuda.cu`.
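The smaller super-block is needed because k-quants pack weights in blocks of `QK_K` values, so every row quantized with a k-quant type must have a size that is a multiple of `QK_K`. A tiny stand-alone check makes that concrete, assuming Falcon-7b's hidden size of 4544 (71 heads of dimension 64, a figure not stated in this PR):

```cpp
// Minimal sketch, not llama.cpp code: why Falcon-7b needs QK_K = 64.
#include <cstdio>
#include <initializer_list>

int main() {
    const int falcon7b_hidden = 4544; // assumption: 71 heads * 64 head dim
    for (int qk_k : {256, 64}) {
        std::printf("QK_K = %3d: %d %% %d = %3d -> %s\n",
                    qk_k, falcon7b_hidden, qk_k, falcon7b_hidden % qk_k,
                    falcon7b_hidden % qk_k == 0 ? "divisible, usable" : "not divisible");
    }
    return 0;
}
```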
The k_quants tuning covers both `QK_K = 256` and `QK_K = 64`. The PR also adds `Q8_0` quantization of the `output.weight` tensor for Falcon models for all quantization types. This makes a huge difference for `Q4/5_0/1`. For instance, `Q4_0` perplexity becomes 7.2451 from 8.3948 without the changes in this PR! For `Q5_0` the change is from 7.4725 to 7.1605 (Falcon-7b perplexity for `fp16` is 7.1213).
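For reference, here is a hedged, self-contained C++ sketch of the resulting `output.weight` type selection, mirroring the diff above. The function wrapper and the `Q6_K` fallback are my reading of the visible diff context, not verbatim llama.cpp code:

```cpp
// Sketch only: mirrors the logic in this PR's diff, with assumed surroundings.
#include <cstdint>

enum ggml_type { GGML_TYPE_Q6_K, GGML_TYPE_Q8_0 /* ... */ };
enum llm_arch  { LLM_ARCH_LLAMA, LLM_ARCH_FALCON /* ... */ };

// QK_K is 256 by default and 64 when building with LLAMA_QKK_64=ON.
constexpr int QK_K = 64;

ggml_type output_weight_type(llm_arch arch, int64_t nx /* row size */) {
    // Falcon models always get Q8_0 for output.weight; per this PR that is a
    // large perplexity win (e.g. Q4_0: 8.3948 -> 7.2451). Q8_0 is also the
    // fallback whenever the row size is not a multiple of the k-quants
    // super-block size QK_K.
    if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
        return GGML_TYPE_Q8_0;
    }
    // Assumption: otherwise the usual 6-bit k-quant is kept for output.weight.
    return GGML_TYPE_Q6_K;
}
```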
Some observations:
- `Q3_K_M` and below are not really viable for Falcon-7b.
- `Q4/5_0/1` are highly competitive with the k_quants when the `output.weight` tensor is quantized with `Q8_0`.
- The perplexity difference between the cuBLAS-based matrix multiplications (`-nommq`) and the quantized implementation is much bigger compared to the LLaMA models. For instance, for `Q4_0`, perplexity with `-nommq` is 0.031 lower, which I think is not acceptable. In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable). Perhaps we should consider reverting CUDA: use mul_mat_q kernels by default #2683 so quantized matrix multiplications are opt-in rather than the default?

The following graph shows perplexity scores for Falcon-7B for different quantization types using this PR. All calculations were run with `-nommq`.