k_quants tuning for Falcon-7b #2816
Conversation
Sorry for causing more work for you; I thought I had checked `QK_K = 64`, but it seems I forgot. I would have fixed it myself, but I haven't worked on llama.cpp in the last few days.
Using `LLAMA_CUDA_FORCE_DMMV = ON` and `-nommq` it runs and produces a meaningful result.
Force-pushed from f547c58 to 061f777
Keep in mind that mul_mat_q reduces VRAM usage and thus allows you to run a better quantization, though. So I would argue that with the same hardware you can still achieve better perplexity.
The overwhelming majority of users are running LLaMA-based models and I think the defaults should reflect that. So I think mul_mat_q should remain the default.
I just remembered: the …
This is highly likely to be causing problems. On Metal, shaders are built with fast math enabled by default, which lets the compiler substitute faster, less accurate math routines (see the Metal Shading Language Specification: https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf). One can explicitly use "precise" math functions by calling the `precise::` variants. Simply changing the kernel to use the precise versions (lines 91 to 102 in 1591e2e) would be a quick way to check this.
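To illustrate the idea, here is a minimal Metal Shading Language sketch (MSL is C++-based). The kernel name and buffer layout are hypothetical and not taken from ggml-metal.metal; only the `precise::` call is the point:

```cpp
// Hypothetical Metal kernel, for illustration only.
#include <metal_stdlib>
using namespace metal;

kernel void exp_row(device const float * src [[buffer(0)]],
                    device       float * dst [[buffer(1)]],
                    uint tid [[thread_position_in_grid]]) {
    // With fast math enabled at shader-compile time (the Metal default), plain
    // exp() may be lowered to a faster, less accurate approximation:
    // dst[tid] = exp(src[tid]);

    // The precise:: namespace requests the accurate version explicitly,
    // independent of the fast-math setting:
    dst[tid] = precise::exp(src[tid]);
}
```

The trade-off is speed, so one would typically switch only the calls where accuracy actually matters.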
I'm not sure what …
* Make ggml-cuda.cu build with QK_K = 64. Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces a meaningful result.
* k_quants tuning for Falcon-7b

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
```diff
@@ -4762,7 +4762,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s
         if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
             int nx = tensor->ne[0];
-            if (nx % QK_K == 0) {
+            if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
+                new_type = GGML_TYPE_Q8_0;
```
Why don't we use Q8_0 when GGML_USE_K_QUANTS is disabled?
Falcon-7b requires using k-quants super-blocks of `QK_K = 64` instead of the usual `QK_K = 256` (`LLAMA_QKK_64=ON` when building). This PR makes `ggml-cuda.cu` build with `QK_K = 64`; to get it running on CUDA one needs to build with `LLAMA_CUDA_FORCE_DMMV=ON` and run with `-nommq` (CUDA does not build when QK_K = 64 #2815). There are also many warnings when compiling `ggml-cuda.cu`.
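The smaller super-block is needed because k-quants pack weights in blocks of `QK_K` values, so every row quantized with a k-quant type must have a size that is a multiple of `QK_K`. A tiny stand-alone check makes that concrete, assuming Falcon-7b's hidden size of 4544 (71 heads of dimension 64, a figure not stated in this PR):

```cpp
// Minimal sketch, not llama.cpp code: why Falcon-7b needs QK_K = 64.
#include <cstdio>
#include <initializer_list>

int main() {
    const int falcon7b_hidden = 4544; // assumption: 71 heads * 64 head dim
    for (int qk_k : {256, 64}) {
        std::printf("QK_K = %3d: %d %% %d = %3d -> %s\n",
                    qk_k, falcon7b_hidden, qk_k, falcon7b_hidden % qk_k,
                    falcon7b_hidden % qk_k == 0 ? "divisible, usable" : "not divisible");
    }
    return 0;
}
```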
The k_quants tuning covers both `QK_K = 256` and `QK_K = 64`. The PR also adds `Q8_0` quantization of the `output.weight` tensor for Falcon models for all quantization types. This makes a huge difference for `Q4/5_0/1`. For instance, `Q4_0` perplexity becomes 7.2451 from 8.3948 without the changes in this PR! For `Q5_0` the change is from 7.4725 to 7.1605 (Falcon-7b perplexity for `fp16` is 7.1213).
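For reference, here is a hedged, self-contained C++ sketch of the resulting `output.weight` type selection, mirroring the diff above. The function wrapper and the `Q6_K` fallback are my reading of the visible diff context, not verbatim llama.cpp code:

```cpp
// Sketch only: mirrors the logic in this PR's diff, with assumed surroundings.
#include <cstdint>

enum ggml_type { GGML_TYPE_Q6_K, GGML_TYPE_Q8_0 /* ... */ };
enum llm_arch  { LLM_ARCH_LLAMA, LLM_ARCH_FALCON /* ... */ };

// QK_K is 256 by default and 64 when building with LLAMA_QKK_64=ON.
constexpr int QK_K = 64;

ggml_type output_weight_type(llm_arch arch, int64_t nx /* row size */) {
    // Falcon models always get Q8_0 for output.weight; per this PR that is a
    // large perplexity win (e.g. Q4_0: 8.3948 -> 7.2451). Q8_0 is also the
    // fallback whenever the row size is not a multiple of the k-quants
    // super-block size QK_K.
    if (arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
        return GGML_TYPE_Q8_0;
    }
    // Assumption: otherwise the usual 6-bit k-quant is kept for output.weight.
    return GGML_TYPE_Q6_K;
}
```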
Some observations:
- `Q3_K_M` and below are not really viable for Falcon-7b.
- `Q4/5_0/1` are highly competitive with the k_quants when the `output.weight` tensor is quantized with `Q8_0`.
- The perplexity difference between the cuBLAS-based matrix multiplications (`-nommq`) and the quantized implementation is much bigger compared to the LLaMA models. For instance, for `Q4_0`, perplexity with `-nommq` is 0.031 lower, which I think is not acceptable. In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable). Perhaps we should consider reverting CUDA: use mul_mat_q kernels by default #2683 so quantized matrix multiplications are opt-in rather than the default?

The following graph shows perplexity scores for Falcon-7B for different quantization types using this PR. All calculations were run with `-nommq`.