
k_quants tuning for Falcon-7b #2816

Merged: 2 commits merged into master from ik/fix_cuda_qkk64 on Aug 27, 2023
Conversation

@ikawrakow (Contributor) commented Aug 26, 2023

Falcon-7b requires using k-quants super-blocks of QK_K=64 instead of the usual QK_K=256 (LLAMA_QKK_64=ON when building). This PR:

  • Partially fixes the CUDA implementation. The fix is partial in the sense that it now compiles and runs correctly, but only when built with LLAMA_CUDA_FORCE_DMMV=ON and run with -nommq ("CUDA does not build when QK_K = 64" #2815). There are also many warnings when compiling ggml-cuda.cu.
  • Adds fine-tuning of the various k_quants quantization mixes for Falcon-7b. This is necessary because of the different architecture (different set of tensors per layer) and because the relative quantization sizes differ between QK_K = 256 and QK_K = 64.
  • Forces Q8_0 quantization of the output.weight tensor for Falcon models for all quantization types (a minimal sketch of this rule follows the list). This makes a huge difference for Q4/5_0/1. For instance, Q4_0 perplexity improves from 8.3948 without the changes in this PR to 7.2451 with them! For Q5_0 the change is from 7.4725 to 7.1605 (the Falcon-7b fp16 perplexity is 7.1213).
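
To make the output.weight rule concrete, here is a minimal standalone C++ sketch of the idea. This is not the actual llama.cpp code (the real change is visible in the diff excerpt near the bottom of this page); QuantType, ModelArch, pick_type and QK_K_SKETCH are made-up names for illustration only.

#include <cstdint>
#include <cstdio>
#include <string>

enum QuantType { Q4_0, Q6_K, Q8_0 };
enum ModelArch { ARCH_LLAMA, ARCH_FALCON };

// QK_K is 64 when llama.cpp is built with LLAMA_QKK_64=ON, 256 otherwise.
constexpr int64_t QK_K_SKETCH = 64;

// Pick the quantization type for a tensor: the output tensor of Falcon models
// (or any output tensor whose row size is not a multiple of QK_K) is forced to Q8_0.
QuantType pick_type(ModelArch arch, const std::string & name, int64_t ne0, QuantType requested) {
    if (name == "output.weight") {
        if (arch == ARCH_FALCON || ne0 % QK_K_SKETCH != 0) {
            return Q8_0;
        }
    }
    return requested;
}

int main() {
    // Falcon-7b: output.weight becomes Q8_0, whatever mix was requested.
    printf("falcon: %d\n", pick_type(ARCH_FALCON, "output.weight", 4544, Q4_0)); // prints 2 (Q8_0)
    // LLaMA: the requested type is kept in this sketch.
    printf("llama:  %d\n", pick_type(ARCH_LLAMA,  "output.weight", 4096, Q4_0)); // prints 0 (Q4_0)
}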

Some observations:

  • k_quants below Q3_K_M are not really viable for Falcon-7b.
  • Q4/5_0/1 are highly competitive with the k_quants when the output.weight tensor is quantized with Q8_0.
  • The perplexity difference between matrix multiplications using BLAS (-nommq) and the quantized implementation is much bigger than for the LLaMA models. For instance, for Q4_0 the perplexity with -nommq is 0.031 lower, which I think is not acceptable. In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable). Perhaps we should consider reverting "CUDA: use mul_mat_q kernels by default" (#2683) so quantized matrix multiplications are opt-in rather than the default?

The following graph shows perplexity scores for Falcon-7B for different quantization types using this PR. All calculations were run with -nommq.

[Graph ppl_7b: Falcon-7B perplexity by quantization type]

@JohannesGaessler (Collaborator) left a comment


Sorry for causing more work for you; I thought I had checked QK_K=64 but it seems I forgot. I would have fixed it myself, but I haven't worked on llama.cpp in the last few days.

Commit "Make ggml-cuda.cu build with QK_K = 64": Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces a meaningful result.

@ikawrakow changed the title from "Make ggml-cuda.cu build with QK_K = 64" to "k_quants tuning for Falcon-7b" on Aug 27, 2023
@ikawrakow marked this pull request as ready for review on Aug 27, 2023 at 09:45
@klosax (Contributor) commented Aug 27, 2023

Perhaps we should consider reverting #2683 so quantized matrix multiplications are opt-in rather than the default?

Yes, MMQ should not be the default, at least not for Falcon. With MMQ enabled, some hellaswag tests fail completely because of NaNs. See PR #2765.

@JohannesGaessler (Collaborator)

In comparison, for LLaMA-v2-7B the difference is 0.006 (which is also quite big for my taste, but borderline acceptable).

Keep in mind, though, that mul_mat_q reduces VRAM usage and thus allows you to run a better quantization. So I would argue that with the same hardware you can still achieve better perplexity.

Perhaps we should consider reverting #2683 so quantized matrix multiplications are opt-in rather than the default?

The overwhelming majority of users are running LLaMA-based models and I think the defaults should reflect that. So I think mul_mat_q should remain the default.

@JohannesGaessler (Collaborator) commented Aug 27, 2023

I just remembered: the mul_mat_q kernels are compiled unconditionally with -use_fast_math (in the Makefile) but I don't think I checked perplexity in detail when I added it. So maybe this is the reason?

@ggerganov (Owner) commented Aug 27, 2023

I just remembered: the mul_mat_q kernels are compiled unconditionally with -use_fast_math (in the Makefile) but I don't think I checked perplexity when I added it. So maybe this is the reason?

This is highly likely to be causing problems. On Metal, building with -ffast-math caused NaNs in the GELU kernel, which took me a long time to figure out. Here is the relevant part of the Metal spec:

[Screenshot: excerpt from the Metal Shading Language Specification]

https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf

One can explicitly use "precise" math functions by calling precise::tanh which overrides -ffast-math:

[Screenshot: spec excerpt on the precise:: math functions]

Simply changing the kernel to use precise::tanh fixed Metal inference (commit 0a85ae7):

llama.cpp/ggml-metal.metal, lines 91 to 102 at 1591e2e:

kernel void kernel_gelu(
        device const float * src0,
        device       float * dst,
        uint tpig[[thread_position_in_grid]]) {
    float x = src0[tpig];

    // BEWARE !!!
    // Simply using "tanh" instead of "precise::tanh" will sometimes results in NaNs!
    // This was observed with Falcon 7B and 40B models
    //
    dst[tpig] = 0.5f*x*(1.0f + precise::tanh(SQRT_2_OVER_PI*x*(1.0f + GELU_COEF_A*x*x)));
}

@ikawrakow (Contributor, Author)

I'm not sure what -ffast-math or the precision of tanh on Metal has to do with the precision loss due to quantizing intermediate results, which is the root cause of the increased perplexity when using quantized matrix multiplications. But given the various arguments being made, perhaps it would be better to open an issue under Discussions?
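
For context, the point about quantizing intermediate results can be illustrated with a small standalone C++ toy. This is not ggml's actual kernel code; the block size and per-block scaling are only loosely modelled on Q8_1, and all names here are made up. It compares a plain float dot product (roughly what the BLAS path, -nommq, ends up computing against dequantized weights) with the same dot product where the activations are first rounded to 8-bit values, one scale per block of 32:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

int main() {
    const int n = 4096, block = 32;   // toy sizes; one scale per block of 32 values
    std::mt19937 rng(42);
    std::normal_distribution<float> dist(0.0f, 1.0f);

    std::vector<float> w(n), x(n);    // w: (dequantized) weights, x: activations
    for (int i = 0; i < n; ++i) { w[i] = dist(rng); x[i] = dist(rng); }

    // Float dot product: roughly what the BLAS (-nommq) path computes.
    double exact = 0.0;
    for (int i = 0; i < n; ++i) exact += (double) w[i] * x[i];

    // Quantize the activations to int8 with one scale per block, then
    // accumulate against the rounded values: the rounding error remains.
    double approx = 0.0;
    for (int b = 0; b < n / block; ++b) {
        float amax = 0.0f;
        for (int i = 0; i < block; ++i) amax = std::max(amax, std::fabs(x[b*block + i]));
        const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        for (int i = 0; i < block; ++i) {
            const int8_t q = (int8_t) std::lround(x[b*block + i] / scale);
            approx += (double) w[b*block + i] * (q * scale);
        }
    }

    printf("float dot = %.6f, quantized-activation dot = %.6f, abs diff = %.6f\n",
           exact, approx, std::fabs(exact - approx));
}

The small per-dot-product error printed at the end accumulates over a full matrix multiplication, and it is independent of whether fast-math is enabled.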

@ikawrakow merged commit a6d1189 into master on Aug 27, 2023 (24 checks passed)
@ikawrakow deleted the ik/fix_cuda_qkk64 branch on Aug 27, 2023 at 12:20
akawrykow pushed a commit to akawrykow/llama.cpp that referenced this pull request Aug 29, 2023
* Make ggml-cuda.cu build with QK_K = 64

Using LLAMA_CUDA_FORCE_DMMV = ON and -nommq it runs and produces
a meaningful result.

* k_quants tuning for Falcon-7b

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
@@ -4762,7 +4762,10 @@ static void llama_model_quantize_internal(const std::string & fname_inp, const s

         if (name == tn(LLM_TENSOR_OUTPUT, "weight")) {
             int nx = tensor->ne[0];
-            if (nx % QK_K == 0) {
+            if (model.arch == LLM_ARCH_FALCON || nx % QK_K != 0) {
                 new_type = GGML_TYPE_Q8_0;
A Collaborator commented on these lines:
Why don't we use Q8_0 when GGML_USE_K_QUANTS is disabled?
