Quantization of transformer state for matrix-vector products potentially causes numerical accuracy issues #4755
Comments
This is incorrect. It is true that the CUDA backend can use cuBLAS to do matrix-matrix multiplications in FP16/FP32. But cuBLAS is only used on Volta or newer and only if the batch size is > 32. In all other cases mul_mat_q is used, a matrix-matrix multiplication where the hidden state is quantized to q8_1. This implementation should be equivalent to the CPU implementation within rounding error. In particular, the example you provided does not use cuBLAS GEMM. It's still possible that there is something wrong with the CPU implementation, but the quantization of the hidden state is not the cause.
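A rough sketch of that dispatch, just to make the description concrete (the function and constant names are illustrative, not the actual llama.cpp symbols):

```cpp
// Illustrative sketch only: cuBLAS FP16/FP32 GEMM is chosen on Volta or newer
// and only for batch sizes above 32; in every other case mul_mat_q is used,
// which quantizes the hidden state to q8_1 and multiplies in integer arithmetic.
bool use_cublas_gemm(int compute_capability, int batch_size) {
    const int cc_volta       = 700; // Volta corresponds to compute capability 7.0
    const int min_batch_size = 32;
    return compute_capability >= cc_volta && batch_size > min_batch_size;
}
```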
I am using the Metal GPU backend. If I am reading this correctly, it dequantizes here: Line 3904 in f3f62f0
Okay, I don't know what the Metal code does. In any case, this is the output I get with the CUDA code that quantizes the hidden state to q8_1:
You only posted the first few tokens from your result, but I am not observing numerical issues for this test case.
This seems very similar to my observations in #2421. I still don't have a good understanding of why this happens, but it seems LLaMA v2 is much more susceptible than LLaMA v1 for some reason. Given that CUDA quantizes the hidden state but does not reproduce the behaviour, the root cause might be somewhere else. Anyway, looking forward to further analysis.
I retract my previous statement. I did another test where I compiled with
So the issue seems to be CPU matrix multiplication after all. Specifically, it seems the issue only occurs if the non-OpenBLAS matrix multiplication is used for q4_K tensors. So there may be something subtly wrong with the matrix multiplication for either the format itself or the specific matrices where it is used. But I get good outputs with q3_K_L, so I think the problem is q4_K matrix multiplication.
In another test q3_K_M also works as expected despite containing q4_K tensors; this issue is difficult to pin down.
I still think that the quantization is the culprit. This gives a hint that the spikes might actually be the problem: block size 32 is less sensitive to spikes. I ran some tests and can confirm that just changing the quantization to q8_1 instead of q8_K gets rid of the "surely" token.
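As a minimal illustration of why the smaller block is less sensitive, here is a standalone numerical sketch (not actual llama.cpp code; the absmax rounding scheme and the example values are assumptions):

```cpp
// Quantize a hidden-state vector containing one spike to int8 with an absmax
// scale per block, once with block size 256 (q8_K-like) and once with block
// size 32 (q8_1-like), and compare the round-trip error.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static double rms_roundtrip_error(const std::vector<float> & x, int block_size) {
    double err2 = 0.0;
    for (size_t b = 0; b < x.size(); b += block_size) {
        float amax = 0.0f;
        for (int i = 0; i < block_size; ++i) amax = std::max(amax, std::fabs(x[b + i]));
        const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        for (int i = 0; i < block_size; ++i) {
            const float deq = scale * std::round(x[b + i] / scale);
            err2 += (x[b + i] - deq) * (x[b + i] - deq);
        }
    }
    return std::sqrt(err2 / x.size());
}

int main() {
    std::vector<float> x(256, 0.01f); // mostly small activations
    x[100] = 8.0f;                    // a single spike, as in the plots in the issue
    printf("block size 256: rms error %g\n", rms_roundtrip_error(x, 256));
    printf("block size  32: rms error %g\n", rms_roundtrip_error(x,  32));
    // With a block size of 256 the spike sets the scale for all 256 values and the
    // small activations round to zero; with a block size of 32 only the spike's own
    // block is affected.
    return 0;
}
```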
I am currently trying to do a CUDA implementation for matrix multiplication that utilizes int8 tensor cores. A major issue is that loading the results from tensor cores has terrible performance. So I will soon try an implementation where the inputs are quantized as int8 but with a single scale per row/column. If the issue is in fact that the CPU quantization block size is too large, the quality of this implementation should be bad; I'll report back when I have a working prototype.
I implemented a prototype for 8-bit quantization of the hidden state with only a single scale per column. The resulting implementation has very similar issues to the CPU implementation. This is the output that I get:
In this case the prompt was "The Julia programming language.". More generally, every time a period appears there is a high likelihood that the next token is garbage, even if the prompt does not end with a period. So I think this really is an issue related to numerics and the large block size in the CPU backend.
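For reference, a sketch of what such a single-scale-per-column quantization looks like (assumed details, not the actual CUDA kernel):

```cpp
// The entire hidden-state column shares a single absmax scale, so one spike
// degrades the resolution of every element in the column.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct quantized_column {
    std::vector<int8_t> q; // int8 values
    float               d; // single scale for the whole column
};

static quantized_column quantize_column_int8(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    quantized_column out;
    out.d = amax > 0.0f ? amax / 127.0f : 1.0f;
    out.q.reserve(x.size());
    for (float v : x) out.q.push_back((int8_t) std::lround(v / out.d));
    return out;
}
```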
I've been thinking: if a block size of 256 for quantizing the hidden state really causes garbage tokens upon punctuation, how do we know that a block size of 32 isn't still causing some form of damage? Does llama.cpp have a built-in way of looking at token probabilities?
I generated some samples for the prompt "The Julia programming language." using either
I did some more testing: when using a single scale per hidden-state column, I need 10 bits per weight to get a continuation for "The Julia programming language." using 7b q8_0 that isn't garbage:
For 11 or more bits per weight I always get:
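A small standalone sketch of why a larger bit count helps when a single scale has to cover both a spike and the ordinary activations (the rounding scheme and example values are assumptions, not the prototype's actual code):

```cpp
// With a single scale per column set by a spike of magnitude 8, an ordinary
// activation of 0.01 is flushed to zero at 8 bits and only survives the
// round trip once enough bits are available.
#include <cmath>
#include <cstdio>

int main() {
    const float spike   = 8.0f;   // hypothetical spike that sets the scale
    const float typical = 0.01f;  // hypothetical ordinary activation
    for (int bits = 8; bits <= 12; ++bits) {
        const float qmax  = (float) ((1 << (bits - 1)) - 1);
        const float scale = spike / qmax;
        const float deq   = scale * std::round(typical / scale);
        printf("%2d bits: %.2f round-trips to %g\n", bits, typical, deq);
    }
    return 0;
}
```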
Not sure if it is exactly what you need, but the default site of the
How come the GPU can do what the 10-bit solution does on the CPU? 🤔
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I noticed that sometimes a very odd token is generated at the beginning when using the CPU backend.
Example (CPU, zero temperature):
Example (GPU, zero temperature):
The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product
A*x
where A is quantized (e.g. q4) and x is not quantized (e.g. float32). The CPU backend quantizes x to q8 and then computes the product using an optimized vecdot(q4, q8) routine.
The GPU backend dequantizes A to float32 and then computes the product using float32.
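A simplified sketch of the two paths to make the difference concrete (assumed block layouts, not the actual ggml kernels):

```cpp
// A is stored in quantized blocks, x is float32. The CPU path quantizes x
// block-wise to 8 bits and accumulates in integers; the GPU path dequantizes
// A and accumulates in float32.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK = 32; // block size of the simplified quantized format

struct block_q4 { float d; int8_t q[QK]; }; // simplified 4-bit block, q in [-8, 7]

// CPU-style path: quantize x to q8 per block, then do an integer dot product.
static float vecdot_q4_q8(const std::vector<block_q4> & a_row, const std::vector<float> & x) {
    float sum = 0.0f;
    for (size_t b = 0; b < a_row.size(); ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[b*QK + i]));
        const float dx = amax > 0.0f ? amax / 127.0f : 1.0f;

        int32_t acc = 0;
        for (int i = 0; i < QK; ++i) {
            const int8_t qx = (int8_t) std::lround(x[b*QK + i] / dx); // x is quantized here
            acc += (int32_t) a_row[b].q[i] * qx;
        }
        sum += a_row[b].d * dx * (float) acc;
    }
    return sum;
}

// GPU-style path: dequantize A to float32 and accumulate in float32.
static float vecdot_dequant_f32(const std::vector<block_q4> & a_row, const std::vector<float> & x) {
    float sum = 0.0f;
    for (size_t b = 0; b < a_row.size(); ++b) {
        for (int i = 0; i < QK; ++i) {
            sum += a_row[b].d * (float) a_row[b].q[i] * x[b*QK + i];
        }
    }
    return sum;
}
```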
I tried to find out why these kinds of nonsense tokens are generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse at the initial tokens. Example state in the third layer resulting from the first token:
These spikes amplify the quantization errors in the corresponding blocks:
Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but it seems that the q8 quantization in the CPU backend definitely hurts numerical accuracy of matrix-vector products.
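To illustrate the suspected mechanism end to end, here is a standalone sketch (assumed values and a simplified q8-style scheme with a 256-wide block, not llama.cpp code) comparing a float32 dot product against the same dot product with the hidden state quantized:

```cpp
// A dot product against a spiky hidden state, computed once in float32 and
// once with the hidden state quantized to int8 in one block of 256. The spike
// sets the scale of its block, so the contribution of the small activations
// in that block is lost.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256, block = 256;
    std::vector<float> w(n), x(n, 0.01f);
    for (int i = 0; i < n; ++i) w[i] = (i % 2 == 0) ? 0.5f : -0.25f; // arbitrary weights
    x[100] = 8.0f; // a single spike, as in the states shown above

    // exact float32 dot product
    float exact = 0.0f;
    for (int i = 0; i < n; ++i) exact += w[i] * x[i];

    // dot product with x quantized to int8 per block of 256 (q8_K-like)
    float quant = 0.0f;
    for (int b = 0; b < n; b += block) {
        float amax = 0.0f;
        for (int i = b; i < b + block; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float d = amax / 127.0f;
        for (int i = b; i < b + block; ++i) quant += w[i] * d * std::round(x[i] / d);
    }

    printf("float32 x: %g\n", exact);
    printf("int8 x   : %g (relative error %.1f%%)\n",
           quant, 100.0f * std::fabs(quant - exact) / std::fabs(exact));
    return 0;
}
```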