Quantization of transformer state for matrix-vector products potentially causes numerical accuracy issues #4755
Comments
This is incorrect. It is true that the CUDA backend can use cuBLAS to do matrix-matrix multiplications in FP16/FP32. But cuBLAS is only used on Volta or newer and only if the batch size is > 32. In all other cases mul_mat_q is used, a matrix-matrix multiplication where the hidden state is quantized to q8_1. This implementation should be equivalent to the CPU implementation within rounding error. In particular, the example you provided does not use cuBLAS GEMM. It's still possible that there is something wrong with the CPU implementation, but the quantization of the hidden state is not the cause.
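A rough sketch of that dispatch, just to make the description concrete (the function and constant names are illustrative, not the actual llama.cpp symbols):

```cpp
// Illustrative sketch only: cuBLAS FP16/FP32 GEMM is chosen on Volta or newer
// and only for batch sizes above 32; in every other case mul_mat_q is used,
// which quantizes the hidden state to q8_1 and multiplies in integer arithmetic.
bool use_cublas_gemm(int compute_capability, int batch_size) {
    const int cc_volta       = 700; // Volta corresponds to compute capability 7.0
    const int min_batch_size = 32;
    return compute_capability >= cc_volta && batch_size > min_batch_size;
}
```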
I am using the Metal GPU backend. If I am reading this correctly, it dequantizes here: Line 3904 in f3f62f0
Okay, I don't know what the Metal code does. In any case, this is the output I get with the CUDA code that quantizes the hidden state to q8_1:
You only posted the first few tokens from your result, but I am not observing numerical issues for this test case.
This seems very similar to my observations in #2421. I still don't have a good understanding of why this happens, but it seems LLaMA v2 is much more susceptible than LLaMA v1 for some reason. Given that CUDA quantizes the hidden state but does not reproduce the behaviour, the root cause might be somewhere else. Anyway, looking forward to further analysis.
I retract my previous statement. I did another test where I compiled with
So the issue seems to be CPU matrix multiplication after all. Specifically, it seems the issue only occurs if the non-OpenBLAS matrix multiplication is used for q4_K tensors. So there may be something subtly wrong with the matrix multiplication for either the format itself or the specific matrices where it is used. But I get good outputs with q3_K_L, so I think the problem is q4_K matrix multiplication.
In another test q3_K_M also works as expected despite containing q4_K tensors; this issue is difficult to pin down.
I still think that the quantization is the culprit. This gives a hint that the spikes might actually be the problem: block size 32 is less sensitive to spikes. I ran some tests and can confirm that just changing the quantization to q8_1 instead of q8_K gets rid of the "surely" token.
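As a minimal illustration of why the smaller block is less sensitive, here is a standalone numerical sketch (not actual llama.cpp code; the absmax rounding scheme and the example values are assumptions):

```cpp
// Quantize a hidden-state vector containing one spike to int8 with an absmax
// scale per block, once with block size 256 (q8_K-like) and once with block
// size 32 (q8_1-like), and compare the round-trip error.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

static double rms_roundtrip_error(const std::vector<float> & x, int block_size) {
    double err2 = 0.0;
    for (size_t b = 0; b < x.size(); b += block_size) {
        float amax = 0.0f;
        for (int i = 0; i < block_size; ++i) amax = std::max(amax, std::fabs(x[b + i]));
        const float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        for (int i = 0; i < block_size; ++i) {
            const float deq = scale * std::round(x[b + i] / scale);
            err2 += (x[b + i] - deq) * (x[b + i] - deq);
        }
    }
    return std::sqrt(err2 / x.size());
}

int main() {
    std::vector<float> x(256, 0.01f); // mostly small activations
    x[100] = 8.0f;                    // a single spike, as in the plots in the issue
    printf("block size 256: rms error %g\n", rms_roundtrip_error(x, 256));
    printf("block size  32: rms error %g\n", rms_roundtrip_error(x,  32));
    // With a block size of 256 the spike sets the scale for all 256 values and the
    // small activations round to zero; with a block size of 32 only the spike's own
    // block is affected.
    return 0;
}
```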
I am currently trying to do a CUDA implementation for matrix multiplication that utilizes int8 tensor cores. A major issue is that loading the results from tensor cores has terrible performance. So I will soon try an implementation where the inputs are quantized as int8 but with a single scale per row/column. If the issue is in fact that the CPU quantization block size is too large, the quality of this implementation should be bad; I'll report back when I have a working prototype.
I implemented a prototype for 8-bit quantization of the hidden state with only a single scale per column. The resulting implementation has very similar issues to the CPU implementation. This is the output that I get:
In this case the prompt was "The Julia programming language.". More generally, every time a period appears there is a high likelihood that the next token is garbage, even if the prompt does not end with a period. So I think this really is an issue related to numerics and the large block size in the CPU backend.
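For reference, a sketch of what such a single-scale-per-column quantization looks like (assumed details, not the actual CUDA kernel):

```cpp
// The entire hidden-state column shares a single absmax scale, so one spike
// degrades the resolution of every element in the column.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

struct quantized_column {
    std::vector<int8_t> q; // int8 values
    float               d; // single scale for the whole column
};

static quantized_column quantize_column_int8(const std::vector<float> & x) {
    float amax = 0.0f;
    for (float v : x) amax = std::max(amax, std::fabs(v));

    quantized_column out;
    out.d = amax > 0.0f ? amax / 127.0f : 1.0f;
    out.q.reserve(x.size());
    for (float v : x) out.q.push_back((int8_t) std::lround(v / out.d));
    return out;
}
```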
I've been thinking: if a block size of 256 for quantizing the hidden state really causes garbage tokens upon punctuation, how do we know that a block size of 32 isn't still causing some form of damage? Does llama.cpp have a built-in way of looking at token probabilities?
I generated some samples for the prompt "The Julia programming language." using either
I did some more testing: when using a single scale per hidden-state column, I need 10 bits per weight to get a continuation for "The Julia programming language." using 7b q8_0 that isn't garbage:
For 11 or more bits per weight I always get:
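A small standalone sketch of why a larger bit count helps when a single scale has to cover both a spike and the ordinary activations (the rounding scheme and example values are assumptions, not the prototype's actual code):

```cpp
// With a single scale per column set by a spike of magnitude 8, an ordinary
// activation of 0.01 is flushed to zero at 8 bits and only survives the
// round trip once enough bits are available.
#include <cmath>
#include <cstdio>

int main() {
    const float spike   = 8.0f;   // hypothetical spike that sets the scale
    const float typical = 0.01f;  // hypothetical ordinary activation
    for (int bits = 8; bits <= 12; ++bits) {
        const float qmax  = (float) ((1 << (bits - 1)) - 1);
        const float scale = spike / qmax;
        const float deq   = scale * std::round(typical / scale);
        printf("%2d bits: %.2f round-trips to %g\n", bits, typical, deq);
    }
    return 0;
}
```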
Not sure if it is exactly what you need, but the default site of the
How come the GPU can do what the 10-bit solution does on the CPU? 🤔
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I noticed that sometimes a very odd token is generated at the beginning when using the CPU backend.
Example (CPU, zero temperature):
Example (GPU, zero temperature):
The "surely," token is nonsense. The different output is caused by the following difference between the CPU and GPU backends:
Consider a matrix-vector product
A*x
where A is quantized (e.g. q4) and x is not quantized (e.g. float32). The CPU backend quantizes x to q8 and then computes the product using an optimized vecdot(q4, q8) routine.
The GPU backend dequantizes A to float32 and then computes the product using float32.
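A simplified sketch of the two paths to make the difference concrete (assumed block layouts, not the actual ggml kernels):

```cpp
// A is stored in quantized blocks, x is float32. The CPU path quantizes x
// block-wise to 8 bits and accumulates in integers; the GPU path dequantizes
// A and accumulates in float32.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

constexpr int QK = 32; // block size of the simplified quantized format

struct block_q4 { float d; int8_t q[QK]; }; // simplified 4-bit block, q in [-8, 7]

// CPU-style path: quantize x to q8 per block, then do an integer dot product.
static float vecdot_q4_q8(const std::vector<block_q4> & a_row, const std::vector<float> & x) {
    float sum = 0.0f;
    for (size_t b = 0; b < a_row.size(); ++b) {
        float amax = 0.0f;
        for (int i = 0; i < QK; ++i) amax = std::max(amax, std::fabs(x[b*QK + i]));
        const float dx = amax > 0.0f ? amax / 127.0f : 1.0f;

        int32_t acc = 0;
        for (int i = 0; i < QK; ++i) {
            const int8_t qx = (int8_t) std::lround(x[b*QK + i] / dx); // x is quantized here
            acc += (int32_t) a_row[b].q[i] * qx;
        }
        sum += a_row[b].d * dx * (float) acc;
    }
    return sum;
}

// GPU-style path: dequantize A to float32 and accumulate in float32.
static float vecdot_dequant_f32(const std::vector<block_q4> & a_row, const std::vector<float> & x) {
    float sum = 0.0f;
    for (size_t b = 0; b < a_row.size(); ++b) {
        for (int i = 0; i < QK; ++i) {
            sum += a_row[b].d * (float) a_row[b].q[i] * x[b*QK + i];
        }
    }
    return sum;
}
```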
I tried to find out why these kinds of nonsense tokens are generated right at the beginning. I have a suspicion, but no definite answer:
Transformer states can be quite sparse at the initial tokens. Example state in the third layer resulting from the first token:
These spikes amplify the quantization errors in the corresponding blocks:
Transformer states seem to become less sparse / more diffuse over time. I am not completely sure that these spikes are really the cause, but it seems that the q8 quantization in the CPU backend definitely hurts numerical accuracy of matrix-vector products.
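To illustrate the suspected mechanism end to end, here is a standalone sketch (assumed values and a simplified q8-style scheme with a 256-wide block, not llama.cpp code) comparing a float32 dot product against the same dot product with the hidden state quantized:

```cpp
// A dot product against a spiky hidden state, computed once in float32 and
// once with the hidden state quantized to int8 in one block of 256. The spike
// sets the scale of its block, so the contribution of the small activations
// in that block is lost.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int n = 256, block = 256;
    std::vector<float> w(n), x(n, 0.01f);
    for (int i = 0; i < n; ++i) w[i] = (i % 2 == 0) ? 0.5f : -0.25f; // arbitrary weights
    x[100] = 8.0f; // a single spike, as in the states shown above

    // exact float32 dot product
    float exact = 0.0f;
    for (int i = 0; i < n; ++i) exact += w[i] * x[i];

    // dot product with x quantized to int8 per block of 256 (q8_K-like)
    float quant = 0.0f;
    for (int b = 0; b < n; b += block) {
        float amax = 0.0f;
        for (int i = b; i < b + block; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float d = amax / 127.0f;
        for (int i = b; i < b + block; ++i) quant += w[i] * d * std::round(x[i] / d);
    }

    printf("float32 x: %g\n", exact);
    printf("int8 x   : %g (relative error %.1f%%)\n",
           quant, 100.0f * std::fabs(quant - exact) / std::fabs(exact));
    return 0;
}
```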