8-bit Quantization #298
base: master
Conversation
Nice, this will be a helpful reference. This is the Q8_1 scheme. Roughly, a few of the things that come to mind for quantization: |
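For readers unfamiliar with the llama.cpp naming: Q8-style schemes store int8 values plus a float scale per small block (Q8_1 additionally keeps a per-block sum term, if I recall correctly). A minimal sketch of just the symmetric, scale-only part, with an illustrative block size and struct names of my own:

```c
#include <math.h>
#include <stdint.h>

#define QBLOCK 32  /* illustrative block size, not llama.cpp's exact layout */

typedef struct {
    float scale;       /* per-block scale factor */
    int8_t q[QBLOCK];  /* quantized values in [-127, 127] */
} QBlock;

/* Quantize one block of QBLOCK floats to int8 with a single scale. */
static void quantize_block(const float *x, QBlock *out) {
    float amax = 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    out->scale = amax / 127.0f;
    float inv = out->scale != 0.0f ? 1.0f / out->scale : 0.0f;
    for (int i = 0; i < QBLOCK; i++) {
        out->q[i] = (int8_t) roundf(x[i] * inv);
    }
}

/* Dequantize: x[i] ~= q[i] * scale */
static void dequantize_block(const QBlock *in, float *x) {
    for (int i = 0; i < QBLOCK; i++) {
        x[i] = in->q[i] * in->scale;
    }
}
```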
Acts quant most likely would need fine tuning though, wouldn’t it? |
@byte-6174 not to my knowledge? it's possible to do quantization-aware finetuning to improve a model for quantization, but you can quantize it anyway. |
btw, this PR also works for quantizing the |
Following up on the little pseudo-code above, I checked in a small experiment that does weights/acts quantization on the fly. It needs more experiments, as blindly making all matmuls run on int8 will run everything into the ground :) The code of interest is this:
|
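The checked-in code itself isn't reproduced above, so purely as a hedged illustration of the idea (function names, the single per-tensor scales, and the memory handling are my own assumptions, not the experiment's actual code), an on-the-fly weights/activations int8 matmul could look roughly like this:

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* Quantize n floats to int8 with a single scale; returns the scale. */
static float quantize_i8(const float *x, int8_t *q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax / 127.0f;
    float inv = scale != 0.0f ? 1.0f / scale : 0.0f;
    for (int i = 0; i < n; i++) q[i] = (int8_t) roundf(x[i] * inv);
    return scale;
}

/* xout = W(d,n) @ x(n), with both W and x quantized to int8 on the fly.
   The dot product is accumulated in int32 and rescaled to float per row. */
static void matmul_q8(float *xout, const float *x, const float *w, int n, int d) {
    int8_t *xq = malloc((size_t)n * sizeof(int8_t));
    int8_t *wq = malloc((size_t)n * d * sizeof(int8_t));
    float xs = quantize_i8(x, xq, n);
    float ws = quantize_i8(w, wq, n * d);  /* one scale for the whole weight matrix */
    for (int i = 0; i < d; i++) {
        int32_t acc = 0;
        for (int j = 0; j < n; j++) {
            acc += (int32_t) wq[i * n + j] * (int32_t) xq[j];
        }
        xout[i] = acc * ws * xs;  /* rescale the int32 accumulator back to float */
    }
    free(xq);
    free(wq);
}
```

Quantizing the whole weight matrix inside every call is of course wasteful and only keeps the sketch self-contained; a real version would quantize the weights once up front.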
Hello @byte-6174, when I try to run the llama2 7B chat quantized version, I get gibberish. I did get a coherent response from the quantized stories42M. Using run:
Using runq:
Not sure what could be wrong. |
@mgrabban How did you quantize the 7B model? Can you show the output of the quantization? (it can be a link) |
For the CUDA implementation, check #310 |
You can find the output here
|
I suspect it is related to the ... With ...
And with that chat model it is ... But the ... Can you check with the last commit I sent? |
Yes, that change was needed for the llama2 7B model. Thanks @kroggen!
|
Quantization here is per layer instead of per group. That feels risky? I'd expect that llama.cpp does groups? |
Yes. llama.cpp has groups of 64, etc. |
One outlier nukes the whole tensor. I'm starting a branch for int8 quantization now. I'll do groups. |
Hmm, trying to understand this. So how do the groups of 64 avoid this? |
If there is a bad outlier somewhere, only e.g. up to 63 elements get "messed up" with high error, not the entire tensor. So breaking things up into groups makes things more robust to outliers. |
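For concreteness, a sketch of group-wise quantization with groups of 64 (the group size, names, and layout here are illustrative):

```c
#include <math.h>
#include <stdint.h>

#define GROUP 64  /* illustrative group size */

/* Quantize n floats (n divisible by GROUP) to int8 with one scale per group. */
static void quantize_groups(const float *x, int8_t *q, float *scales, int n) {
    for (int g = 0; g < n / GROUP; g++) {
        const float *xg = x + g * GROUP;
        /* find the max magnitude within this group only */
        float amax = 0.0f;
        for (int i = 0; i < GROUP; i++) {
            float a = fabsf(xg[i]);
            if (a > amax) amax = a;
        }
        scales[g] = amax / 127.0f;
        float inv = scales[g] != 0.0f ? 1.0f / scales[g] : 0.0f;
        for (int i = 0; i < GROUP; i++) {
            q[g * GROUP + i] = (int8_t) roundf(xg[i] * inv);
        }
    }
}
```

For example, if a tensor's values are mostly around ±1 but one entry is 100, a single per-tensor scale becomes 100/127 ≈ 0.79 and the typical values collapse onto just one or two int8 levels; with groups of 64, only the group containing the outlier loses that precision.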
Btw, as a side note: there is experimental evidence, in llama.cpp and also in places like LLM.int8(), of needing mixed precision to tackle outliers. Thought we might want to / have to consider it?! |
That's a good point... The outliers are really annoying because they necessitate ugly special casing. Hmmm |
Ya this is where weight statistics analysis and all that business is needed, not as clean 🤢 |
Right, to solve that to the extreme we can / need to scale the rows and columns, convert to int8, and send that to the matmul. |
There is another cool approach I remember from Song Han, using k-means. But that is more added complexity. 😜 |
The reason for the choice is that this repo is mainly for educational purposes, and using one scale factor per layer is the simplest way to show dequantization on-the-fly. The code is easier to understand. It is not the most precise; if we want to go to the edge, we can just use llama.cpp. It would be good to check the perplexity (ppl) values to see how much quality decreases with this approach. |
Btw, this PR doesn't actually perform integer dot products; it dequantizes the weight to float in the inner loop and uses that. Are people still obtaining speed improvements from this "weak" version of quantization? |
I tested stories110M.bin. On a normal CPU (e.g. i7-8700K), runq is faster than runomp (which is faster than runfast and run). On a high-end CPU, runq is faster than runfast and run, but runomp is the fastest (since there are many cores available). |
These are my results. When compiled with these commands:
On MacBook Pro Intel:
On Linux with AMD EPYC-Rome:
On Apple M1/M2 chips there is no difference in performance, probably because they have integrated RAM and these other chips have external RAM. When compiling both with OpenMP on Linux/AMD using these commands:
I got these results:
|
It is intentional. Multiplication with int8 requires a lot of other computations, making it slower than using float32 multiplication. With dequantization on-the-fly, the code is simpler and faster at the same time. The downside in the current implementation is the use of only one scale factor per layer. This can be changed. |
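As a hedged sketch of the dequantize-on-the-fly approach described here (weights stay int8 in memory with one scale per layer, and the arithmetic itself is float; names are illustrative, not the PR's exact code):

```c
#include <stdint.h>

/* xout = W(d,n) @ x(n), where W is stored as int8 plus a single per-layer
   scale and is dequantized to float inside the inner loop. */
static void matmul_dequant(float *xout, const float *x,
                           const int8_t *wq, float wscale, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) {
            val += (wq[i * n + j] * wscale) * x[j];  /* dequantize on the fly */
        }
        xout[i] = val;
    }
}
```

The likely source of the speedup is memory bandwidth: the int8 weights are a quarter the size of float32, so the weight reads that dominate the matmul cost shrink by 4x while the float multiply-add stays the same.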
Yes, and moreover even that downside is not that significant, as shown by our results for at least up to the 7B model. The real deal breaker is the outliers, and the solution is mixed precision of some sort. |
This PR also has work from Aniket.
It implements very basic but understandable 8-bit quantization (using `quantize.c`) and also dequantization on-the-fly in `matmul`, `rmsnorm` and `dequantize_token`.
The RoPE weights are intentionally not quantized, as quantizing them may cause some loss, although this was not tested.
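As a hedged sketch of how on-the-fly dequantization looks for `rmsnorm` with a single per-tensor weight scale (the structure follows llama2.c's rmsnorm; the names and epsilon are illustrative, not necessarily the PR's exact code):

```c
#include <math.h>
#include <stdint.h>

/* rmsnorm with the weight vector stored as int8 plus one scale,
   dequantized on the fly while normalizing. */
static void rmsnorm_q8(float *o, const float *x,
                       const int8_t *wq, float wscale, int size) {
    /* compute 1 / rms(x) */
    float ss = 0.0f;
    for (int j = 0; j < size; j++) ss += x[j] * x[j];
    ss = 1.0f / sqrtf(ss / size + 1e-5f);
    /* normalize and scale, dequantizing the weight on the fly */
    for (int j = 0; j < size; j++) {
        o[j] = (wq[j] * wscale) * (ss * x[j]);
    }
}
```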
Example Usage