Add exllama q4 kernel #219
Conversation
Awesome! I'm really excited for this. I've done a quick benchmark, on:
- Llama 1 7B GPTQ, group size 128, Act-Order/desc_act = False:
  - AutoGPTQ 0.3.2 main (standard CUDA kernel), with fused attention
  - fxmarty's AutoGPTQ + ExLlama, without fused attention (see below)
- Llama 1 33B GPTQ, group size: None, Act-Order/desc_act = True:
  - AutoGPTQ 0.3.2 main (standard CUDA kernel), with fused attention
  - fxmarty's AutoGPTQ + ExLlama, without fused attention (see below)
Not yet a huge difference on this CPU-bottlenecked system, but definitely worthwhile - and it should be better still when fused attention works. We also see lower VRAM usage in the 33B test, but that could be down to not using fused attention, which tends to use a bit more VRAM. I wanted to test group_size + desc_act, but it crashes at the moment. I found three issues while doing this testing (two bugs, one problem), which I'll put in the next message.
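For context, a rough sketch of how such a side-by-side setup might look with AutoGPTQ (the model path is a placeholder, and `disable_exllama` is an assumed keyword name for selecting the kernel, which may differ from the flag this PR actually adds):

```python
# Hypothetical comparison setup; the model path is a placeholder and
# disable_exllama is an assumed keyword name (the real flag may differ).
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/some-llama-GPTQ"  # placeholder

# Configuration A: standard CUDA kernel, with fused attention (as on main).
model_cuda = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=True,
    disable_exllama=True,   # assumed flag: fall back to the standard CUDA kernel
)

# Configuration B: ExLlama q4 kernel, fused attention disabled (see below).
# In practice you would load one configuration at a time to avoid running out of VRAM.
model_exllama = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    device="cuda:0",
    use_safetensors=True,
    inject_fused_attention=False,
    disable_exllama=False,
)
```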
Bugs:
Issue:
Example: 30B, no group_size, act-order = True
This might be an inevitable result of the ExLlama kernel? But I thought it worth mentioning. Also, when I tried to test 30B 128g + Act-Order, it took something like 10 minutes to load weights before it segmentation faulted due to issue 2. I can't give an exact timing because of the crash, but from that it seemed that group_size + desc_act is perhaps even slower on weight loading.
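To pin down the load-time difference, one could simply time `from_quantized`; a minimal sketch (model directories are placeholders):

```python
# Rough load-time measurement; model directories are placeholders.
import time

import torch
from auto_gptq import AutoGPTQForCausalLM

for model_dir in (
    "TheBloke/llama-30B-GPTQ-no-groupsize",   # placeholder: no group size, act-order
    "TheBloke/llama-30B-GPTQ-128g-actorder",  # placeholder: 128g + act-order
):
    start = time.perf_counter()
    model = AutoGPTQForCausalLM.from_quantized(
        model_dir, device="cuda:0", use_safetensors=True
    )
    print(f"{model_dir}: loaded in {time.perf_counter() - start:.1f}s")
    del model
    torch.cuda.empty_cache()
```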
The fused attention bug should be solved in the act-order=False case. @TheBloke can you run it again? There is no test yet for the act-order case, so it is expected that it crashes (probably something is wrong in my implementation). Will add a test. I'll test the load time as well, thank you! Edit: to me the case
@fxmarty
That's the main branch option for any 30/33B or 65B/70B GPTQ I have released. For 33B models in particular, no group size is desirable as it lowers VRAM usage, and then act-order is added to get the highest possible quantisation accuracy (at least when evaluated on perplexity). I don't have PPL results to hand for 30B, but for example here is the result at 7B, comparing:
As you see, adding act-order to the quantisation makes a pretty big difference in PPL.
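For reference, perplexity comparisons like this are usually computed with a strided window over a held-out corpus; a generic sketch (not the exact script behind the numbers above, and the model path is a placeholder):

```python
# Generic strided perplexity evaluation; not the exact script used for the numbers above.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/some-llama-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")

seq_len, stride = 2048, 512
nlls, prev_end = [], 0
for begin in range(0, ids.shape[1], stride):
    end = min(begin + seq_len, ids.shape[1])
    target_len = end - prev_end              # only score tokens not seen in the previous window
    input_ids = ids[:, begin:end]
    labels = input_ids.clone()
    labels[:, :-target_len] = -100           # mask the overlapping prefix
    with torch.no_grad():
        # the GPTQ wrapper forwards the call to the underlying transformers model
        loss = model(input_ids=input_ids, labels=labels).loss
    nlls.append(loss * target_len)
    prev_end = end
    if end == ids.shape[1]:
        break

print("perplexity:", torch.exp(torch.stack(nlls).sum() / prev_end).item())
```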
Act-order should work now. You can give it a try. Having exllama act-order work with fused QKV is still to do. For the act-order + group case, running the benchmark on an Intel Xeon 8275CL CPU + A100 80GB, I get roughly a x2.7 speedup. On main (with this critical fix #220):
On this branch:
Note that we generate 256 tokens with greedy search, and that the inputs are of various (small) shapes:
I did not try batched generation yet. For the no act-order + group case, running the same benchmark, I get a x1.3 speedup. On main:
On this branch:
I'll test the load time more thoroughly later.
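A generic sketch of the kind of latency measurement described above (greedy decoding, fixed number of new tokens; the model path and prompts are placeholders standing in for the various small input shapes):

```python
# Generic greedy-decoding latency measurement; model path and prompts are placeholders.
import time

import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/some-llama-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoGPTQForCausalLM.from_quantized(model_dir, device="cuda:0", use_safetensors=True)

prompts = ["Hello,", "The theory of relativity states that", "def fibonacci(n):"]
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False, num_beams=1)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/s for a prompt of {inputs['input_ids'].shape[1]} tokens")
```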
This looks very cool.
@TheBloke
Do not merge for now, please.
Exllama q4 kernel
Hi @PanQiWei, I resolved conflicts and tested both on A100 and MI250; we have both CUDA & ROCm support for it as well, which is cool! Feel free to have a look and merge if it looks good to you. Note the test for correctness added in this PR.
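That test isn't reproduced here; conceptually, a correctness check of this kind compares generation from the new kernel path against a reference. A sketch, not the PR's actual test (`disable_exllama` is an assumed keyword name and the model path is a placeholder):

```python
# Sketch of a generation-correctness check; not the PR's actual test.
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_dir = "TheBloke/some-llama-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_dir)
prompt = tokenizer("The capital of France is", return_tensors="pt").to("cuda:0")

reference = AutoGPTQForCausalLM.from_quantized(
    model_dir, device="cuda:0", use_safetensors=True, disable_exllama=True  # assumed flag
)
ref_out = reference.generate(**prompt, max_new_tokens=32, do_sample=False)
del reference
torch.cuda.empty_cache()

exllama = AutoGPTQForCausalLM.from_quantized(
    model_dir, device="cuda:0", use_safetensors=True, disable_exllama=False
)
new_out = exllama.generate(**prompt, max_new_tokens=32, do_sample=False)

# With greedy decoding, both kernels should produce the same tokens.
assert tokenizer.decode(ref_out[0]) == tokenizer.decode(new_out[0])
```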
I've run the tests and benchmarks and all looks good to me. Thank you so much for this great work! Will merge it now.
The ExLlama kernel for int4/fp16 is notably faster than the implementation in AutoGPTQ / GPTQ-for-llama (see e.g. huggingface/text-generation-inference#553 (comment); I will do a proper benchmark for auto-gptq), especially in the act-order case, where weights are reordered ahead of time and activations are reordered on the fly, removing the need to reorder scales and zero points inside the kernel.
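As a toy illustration of that reordering idea (pure PyTorch, not the actual kernel; all shapes and values below are made up): reordering the weight rows once so that quantization groups become contiguous, and permuting the activations at inference time, leaves the matmul result unchanged.

```python
# Toy illustration of ahead-of-time weight reordering for act-order; not the kernel code.
import torch

in_features, out_features = 8, 4
W = torch.randn(in_features, out_features)       # stand-in for the dequantized weight
g_idx = torch.tensor([0, 1, 0, 1, 1, 0, 0, 1])   # group index per input channel (act-order)

perm = torch.argsort(g_idx)                      # rows sorted so each group is contiguous
W_reordered = W[perm]                            # done once, at load time

x = torch.randn(2, in_features)                  # activations
y_ref = x @ W
y = x[:, perm] @ W_reordered                     # permute activations on the fly
assert torch.allclose(y_ref, y, atol=1e-5)       # same result, but groups are now contiguous
```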
I removed the ROCm support for now; let's first merge #214 and then test this kernel on ROCm.
Left to do: