Exllama integration #30
Conversation
7617663
Would it make sense to create a
Exllama optimize q4 matmul and fix bug
@qwopqwop200 After merging your PR, I get the following error when testing with Vicuna 7B:
Am I supposed to re-quantize a model to get it working?
We might have to abort this pull request. The people from MIT just released a new kernel that should be much faster.
No, I don't think that's necessary. The current comparison results show that the exllama matmul kernel outperforms the tinychatv2 matmul kernel. The speed improvement of tinychatv2 appears to come from additional optimizations such as attention.
[benchmark table comparing exllama and tinychatv2 (GEMV), plus a later edit, omitted]
I will do some more testing. Perhaps we can have both exllama and the new GEMV kernel; we just need to convert the weights to an exllama-compatible format.
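As a rough illustration of what such a conversion involves (a hypothetical sketch, not AutoAWQ or ExLlama code, and the bit layout here is simplified), repacking 4-bit weights generally means unpacking them from their int32 storage and packing them back in the order the target kernel expects:

```python
import torch

def unpack_int4(qweight: torch.Tensor) -> torch.Tensor:
    """Unpack eight 4-bit values from each int32 element (simplified layout)."""
    shifts = torch.arange(0, 32, 4, device=qweight.device)
    values = (qweight.unsqueeze(-1) >> shifts) & 0xF  # (..., 8)
    return values.flatten(-2)                         # (..., cols * 8)

def pack_int4(values: torch.Tensor) -> torch.Tensor:
    """Inverse of unpack_int4: pack eight 4-bit values back into each int32."""
    values = values.reshape(*values.shape[:-1], -1, 8)
    shifts = torch.arange(0, 32, 4, device=values.device)
    return (values << shifts).sum(dim=-1).to(torch.int32)

# A real conversion would unpack with the source kernel's layout, reorder
# rows/columns (and zero points/scales) as the target kernel expects, then repack.
```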
It seems the GEMV kernel is about 20% faster than GEMM, but context processing is slow and larger batch sizes should also be slow. I have made it easy to extend to other methods like exllama, which will come shortly after. GEMV, GEMM, and ExLlama will each have their own quantization format and kernel.
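To make the trade-off concrete, here is a hedged sketch of how a layer might dispatch between a GEMV and a GEMM kernel based on input shape; the kernel callables are placeholders, not AutoAWQ's actual API:

```python
import torch

def quant_forward(x: torch.Tensor, qweight, scales, qzeros, *, gemv_kernel, gemm_kernel):
    """Pick a quantized matmul kernel based on the shape of the input.

    gemv_kernel / gemm_kernel are placeholder callables standing in for the
    compiled CUDA kernels; this only illustrates the dispatch logic.
    """
    batch, seq_len, _ = x.shape
    if batch * seq_len == 1:
        # Single-token decoding: a matrix-vector product, where GEMV shines.
        return gemv_kernel(x, qweight, scales, qzeros)
    # Prefill (context processing) or batched decoding: GEMM-shaped work.
    return gemm_kernel(x, qweight, scales, qzeros)
```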
I tested this PR with Llama2-13B on one NVIDIA A800 GPU, but it does not work.
@nexa123 Did you find this model on Hugging Face, or did you quantize it from scratch yourself?
I used examples/basic_quant.py to quantize the model and examples/basic_generate.py to observe the result.
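For context, the quantization script boils down to roughly the following flow (a sketch based on AutoAWQ's documented usage; the model path, output directory, and quant_config values are placeholders, not taken from this thread):

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "lmsys/vicuna-7b-v1.5"   # placeholder model id
quant_path = "vicuna-7b-awq"          # placeholder output directory
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize the weights, then save model and tokenizer for later generation.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```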
The author of exllama released a new [implementation](https://github.com/turboderp/exllamav2/tree/master), which is faster than the old one. So should this PR be refactored to build on exllamav2?
For reference, the AutoAWQ main branch is now 5-10% slower than ExLlama V2 for token generation, according to my benchmarks. We are hitting the limits of how much faster models can run. The only thing left to optimize is context processing.
I tested both implementations on an NVIDIA A800. ExLlamaV2 is 80-90% faster than the old one. But the fastest AWQ implementation in my tests is LMDeploy.
Yes, for production deployments, you want to leverage either vLLM or LMDeploy. I believe vLLM is faster if you use larger batch sizes. AutoAWQ is meant to offer relatively fast generation and ease of access.
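For anyone following along, a minimal sketch of loading an AWQ-quantized model in vLLM looks like the following (the model path is a placeholder; the API follows vLLM's documented offline inference usage):

```python
from vllm import LLM, SamplingParams

# Load an AWQ-quantized checkpoint with vLLM's AWQ kernel path.
llm = LLM(model="vicuna-7b-awq", quantization="awq")  # placeholder path
params = SamplingParams(temperature=0.7, max_tokens=64)

outputs = llm.generate(["What is AWQ quantization?"], params)
print(outputs[0].outputs[0].text)
```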
This is an integration of the ExLlama kernels.
My initial notes:
I invite anyone who wants to try to make this work to open PRs. @qwopqwop200