GPTQ / ExLlamaV2 (EXL2) quantisation #4165
Comments
#1256 #1106 The current quantized models are already near-optimal and perform better than GPTQ:
If you are hoping for faster CUDA, https://github.com/JohannesGaessler says he wants to make improvements, but will be busy until the end of December.
IIRC, only LLaMA and, to a degree, Falcon use a mix of k-quants that has been hand-optimized for low perplexity, so there might be unused optimizations on the table for k-quant mixes. Edit: this info might be out of date, so if anyone has an update on that, please let me know :)
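To make the idea of a quant mix concrete, here is a toy sketch of how the effective bits-per-weight of a mixed quantization could be computed. The tensor groups, fractions, and per-type BPW values are made-up placeholders, not llama.cpp's actual hand-tuned mix.

```python
# Toy illustration of a mixed-quant layout: different tensor groups use
# different quant types, and the overall BPW is their weighted average.
# All names and numbers below are hypothetical, not llama.cpp's real mix.
mix = {
    "attn_v":   (0.10, 6.56),  # (fraction of total weights, BPW of its quant type)
    "ffn_down": (0.35, 4.50),
    "other":    (0.55, 3.44),
}

effective_bpw = sum(frac * bpw for frac, bpw in mix.values())
print(f"effective BPW of this mix: {effective_bpw:.2f}")
```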
With EXL2 you can fit a 70B model into 24 GB of VRAM, but for a 70B Q2 quant even 32 GB is not enough.
Just to compare, I am running a 70B model on 32 GB of RAM + 8 (7) GB of VRAM: Q3_K_S works out to 3.47 BPW, and that's basically the lowest quality I would go; anything below that really shows. It would be very cool if we could compare the perplexity values between EXL2 and llama.cpp at the same BPW.
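For reference, a rough back-of-the-envelope estimate of weight memory at a given BPW. It ignores KV cache, context, and runtime overhead, so actual usage is higher, and the parameter count and BPW values are just examples:

```python
# Rough size of the quantized weights alone for a model of n_params parameters
# at a given bits-per-weight; KV cache and runtime overhead are not included.
def weight_gib(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 2**30

for bpw in (2.0, 3.47, 4.5):
    print(f"70B @ {bpw:.2f} BPW ≈ {weight_gib(70e9, bpw):.1f} GiB of weights")
```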
It might be possible to add just dequantization support for some of those other formats. Quantizing can be complicated; dequantizing usually isn't too bad. Also, those projects probably already have things like CUDA kernels available that could be yoinked if they have a compatible license.
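As a rough illustration of why dequantization tends to be the easy half, here is a minimal group-wise dequantization sketch in the general spirit of GPTQ/EXL2-style formats. The layout (one scale and zero point per group of low-bit indices) is a simplification for illustration, not either project's actual on-disk format.

```python
import numpy as np

def dequantize_group(q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    """Dequantize one group: q holds low-bit indices (e.g. 0..15 for 4-bit),
    and the whole group shares a single scale and zero point."""
    return scale * (q.astype(np.float32) - zero)

# Example: one group of eight 4-bit weights.
q = np.array([0, 3, 7, 8, 12, 15, 5, 9], dtype=np.uint8)
print(dequantize_group(q, scale=0.02, zero=8.0))
```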
Is this something the developers are interested in / willing to add support for? Just trying to understand what's out there currently in terms of Mac LLM tech. I know this is no small feat; I'm just trying to see whether it's something that is in the roadmap/pipeline, or something the developers specifically do not want to implement.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Does this mean llama.cpp won't be adding support for EXL2 or GPTQ?
See #4704 (comment)
Still seeking EXL2 support!
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation
It sounds like it's a fast/useful quantisation method:
Possible Implementation
N/A