
GPTQ / ExLlamaV2 (EXL2) quantisation #4165

Closed
0xdevalias opened this issue Nov 22, 2023 · 10 comments
Labels: enhancement (New feature or request), stale

Comments


0xdevalias commented Nov 22, 2023

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

It sounds like it's a fast/useful quantisation method.

Possible Implementation

N/A

@BarfingLemurs (Contributor)

#1256
AFAIK you can mix all the k-quants in the same model with no performance issue, but no one has felt a need to make a preset lower than 3.4 bpw (Q2_K, which is mostly q3_K).

#1106 The current quant mixes are already close to optimal and do better than GPTQ:

"As far as I can tell, we are now on par with best known GPTQ result for 7B, and better for 13B by about 0.05."

If you are hoping for faster CUDA, @JohannesGaessler says he wants to make improvements, but will be busy until the end of December.

Green-Sky (Collaborator) commented Nov 22, 2023

IIRC, only LLaMA and, to a degree, Falcon use a mix of k-quants that has been hand-optimized for low perplexity. So there might still be unused optimizations on the table for k-quant mixes.

edit: this info might be out of date, so if anyone has an update on that, please let me know :)


8XXD8 commented Nov 22, 2023

With EXL2 you can fit a 70B model into 24 GB of VRAM, but for a 70B Q2 quant even 32 GB is not enough.
If k-quants have similar quality to EXL2 at the same bpw, then it might be worthwhile to go below Q2.
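
Rough numbers behind that claim (a back-of-the-envelope sketch, weights only; the 2.5 bpw figure is an assumed low EXL2 preset, 3.4 bpw is the Q2_K mix mentioned above):

```python
# Back-of-the-envelope weight memory for a 70B model at various bits per weight.
# Ignores KV cache, activations and runtime overhead, so real usage is higher.
PARAMS = 70e9

for bpw in (2.5, 3.4, 4.0):
    gib = PARAMS * bpw / 8 / 2**30
    print(f"{bpw:.2f} bpw -> ~{gib:.1f} GiB of weights")

# 2.50 bpw -> ~20.4 GiB  (squeezes into 24 GB with a small context)
# 3.40 bpw -> ~27.7 GiB  (too big for 24 GB, tight at 32 GB once overhead is added)
# 4.00 bpw -> ~32.6 GiB
```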

@Green-Sky (Collaborator)

Just to compare, I am running a 70B model on 32 GiB of RAM + an 8 GiB (effectively 7 GiB) VRAM GPU:

llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q3_K - Small
llm_load_print_meta: model params     = 68.98 B
llm_load_print_meta: model size       = 27.86 GiB (3.47 BPW)
llm_load_print_meta: general.name   = LLaMA v2

So Q3_K_S -> 3.47 BPW, and that's basically the lowest I would go; anything below that really shows.

It would be very cool if we could compare the perplexity values between exl2 and llama.cpp at the same BPW.
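
As a sanity check, the 3.47 BPW figure follows directly from the other two numbers in that log:

```python
# bits per weight = model size in bits / parameter count (values from the log above)
size_gib = 27.86      # "model size = 27.86 GiB"
params   = 68.98e9    # "model params = 68.98 B"

bpw = size_gib * 2**30 * 8 / params
print(f"{bpw:.2f} BPW")  # -> 3.47, matching llm_load_print_meta
```

For the llama.cpp half of such a comparison, the perplexity example in this repo (e.g. `./perplexity -m model.gguf -f wiki.test.raw`) gives the numbers; the EXL2 side would need the equivalent measurement from exllamav2, and the scores are only comparable if both use the same test set and context length.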

@KerfuffleV2 (Collaborator)

It might be possible to add just dequantization support for some of those other formats. Quantizing can be complicated; dequantizing usually isn't too bad. Also, those projects probably already have things like CUDA kernels available that could be yoinked if they have a compatible license.
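
To illustrate why the dequantization half tends to be the easy part: most of these formats boil down to something like w ≈ scale * (q - zero) applied per group. A purely illustrative sketch (the shapes and group layout here are assumptions, not the actual GPTQ/EXL2 tensor packing, which stores several 4-bit codes per int32 plus a permutation):

```python
import numpy as np

def dequant_4bit_groups(q, scales, zeros, group_size=128):
    """Illustrative group-wise dequantization: w ~= scale * (q - zero).

    q:      (rows, cols) int array of 4-bit codes, already unpacked to 0..15
    scales: (rows // group_size, cols) per-group scales
    zeros:  (rows // group_size, cols) per-group zero points
    """
    rows, cols = q.shape
    w = np.empty((rows, cols), dtype=np.float32)
    for g in range(rows // group_size):
        rs = slice(g * group_size, (g + 1) * group_size)
        w[rs] = scales[g] * (q[rs].astype(np.float32) - zeros[g])
    return w

# The quantization direction (choosing scales/zeros, error-compensating updates
# as in GPTQ) is the genuinely hard part; this inverse mapping is not.
```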


agnosticlines commented Dec 29, 2023

Is this something the developers are interested in / willing to add support for? I'm just trying to understand what's currently out there in terms of Mac LLM tech. I know this is no small feat; I'm just trying to see whether it's on the roadmap/pipeline or whether it's something the developers specifically do not want to implement.

github-actions bot added the stale label Mar 19, 2024

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions bot closed this as completed Apr 3, 2024

sammcj commented Apr 3, 2024

Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?

@ggerganov (Member)

Does this mean llama.cpp won’t be adding support for exl2 or GPTQ?

See #4704 (comment)


txhno commented Jul 19, 2024

Still seeking EXL2 support!
