GPTQ / ExLlamaV2 (EXL2) quantisation #4165
Comments
#1256 #1106 The current quantized models are already near-optimal and perform better than GPTQ:
If you are hoping for faster CUDA, https://github.com/JohannesGaessler says he wants to make improvements, but will be busy until the end of December.
IIRC, only LLaMA and, to a degree, Falcon use a mix of k-quants that has been hand-optimized for low perplexity, so there might be unused optimizations on the table for k-quant mixes. Edit: this info might be out of date, so if anyone has an update on that, please let me know :)
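To make the idea of a quant mix concrete, here is a toy sketch of how the effective bits-per-weight of a mixed quantization could be computed. The tensor groups, fractions, and per-type BPW values are made-up placeholders, not llama.cpp's actual hand-tuned mix.

```python
# Toy illustration of a mixed-quant layout: different tensor groups use
# different quant types, and the overall BPW is their weighted average.
# All names and numbers below are hypothetical, not llama.cpp's real mix.
mix = {
    "attn_v":   (0.10, 6.56),  # (fraction of total weights, BPW of its quant type)
    "ffn_down": (0.35, 4.50),
    "other":    (0.55, 3.44),
}

effective_bpw = sum(frac * bpw for frac, bpw in mix.values())
print(f"effective BPW of this mix: {effective_bpw:.2f}")
```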
With EXL2 you can fit a 70B model into 24 GB of VRAM, but for a 70B Q2 quant even 32 GB is not enough.
Just to compare, I am running a 70B model on 32 GB of RAM + 8 (7) GB of VRAM: Q3_K_S works out to 3.47 BPW, and that's basically the lowest quality I would go; anything below that really shows. It would be very cool if we could compare the perplexity values between EXL2 and llama.cpp at the same BPW.
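For reference, a rough back-of-the-envelope estimate of weight memory at a given BPW. It ignores KV cache, context, and runtime overhead, so actual usage is higher, and the parameter count and BPW values are just examples:

```python
# Rough size of the quantized weights alone for a model of n_params parameters
# at a given bits-per-weight; KV cache and runtime overhead are not included.
def weight_gib(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 2**30

for bpw in (2.0, 3.47, 4.5):
    print(f"70B @ {bpw:.2f} BPW ≈ {weight_gib(70e9, bpw):.1f} GiB of weights")
```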
It might be possible to add just dequantization support for some of those other formats. Quantizing can be complicated; dequantizing usually isn't too bad. Also, those projects probably already have things like CUDA kernels available that could be yoinked if they have a compatible license.
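As a rough illustration of why dequantization tends to be the easy half, here is a minimal group-wise dequantization sketch in the general spirit of GPTQ/EXL2-style formats. The layout (one scale and zero point per group of low-bit indices) is a simplification for illustration, not either project's actual on-disk format.

```python
import numpy as np

def dequantize_group(q: np.ndarray, scale: float, zero: float) -> np.ndarray:
    """Dequantize one group: q holds low-bit indices (e.g. 0..15 for 4-bit),
    and the whole group shares a single scale and zero point."""
    return scale * (q.astype(np.float32) - zero)

# Example: one group of eight 4-bit weights.
q = np.array([0, 3, 7, 8, 12, 15, 5, 9], dtype=np.uint8)
print(dequantize_group(q, scale=0.02, zero=8.0))
```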
Is this something the developers are interested in / willing to add support for? Just trying to understand what's out there currently in terms of Mac LLM tech. I know this is no small feat; I'm just trying to see whether it's something that is in the roadmap/pipeline, or something the developers specifically do not want to implement.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Does this mean llama.cpp won't be adding support for EXL2 or GPTQ?
See #4704 (comment)
Still seeking EXL2 support!
Feature Description
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation
It sounds like it's a fast/useful quantisation method:
Possible Implementation
N/A