add QuiP quant support #217
base: master
Conversation
Very cool. Thank you for making this, will be awesome when it's done 👍

Would like to have this feature :)
It might be worth noting that most of what QuIP# appears to be achieving seems to be achieved better by llama.cpp's new quantization schema. The dedicated 2-bit quants (IQ2_XXS and IQ2_XS, 2.06 and 2.31 BPW, as opposed to being more like 3 bits) are very strong.* There's now an optimisation scheme involved, but it's clearly not a "generate FP64 Hessians for a week (no really, that's what they suggested, for 6k ctx) on your grandma's Threadripper X cluster" affair; it's much more like the "discover the most important weights by throwing words at the decoder" schema we're familiar with here. ikawrakow's write-up here is well worth a read. I'll link this technical discussion I found instead of the unsightly spat with one of the Cornell team.

I'm long past a claim to being any sort of computer scientist, but I'd like to hope EXL2's inheritance from GPTQ (quantize-then-calibrate as GPTQ's design goal, with flexibility in weight assignments added by EXL2 on top of that) could make EXL2 itself a better fit for the methods used here than QuIP#. Those imatrix files are bigger than EXL2's measurements, but 25 MB isn't exactly out of reach here, compared to 2 MB? (...Could these just convert directly? Probably not without the E8 enumeration support, but I do wonder what exactly is in a GGUF that isn't in an EXL2, or vice versa.)

*IQ3_XXS is absolutely robust; it's scary. It's very new. It's making me plug in a 3070 Ti that I ought to sell.
This is a draft PR adding QuIP# quantization support to ExLlamaV2.
Original QuIP# repo
Works:
Perplexity (ppl) benchmark
using dataset: [wikitext-2-v1_validation_0000.parquet](https://huggingface.co/datasets/wikitext/tree/refs%2Fconvert%2Fparquet/wikitext-2-v1/validation)
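For reference, the ppl figure is exp of the mean per-token negative log-likelihood over the dataset rows. A rough sketch of that computation follows, assuming a generic `model(ids) -> logits` interface and a tokenizer returning `(1, n)` id tensors; this is not the PR's actual evaluation code.

```python
import math
import pandas as pd
import torch
import torch.nn.functional as F

def eval_ppl(model, tokenizer, parquet_path, ctx_len=2048):
    texts = pd.read_parquet(parquet_path)["text"]
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer.encode(text)[:, :ctx_len]   # (1, seq_len) token ids
        if ids.shape[-1] < 2:
            continue
        with torch.no_grad():
            logits = model(ids)                     # (1, seq_len, vocab)
        # NLL of each token given its prefix; ppl = exp(mean NLL)
        nll = F.cross_entropy(
            logits[0, :-1].float(), ids[0, 1:], reduction="sum"
        )
        total_nll += nll.item()
        total_tokens += ids.shape[-1] - 1
    return math.exp(total_nll / total_tokens)
```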
sample cmd
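A command along these lines runs the perplexity pass (a sketch assuming exllamav2's `test_inference.py` interface with its `-m` model-dir and `-ed` eval-dataset flags; the model path is a placeholder):

```sh
python test_inference.py -m /path/to/quip-2bit-model -ed wikitext-2-v1_validation_0000.parquet
```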
inference example
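For context, a minimal generation script against a converted model using ExLlamaV2's Python API might look like the sketch below; the model path, prompt, and sampler settings are placeholders, not values from this PR.

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

# Load the quantized model (path is a placeholder)
config = ExLlamaV2Config()
config.model_dir = "/path/to/quip-2bit-model"
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

# Simple one-shot generation with example sampling settings
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("Once upon a time,", settings, 128))
```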
7B 2bit E8P
13B 2bit E8P
70B 2bit E8P