GPTQ Collaboration? #75
Hey Dan, nice to hear from you. I had a couple of questions for you as well regarding kernels.
@Wingie llama.cpp has supported 4-bit GPTQ inference for 4 days now. There is a script in that repo called
GPTQ is indeed better than RtN even in pure CPU implementations.
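For readers unfamiliar with the baseline being compared against: round-to-nearest (RtN) simply snaps each weight to the closest point on a uniform grid, independently of the other weights. A minimal sketch (the `rtn_quantize` helper and its per-row min/max scaling scheme are illustrative assumptions, not code from either repo):

```python
import numpy as np

def rtn_quantize(w, bits=4):
    """Naive round-to-nearest (RtN) quantization of a weight vector.

    Values are snapped to a uniform grid of 2**bits levels spanning the
    vector's [min, max] range, then dequantized. GPTQ improves on this by
    updating the not-yet-quantized weights to compensate for each rounding
    error, which is why it loses less accuracy at low bit-widths.
    """
    levels = 2 ** bits - 1
    w_min = w.min()
    scale = (w.max() - w_min) / levels
    codes = np.round((w - w_min) / scale)  # integer codes in [0, levels]
    return codes * scale + w_min           # dequantized approximation

rng = np.random.default_rng(0)
w = rng.standard_normal(64).astype(np.float32)
w_hat = rtn_quantize(w, bits=4)
```

Each element's rounding error is bounded by half the grid step, but RtN makes no attempt to keep the layer's overall output close to the original, which is where GPTQ's error-compensation wins.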
With the latest optimizations to GPTQ, 13B at 3-bit outperforms 7B at 4-bit, 30B at 3-bit outperforms 13B at 4-bit, and so on. So you will likely want to maximize the number of parameters you can fit in the RAM/VRAM you have. If you have memory to spare, then more bits may produce marginally better results at the same parameter count.
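The "fit the most parameters in your budget" rule above can be sanity-checked with back-of-the-envelope arithmetic: quantized weights take roughly params × bits / 8 bytes. A rough sketch (the 15% overhead factor for scales, zero-points, and runtime buffers is an assumption, not a measured figure from GPTQ or llama.cpp):

```python
def quantized_weight_gib(n_params, bits, overhead=1.15):
    """Back-of-the-envelope weight memory: n_params * bits / 8 bytes,
    inflated by an assumed ~15% overhead for quantization metadata
    and runtime buffers, converted to GiB."""
    return n_params * bits / 8 * overhead / 2**30

budget_gib = 8  # hypothetical memory budget
configs = {
    "7B @ 4-bit": quantized_weight_gib(7e9, 4),
    "13B @ 3-bit": quantized_weight_gib(13e9, 3),
    "13B @ 4-bit": quantized_weight_gib(13e9, 4),
    "30B @ 3-bit": quantized_weight_gib(30e9, 3),
}
for name, gib in configs.items():
    print(f"{name}: {gib:.1f} GiB (fits in {budget_gib} GiB: {gib <= budget_gib})")
```

Under these assumptions, a 13B model at 3-bit fits comfortably in 8 GiB while 30B at 3-bit does not, so the memory budget, not the bit-width alone, picks the best configuration.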
https://github.com/IST-DASLab/gptq is the repository mentioned in the OP, for anyone who comes across this thread.
@dalistarh this is just a gardening thing, but I submitted a PR to this repo to make it pip-installable. I briefly browsed your repo and think it should more or less just work there as well, if you want to borrow it (or maybe @qwopqwop200 will upstream their repo to yours). Just FYI!
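For reference, making a research repo pip-installable usually requires only a minimal packaging file at the repo root. A hypothetical sketch, not the contents of the actual PR (package name, version, and dependency are placeholders):

```toml
# pyproject.toml -- hypothetical minimal packaging config (placeholder values)
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

[project]
name = "gptq"             # placeholder; the real PR may use a different name
version = "0.1.0"
dependencies = ["torch"]  # assumed runtime dependency
```

With a file like this in place, `pip install .` from the repo root installs the package into the current environment.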
Dear Qwopqwop200,
I'm writing on behalf of the authors of the GPTQ paper. We have been following your excellent work, and wanted to mention that we added a few updates to our repository yesterday, which may be interesting to you:
In case you would be interested in collaborating more closely with us, please feel free to write us at dan.alistarh@ist.ac.at / elias.frantar@ist.ac.at
Best regards,
Dan