
GPTQ quantization(3 or 4 bit quantization) support for LLaMa #177

Closed
qwopqwop200 opened this issue Mar 6, 2023 · 215 comments
Labels
enhancement New feature or request

Comments

@qwopqwop200

qwopqwop200 commented Mar 6, 2023

GPTQ is currently the SOTA one-shot quantization method for LLMs.
GPTQ supports amazingly low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA.
I've actually confirmed that this works well on LLaMA-7B.
I haven't tested the memory usage (n-bit CUDA kernel), but I think it should work.

Perplexity results (lower is better):

| Model (LLaMA-7B) | Bits | group-size | Wikitext2 | PTB | C4 |
|---|---|---|---|---|---|
| FP16 | 16 | - | 5.67 | 8.79 | 7.05 |
| RTN | 4 | - | 6.28 | 9.68 | 7.70 |
| GPTQ | 4 | 64 | 6.16 | 9.66 | 7.52 |
| RTN | 3 | - | 25.66 | 61.25 | 28.19 |
| GPTQ | 3 | 64 | 12.24 | 16.77 | 9.55 |

code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
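For context, the RTN baseline in the table above is plain round-to-nearest quantization with no weight re-optimization. A minimal per-channel 4-bit RTN sketch in PyTorch (my own illustration, not code from the linked repo; GPTQ improves on this by adjusting the not-yet-quantized weights using second-order information after each column is quantized):

```python
import torch

def rtn_quantize(weight: torch.Tensor, bits: int = 4):
    """Round-to-nearest (RTN) quantization, one scale/zero-point per output channel."""
    qmax = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(weight / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero

def rtn_dequantize(q, scale, zero):
    return (q.float() - zero) * scale

# Quick check of the rounding error on a random matrix
w = torch.randn(4096, 4096)
q, scale, zero = rtn_quantize(w)
print((rtn_dequantize(q, scale, zero) - w).abs().mean())  # mean absolute quantization error
```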

@qwopqwop200 qwopqwop200 changed the title from "GPTQ quantization(4 bit quantization) support for LLaMa" to "GPTQ quantization(3 or 4 bit quantization) support for LLaMa" on Mar 6, 2023
@oobabooga
Owner

That's very interesting and promising @qwopqwop200. Do you think that this can be generalized to any model through some wrapper like this?

model = AutoModelForCausalLM.from_pretrained(...)
model = convert_to_4bit(model)

output_ids = model.generate(input_ids)

@qwopqwop200
Author

qwopqwop200 commented Mar 6, 2023

I think it's difficult when the model implementations aren't uniform.
For example, OPT and BLOOM are mostly similar, but their architectures differ in some parts.
For example, for positional embeddings, OPT uses LearnedPositionalEmbedding, while BLOOM uses ALiBi.
Because of these differences, some parts of the code have to differ.
However, most of the code is the same. If you handle these differences, I think you can be compatible with most (not all) Transformer architectures.

@oobabooga
Owner

Thanks for the clarifications. If my 2 brain cells did the math right, 4-bit would allow llama-30b to be loaded with about 20GB VRAM. Having that in the web UI would be very nice.
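A rough back-of-the-envelope check of that estimate (a sketch; the ~32.5B parameter count for llama-30b and the extra ~0.5 bit per weight for group scales/zero-points are assumptions):

```python
n_params = 32.5e9          # approximate parameter count of llama-30b (assumed)
bits_per_weight = 4.5      # 4-bit weights plus assumed per-group scale/zero overhead
weights_gib = n_params * bits_per_weight / 8 / 1024**3
print(f"~{weights_gib:.0f} GiB of weights")   # roughly 17 GiB; activations/KV cache push it toward ~20 GB
```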

@oobabooga oobabooga added the enhancement New feature or request label Mar 6, 2023
@oobabooga oobabooga pinned this issue Mar 6, 2023
@MetaIX
Contributor

MetaIX commented Mar 6, 2023

I would love to see this.. imagine the possibilities. Also, does this work on windows?

Kept getting this error.

[error screenshot]

I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.

[error screenshot]

@qwopqwop200
Author

> I would love to see this.. imagine the possibilities. Also, does this work on windows?
>
> Kept getting this error.
>
> [error screenshot]
>
> I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
>
> [error screenshot]

I am currently experimenting on Windows 11 and have the CUDA kernel installed.
If you can't install it on Windows, you can also use WSL2.

@oobabooga
Owner

Another question: I see no mention of temperature, top_p, top_k, etc in the code. Is it possible to use those parameters somehow?

@qwopqwop200
Author

My code is based on GPTQ, and GPTQ only provides benchmark code, for simplicity.
Therefore, you need to write separate code for inference, like this code.
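For reference, the sampling parameters asked about above (temperature, top_p, top_k, repetition_penalty) live in `model.generate` from transformers rather than in the quantization code, so a separate inference script mostly just has to rebuild the quantized model as a PyTorch module and call `generate` on it. A hedged sketch (the model path is a placeholder; in practice the model would come from the GPTQ repo's own loader):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "decapoda-research/llama-7b-hf"   # placeholder; swap in the GPTQ-quantized model/loader
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()

input_ids = tokenizer("The meaning of life is", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        do_sample=True,           # sampling must be enabled for the knobs below to take effect
        temperature=0.7,
        top_p=0.9,
        top_k=40,
        repetition_penalty=1.15,
        max_new_tokens=200,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```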

@musicurgy

musicurgy commented Mar 6, 2023

Already writing implementations for 4-bit, love it. How fast is the inference time when running llama 30B 4-bit on a 3090?

@oobabooga
Owner

oobabooga commented Mar 6, 2023

To be honest, it is not clear to me how to implement this because there is no inference code with some examples to follow. Also, without temperature, repetition_penalty, top_p and top_k (specifically those 4 parameters), the results would not be good. Maybe someone can help?

It seems like bitsandbytes will have int4 support soon huggingface/transformers#21955 (comment), but that will probably not be equivalent to GPTQ. Figure 1 in the paper shows a comparison between naive 4-bit quantization (which they call RTN, "round-to-nearest") and their approach, and it is clear that the difference is huge: https://arxiv.org/pdf/2210.17323.pdf

@dustydecapod
Contributor

I'm working on converting all the LLaMA variants to 3-bit; keep an eye on decapoda-research. I'll update here when they're available.

@oobabooga
Owner

Super, @zoidbb!

@xNul
Contributor

xNul commented Mar 6, 2023

> I would love to see this.. imagine the possibilities. Also, does this work on windows?
>
> Kept getting this error.
>
> [error screenshot]
>
> I assume this might be because I couldn't properly install the CUDA extension, as I was also met with this error.
>
> [error screenshot]

@MetaIX I received this error a while ago and, according to Google, it happens when you don't have NCCL installed.

@dustydecapod
Contributor

@qwopqwop200 are you aware of any 3-bit or 4-bit inference methods? I can't find anything beyond some theoretical proposal that never got implemented. Without an implementation of 3- or 4-bit inference, there's no way to go forward.

bitsandbytes will have 4-bit inference soon, at which point we should be able to load a 4-bit model quantized via GPTQ and use the bitsandbytes 4-bit inference function against it.

@MarkSchmidty

MarkSchmidty commented Mar 6, 2023

https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147
"Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision."
[figure from the tweet]

"The case for 4-bit precision: k-bit Inference Scaling Laws"
https://arxiv.org/abs/2212.09720

3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.

@MetaIX
Contributor

MetaIX commented Mar 6, 2023

@xNul Thanks for the info. I had some weird stuff going on in the env lol.

@qwopqwop200 So this should be relatively easy to implement since you already did most of the heavy lifting.

@dustydecapod
Contributor

https://huggingface.co/decapoda-research/llama-smallint-pt

Quantized checkpoints for 7B/13B/30B are available in both 3-bit and 4-bit. The 3-bit files are the same size as the 4-bit files, amusingly -- likely due to how they're packed. These are not wrapped with Transformers magic, so good luck. Also not sure how to use them for actual inference yet; I'll work that out later this week if no one else gets to it. There seem to be some clues in the OPT and BLOOM code inside the GPTQ repository.

65B is almost done quantizing; should have those up within the next couple of hours in the same repo.

@MarkSchmidty

[screenshot]

Something seems off. LLaMA-30B is ~60GB in fp16. I would expect it to be around 1/4 of that size in 4-bit, i.e. ~15GB.
12GB is considerably smaller and about the size I would expect 3-bit to be if it were stored efficiently (quick arithmetic check below).

If LLaMA-30B fits on a 16GB card in 4-bit with room to spare I'll be very very surprised.

Good work, either way! We're getting somewhere.
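The expected sizes are easy to sanity-check (quick sketch; the ~32.5B parameter count is an assumption and per-group scale/zero-point overhead is ignored):

```python
n_params = 32.5e9   # approximate parameter count of LLaMA-30B (assumed)
for bits in (16, 4, 3):
    print(f"{bits:>2}-bit: ~{n_params * bits / 8 / 1e9:.0f} GB")
# 16-bit: ~65 GB, 4-bit: ~16 GB, 3-bit: ~12 GB -- the observed 12GB lines up with 3-bit
```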

@dustydecapod
Contributor

dustydecapod commented Mar 7, 2023

Agreed, it's quite odd that the 4-bit output is this small. Once I better understand how this works (I haven't had a chance to dig in deep) I might know better why this is happening, and whether this result is incorrect.

@qwopqwop200
Author

> Agreed, it's quite odd that the 4-bit output is this small. Once I better understand how this works (I haven't had a chance to dig in deep) I might know better why this is happening, and whether this result is incorrect.

It's probably this small because it's 3-bit quantization.
As of now, the code does not support 4-bit quantization.

@qwopqwop200
Author

> https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "Our analysis is extensive, spanning 5 models (BLOOM, BLOOM, Pythia, GPT-2, OPT), from 3 to 8-bit precision, and from 19M to 66B scale. We find the same result again and again: bit-level scaling improves from 16-bit to 4-bit precision but reverses at 3-bit precision." [figure from the tweet]
>
> "The case for 4-bit precision: k-bit Inference Scaling Laws" https://arxiv.org/abs/2212.09720
>
> 3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.

https://arxiv.org/abs/2212.09720

That paper evaluates zero-shot quantization, and according to the paper GPTQ achieves more robust results at lower bits.
This can be seen in Table 1 and Figure 5 of the paper.

@dustydecapod
Contributor

So I don't think 3-bit is worth the effort. To gain real benefits, we would need a working, well-maintained 3-bit CUDA kernel. The CUDA kernel provided by the original GPTQ authors is extremely specialized and pretty much unmaintained by them or any community.

The benefits of GPTQ for 4-bit quantization are negligible vs RTN, so GPTQ really only has a place in 2/3-bit quantization. Eventually it would be nice to have this, but given the lack of a robust 3-bit CUDA kernel this is a non-starter for any real project.

Lastly, the engineering behind the original GPTQ codebase is suspect. There are bugs all over the place, and it's poorly organized and poorly documented. It would take more work to turn this into a useful library and maintain it than it's worth at present.

bitsandbytes will be releasing 4-bit support at some point relatively soon. I think it would be best to wait for that, as integration into the existing Transformers library should be straightforward from that point given the existing 8-bit quantization support.

My two cents: hold off on implementation until we see 4-bit from bitsandbytes.

@oobabooga
Owner

Taking a closer look at the plot, it seems like the difference between GPTQ and RTN at the ranges we are (or I am) most interested in (10-30b parameters) is indeed not that significant:

[plot]

The idea of lightly re-optimizing the weights to make up for the loss in accuracy is very appealing though. I hope that it will become a standard in the future.

@oobabooga
Owner

@zoidbb I am confused. Forgetting about 3-bit: will your converted GPTQ 4-bit weights be usable in transformers once the 4-bit bitsandbytes implementation is complete and integrated into transformers, or not?

@qwopqwop200
Author

> So I don't think 3-bit is worth the effort. To gain real benefits, we would need a working, well-maintained 3-bit CUDA kernel. The CUDA kernel provided by the original GPTQ authors is extremely specialized and pretty much unmaintained by them or any community.
>
> The benefits of GPTQ for 4-bit quantization are negligible vs RTN, so GPTQ really only has a place in 2/3-bit quantization. Eventually it would be nice to have this, but given the lack of a robust 3-bit CUDA kernel this is a non-starter for any real project.
>
> Lastly, the engineering behind the original GPTQ codebase is suspect. There are bugs all over the place, and it's poorly organized and poorly documented. It would take more work to turn this into a useful library and maintain it than it's worth at present.
>
> bitsandbytes will be releasing 4-bit support at some point relatively soon. I think it would be best to wait for that, as integration into the existing Transformers library should be straightforward from that point given the existing 8-bit quantization support.
>
> My two cents: hold off on implementation until we see 4-bit from bitsandbytes.

My code is just for experimentation. Therefore, it may be better to use bitsandbytes.

@qwopqwop200
Author

qwopqwop200 commented Mar 30, 2023

I changed the code to use Triton. I actually experienced a very high speedup.
Since Triton supports only Linux, Windows users should be encouraged to use WSL2.

@sgsdxzy
Contributor

sgsdxzy commented Mar 30, 2023

> I changed the code to use triton. I actually experienced a very high speedup. Since triton does not support Linux, Windows users should be encouraged to use WSL2.

I think you would mean "Since triton does not support Windows...?"
So is the triton version faster than the cuda branch?

@MarkSchmidty

Judging from pip statistics, there may be hundreds of thousands of people running this on Windows. So I propose not breaking Windows support for them.

That said, as a Linux user I am excited to hear about the speedup.

@musicurgy

> I changed the code to use Triton. I actually experienced a very high speedup. Since Triton supports only Linux, Windows users should be encouraged to use WSL2.

I look forward to trying it out. Can you quantify how much faster it is?

@jepjoo

jepjoo commented Mar 30, 2023

> I changed the code to use Triton. I actually experienced a very high speedup. Since Triton supports only Linux, Windows users should be encouraged to use WSL2.

Does oobabooga need to update the code to support this, or should it work simply by switching to your triton branch and running "python setup_cuda.py install"?

@mstnegate

Update from me: int4matmul_kernels supports group quantization and dense int3 matmul now. If you're using a non --actorder model with CUDA, you might be able to get a nice speedup by swapping kernels.

Small informal speed test I ran gave median generation time of ~19s on GPTQ-for-LLaMa and ~4.8s with int4matmul_kernels (commit 610fdae of GPTQ-for-LLaMa, 2fde50a of webui; LLaMA-7B int3 g128, default sampler, 80 tokens with 1968 context, --no-stream, run on RTX 3080 10G.) As a bonus it also doesn't have to materialize a weights matrix. YMMV depending on hardware and model size, as usual.

I haven't compared against triton yet (too lazy to set up WSL2), but I'd expect triton to be slightly faster assuming it properly specializes matvec mults.

Note that this won't work with --actorder models, since I implemented activation order completely differently in reduced-kobold. Otherwise, the only other note is that you'll need to unpack zeros data; there's already code in qwopqwop200's repo for that where it materializes weights matrices.

On a less practical note, reduced-kobold also supports int3 and group quant now. Interestingly, group quantization didn't benefit 4-bit sparse much. 4-bit sparse seems to still outperform int3 without group quantization (~7.85 ppl vs. 8.07) even with newer stuff like activation order. I also got better results with int3 g128 than I expected (~6.48 ppl vs. 6.61) for no apparent reason so maybe it's just weird luck with calibration data. Who knows.
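On the "unpack zeros" note above: GPTQ-style checkpoints commonly pack eight 4-bit values into each int32, so unpacking is just shifts and masks. A generic sketch (my own; the exact bit layout in any particular repo or commit may differ):

```python
import torch

def unpack_int4(packed: torch.Tensor) -> torch.Tensor:
    """Unpack int32s holding eight 4-bit values each, lowest bits first."""
    shifts = torch.arange(0, 32, 4, device=packed.device)    # 0, 4, ..., 28
    vals = (packed.unsqueeze(-1) >> shifts) & 0xF             # [..., 8]
    return vals.reshape(*packed.shape[:-1], packed.shape[-1] * 8)

packed = torch.tensor([[0x76543210]], dtype=torch.int32)
print(unpack_int4(packed))   # tensor([[0, 1, 2, 3, 4, 5, 6, 7]])
```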

@oobabooga oobabooga unpinned this issue Apr 9, 2023
@Ph0rk0z
Contributor

Ph0rk0z commented Apr 9, 2023

Same story for me as with Triton: no Pascal support. These are really the cheapest 24GB cards right now and performance isn't that bad. A 30B model replying in ~30s at pretty much full context for $200 if you get a P40...

@gandolfi974

> P40

How many tokens per second do you get with a P40 and a 30B model at 4-bit?

What cheap card would you advise with a Ryzen 5 2400G and a B450M motherboard? P40, M40, Mi25...? I want to run Vicuna 30B properly.

Thanks

@MarkSchmidty

MarkSchmidty commented Apr 29, 2023

M series cards like the M40 can't do 4-bit. But here's a table for you:

| Card | Price | VRAM | $/GB | 30B tokens/s | $/(token/s) |
|---|---|---|---|---|---|
| P40 | $200 | 24GB | $8.33 | 8 | $25.00 |
| 3090 | $600 | 24GB | $25.00 | 10 | $60.00 |
| a6000 | $1800 | 48GB | $37.50 | 10 | $180.00 |
| 4090 | $1400 | 24GB | $58.33 | 12 | $116.67 |
| A100 | $5500 | 40GB | $114.58 | ?? | ?? |
| A100 | $9600 | 80GB | $120.00 | ?? | ?? |
| 6000-ada | $6800 | 48GB | $141.67 | 12 | $566.67 |

Recommendation: P40
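To make the derived columns explicit, they are just price divided by VRAM and by tokens/s (a quick sketch; the A100 rows are left out because their throughput is unknown):

```python
cards = {  # name: (price $, VRAM GB, 30B tokens/s)
    "P40": (200, 24, 8),
    "3090": (600, 24, 10),
    "a6000": (1800, 48, 10),
    "4090": (1400, 24, 12),
    "6000-ada": (6800, 48, 12),
}
for name, (price, vram, tps) in cards.items():
    print(f"{name:>8}: ${price / vram:6.2f}/GB   ${price / tps:7.2f} per token/s")
```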

@gandolfi974

Hello, thank you very much, this is exactly what I have been looking for for weeks. Do you have any feedback on the Radeon MI25? That graphics card is not expensive either.

@qwopqwop200
Author

qwopqwop200 commented Apr 30, 2023

I created a new branch; the previous CUDA code is kept as the old-cuda branch, and the new one is a speedup over it. @oobabooga

It provides approximately 25% faster speeds than the previous branch.

@Ph0rk0z
Contributor

Ph0rk0z commented Apr 30, 2023

@gandolfi974 be careful.. I have a B450 and with P40 my board did not boot. I had a 1700x though, maybe something different about that generation of proc and memory management.

@qwopqwop200 This branch doesn't support act order with group size as before? Or act order isn't supported at all?

@qwopqwop200
Author

This branch is unsupported only when group size and act-order are used together.

@gandolfi974

gandolfi974 commented Apr 30, 2023

> be careful.. I have a B450 and with P40 my board did not boot. I had a 1700x though, maybe something different about that generation of proc and memory management.

Thanks.
Interesting link for M40 installation: https://miyconst.github.io/hardware/gpu/nvidia/2021/05/23/nvidia-tesla-m40.html

  • Do you have "Above 4G Decoding" activated in the BIOS?
  • Disable CSM - enable UEFI
  • Install iGPU drivers
  • Make sure legacy boot stuff is completely disabled

So, on which hardware do you use your P40 card?

@Ph0rk0z
Contributor

Ph0rk0z commented May 1, 2023

Yes, all that was enabled. Maybe it couldn't handle 2 24gb cards together with the P6000. I do not have onboard video so couldn't test that.

I use my P40 on this

It feels slower than the P6000

@MarkSchmidty

The P6000 has a ~15% higher GPU clock. I have one and also notice it is slightly faster. Not worth paying 4x more though, IMHO, especially when you can get a 3090 for the same price.

@Ph0rk0z
Contributor

Ph0rk0z commented May 1, 2023

For me it was $400 for the P6000 or $600-700 for the 3090. Plus the 3090 wouldn't work in earlier Windows versions. Now I've got one going in the server, so the joke is on me.

@MarkSchmidty

eBay prices for the P6000 went up in the past couple of months to $600-$700, while 3090s are down to $400-$600. The GPU market is strange.

@Ph0rk0z
Contributor

Ph0rk0z commented May 1, 2023

I paid closer to $700 with taxes for a 3090 now. Maybe if I had bought in March I would have paid $400. Now they are rising again. I think you can still get one if you bid. AI getting popular?

@gandolfi974

I have bought a P40 on eBay ($199 USD).
I will buy a cable like this for my motherboard (B450M MSI Bazooka V2): https://fr.aliexpress.com/item/1005005346642068.html

@MarkSchmidty

MarkSchmidty commented May 2, 2023

> I have bought a P40 on eBay ($199 USD). I will buy a cable like this for my motherboard (B450M MSI Bazooka V2): fr.aliexpress.com/item/1005005346642068.html

Don't forget to get a cooling shroud if you don't have a server with blowers already. The P40 does not come with its own cooling.

Look up "p40 fan kit" on eBay. Should be about $20, give or take $10.

@gandolfi974

  • Do you recommend a specific OS for better performance with the P40 and GPTQ?
  • Is 16 GB of CPU RAM OK?

@MarkSchmidty

MarkSchmidty commented May 3, 2023

Models will be slower to load initially if you don't have as much RAM as the model is large. For some operations (quantization, for example) you want as much RAM as the 16-bit version of the model. But you can get away with just having swap space, especially if your drive is NVMe.

In short, 16GB is okay but suboptimal. Just make sure you have enough swap space.

I use Arch, but just about any Linux distro should be fine. Ubuntu 22.04 LTS is a good option; since it is "Long Term Support" it's less likely to have updates that will break things. That's generally a good idea if you don't need bleeding-edge versions of things.

@gandolfi974

gandolfi974 commented May 4, 2023

> Models will be slower to load initially if you don't have as much RAM as the model is large. For some operations (quantization, for example) you want as much RAM as the 16-bit version of the model. But you can get away with just having swap space, especially if your drive is NVMe.
>
> In short, 16GB is okay but suboptimal. Just make sure you have enough swap space.
>
> I use Arch, but just about any Linux distro should be fine. Ubuntu 22.04 LTS is a good option; since it is "Long Term Support" it's less likely to have updates that will break things. That's generally a good idea if you don't need bleeding-edge versions of things.

Would you advise a cheap server or a second-hand config to use the P40 with the 30B 4-bit version?
A friend is offering me an ML350 Gen9 (64 GB RAM, 2 TB SSD, dual processor).

@Ph0rk0z
Contributor

Ph0rk0z commented May 6, 2023

An ML350 sounds good if the cards will fit. Mine is a https://www.supermicro.com/products/system/4U/4028/SYS-4028GR-TRT.cfm
The 3090 power plugs prevent me from closing the top cover.

@shouyiwang
Contributor

> M series cards like the M40 can't do 4-bit. But here's a table for you:
>
> | Card | Price | VRAM | $/GB | 30B tokens/s | $/(token/s) |
> |---|---|---|---|---|---|
> | P40 | $200 | 24GB | $8.33 | 8 | $25.00 |
> | 3090 | $600 | 24GB | $25.00 | 10 | $60.00 |
> | a6000 | $1800 | 48GB | $37.50 | 10 | $180.00 |
> | 4090 | $1400 | 24GB | $58.33 | 12 | $116.67 |
> | A100 | $5500 | 40GB | $114.58 | ?? | ?? |
> | A100 | $9600 | 80GB | $120.00 | ?? | ?? |
> | 6000-ada | $6800 | 48GB | $141.67 | 12 | $566.67 |
>
> Recommendation: P40

Where did you get these numbers from? My 4090 is quicker at generating longer outputs with 4-bit GPTQ 30B, around 15-18 tokens per second. However, it appears to be limited by my Ryzen 5600 CPU, as a single core is always at 100% when producing the outputs.

@MarkSchmidty
Copy link

MarkSchmidty commented Jun 3, 2023

These are figures I compiled based on word of mouth and personal experience. There have been ~30% speed improvements for newer hardware since then. There's also a ~20% difference between CUDA on Windows vs Triton on Linux, and other considerations (like single-core CPU performance).

More recently, the "exllama" fork of Transformers gives a 150-200% speed-up on the 4090. https://github.com/turboderp/exllama/

You could be getting up to 45 tokens/second on 30B with full context (6x faster than a P40). Check it out. :)

@Ph0rk0z
Contributor

Ph0rk0z commented Jun 3, 2023

Heh... Triton is slower no matter what on the 3090. The 3090 is closer to $700 used now, or more; it keeps going up. I did 2x3090 vs 3090+P40 for LLaMA 65B using the llama_offload method and speed went from 1.80 t/s to 2.30 t/s... not as huge a jump as I thought. Maybe it would be better with exllama, but I bet that uses accelerate to split, and that likes to OOM at 12-1500 context.

From everyone's 4090 benches, it looks to be much faster than the 3090. As the 3090 price rises, it may make more sense to just buy a 4090.

As for pegging the CPU... on all my computers it maxes out a core at 100%; that is because Python is single-threaded for this. Ryzen or 16-core Xeon, it doesn't matter. Some, like the old RWKV, at least used all cores during loading.
