GPTQ quantization (3 or 4 bit quantization) support for LLaMa #177
Comments
That's very interesting and promising @qwopqwop200. Do you think that this can be generalized to any model through some wrapper like this?
I think it's difficult when the model implementation isn't consistent across architectures.
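For intuition, here is a minimal PyTorch sketch of the kind of wrapper being discussed: walk an arbitrary model, collect its nn.Linear modules, and hand each one to whatever quantizer you like. This is a hypothetical illustration, not code from GPTQ-for-LLaMa (though the GPTQ code has a similar `find_layers` helper, if I recall correctly).

```python
import torch.nn as nn

def find_linear_layers(module, prefix=""):
    """Recursively collect the nn.Linear submodules of an arbitrary model.

    Sketch only: a generic GPTQ-style wrapper could quantize whatever Linear
    layers it finds, regardless of the surrounding architecture, as long as
    the model is built from standard nn.Linear blocks.
    """
    layers = {}
    for name, child in module.named_children():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(child, nn.Linear):
            layers[full_name] = child
        else:
            layers.update(find_linear_layers(child, full_name))
    return layers

# Hypothetical usage:
#   layers = find_linear_layers(model)
#   for name, layer in layers.items():
#       quantize(layer.weight)   # 'quantize' is a placeholder, not a real API
```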
Thanks for the clarifications. If my 2 brain cells did the math right, 4-bit would allow llama-30b to be loaded with about 20GB VRAM. Having that in the web UI would be very nice.
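A rough back-of-the-envelope check of that estimate (the parameter count and the overhead figure are assumptions, not measurements):

```python
# Rough VRAM estimate for LLaMA-30B at 4-bit.
params = 32.5e9            # assumption: LLaMA-30B has ~32.5B parameters
bytes_per_weight = 0.5     # 4 bits = 0.5 bytes
weights_gib = params * bytes_per_weight / 1024**3
overhead_gib = 3           # assumption: activations + KV cache at typical context
print(f"weights ~{weights_gib:.1f} GiB, total ~{weights_gib + overhead_gib:.1f} GiB")
# -> weights ~15.1 GiB, total ~18.1 GiB, i.e. roughly the ~20GB mentioned above
```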
I am currently experimenting on Windows 11 and have installed the CUDA kernel.
Another question: I see no mention of temperature, top_p, top_k, etc. in the code. Is it possible to use those parameters somehow?
My code is based on GPTQ, and GPTQ only supports benchmark code for simplicity.
Already writing implementations for 4-bit, love it. How fast is the inference time when running llama 30B 4-bit on a 3090?
To be honest, it is not clear to me how to implement this because there is no inference code with some examples to follow. It seems like bitsandbytes will have int4 support soon huggingface/transformers#21955 (comment), but that will probably not be equivalent to GPTQ. Figure 1 in the paper shows a comparison between naive 4-bit quantization (which they call RTN, "round-to-nearest") and their approach, and it is clear that the difference is huge: https://arxiv.org/pdf/2210.17323.pdf
I'm working on converting all the llama variants to 3-bit, keep an eye on the decapoda-research. I'll update here when they're available.
Super, @zoidbb!
@MetaIX I received this error a while ago and, according to Google, it happens when you don't have NCCL installed.
@qwopqwop200 are you aware of any 3-bit or 4-bit inference methods? I can't find anything beyond some theoretical proposal that never got implemented. Without an implementation of 3- or 4-bit inference, there's no way to go forward. bitsandbytes will have 4-bit inference soon, at which point we should be able to load a 4-bit model quantized via GPTQ and use the bitsandbytes 4-bit inference function against it.
https://mobile.twitter.com/Tim_Dettmers/status/1605209177919750147 "The case for 4-bit precision: k-bit Inference Scaling Laws" 3-bit inference results were not too promising across these models in that paper. Their conclusion was that 4-bit is the sweet spot. I expect 4-bit will be superior quality. I would love to be surprised though.
@xNul Thanks for the info. I had some weird stuff going on in the env lol. @qwopqwop200 So this should be relatively easier to implement since you already did most of the heavy lifting.
https://huggingface.co/decapoda-research/llama-smallint-pt Quantized checkpoints for 7b/13b/30b are available in both 3-bit and 4-bit. The 3-bit files are the same size as the 4-bit files, amusingly -- likely due to how they're packed. These are not wrapped with Transformers magic, so good luck. Also not sure how to use them for actual inference yet. Will work that out later this week if no one else gets to it. There seem to be some clues in the OPT and BLOOM code inside the GPTQ repository. 65b is almost done quantizing, should have those up within the next couple of hours in the same repo.
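A side note on the "how they're packed" point: if 3-bit codes were packed densely, the files should come out roughly 25% smaller than the 4-bit ones. A toy illustration of dense bit packing (this is not the repo's actual packing code, just arbitrary-precision packing in Python):

```python
import numpy as np

def pack_codes(values, bits):
    """Pack small unsigned integer codes into a dense little-endian bitstream."""
    buf = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits), "code out of range for the given bit width"
        buf |= int(v) << (i * bits)
    n_bytes = (len(values) * bits + 7) // 8
    return buf.to_bytes(n_bytes, "little")

codes = np.random.randint(0, 8, size=4096)   # 3-bit codes (0..7)
print(len(pack_codes(codes, 3)))             # 4096 * 3 / 8 = 1536 bytes
print(len(pack_codes(codes, 4)))             # 4096 * 4 / 8 = 2048 bytes
```

So identical 3-bit and 4-bit file sizes would suggest the 3-bit codes are being stored in the same-width containers as the 4-bit ones rather than packed densely; that is one possible (speculative) explanation for the observation above.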
Something seems off. LLaMA-30B is ~60GB in fp16. I would expect it to be around 1/4 of that size in 4-bit, i.e. 15GB. If LLaMA-30B fits on a 16GB card in 4-bit with room to spare I'll be very very surprised. Good work, either way! We're getting somewhere.
Agreed, it's quite odd that the 4-bit output is this small. Once I better understand how this works (I haven't had a chance to dig in deep) I might know better why this is happening, and whether this result is incorrect.
It's probably this small because it's 3-bit quantization.
https://arxiv.org/abs/2212.09720 That paper is about zero-shot quantization, and according to it, GPTQ achieves more robust results at lower bit widths.
So I don't think 3-bit is worth the effort. To gain real benefits, we would need a working, well-maintained 3-bit CUDA kernel. The CUDA kernel provided by the original GPTQ authors is extremely specialized and pretty much unmaintained by them or any community. The benefits of GPTQ for 4-bit quantization are negligible vs RTN, so GPTQ really only has a place in 2/3-bit quantization. Eventually it would be nice to have this, but given the lack of a robust 3-bit CUDA kernel this is a non-starter for any real project. Lastly, the engineering behind the original GPTQ codebase is suspect. There are bugs all over the place, and it's poorly organized and poorly documented. It would take more work to turn this into a useful library and maintain it than is worth it at the moment. bitsandbytes will be releasing 4-bit support at some point relatively soon. I think it would be best to wait for that, as integration into the existing Transformers library should be straightforward from that point given the existing 8-bit quantization support. My two cents: hold off on implementation until we see 4-bit from bitsandbytes.
Taking a closer look at the plot, it seems like the difference between GPTQ and RTN at the ranges we are (or I am) most interested in (10-30b parameters) is indeed not that significant. The idea of lightly re-optimizing the weights to make up for the loss in accuracy is very appealing though. I hope that it will become a standard in the future.
@zoidbb I am confused, forgetting about 3-bit, will your converted GPTQ 4-bit weights be usable in transformers when the 4-bit bitsandbytes implementation is complete and integrated into transformers or not?
My code is just for experimentation. Therefore, it may be better to use bitsandbytes.
I changed the code to use triton. I actually experienced a very high speedup.
I think you would mean "Since triton does not support Windows...?"
Judging from pip statistics, there may be hundreds of thousands of people running this on Windows. So I propose not breaking Windows support for them. That said, as a Linux user I am excited to hear about the speedup.
I look forward to trying it out, can you quantify how much faster?
Does oobabooga need to update its code to support this, or should it work simply by switching to your triton branch and running "python setup_cuda.py install"?
Update from me: int4matmul_kernels supports group quantization and dense int3 matmul now. If you're using a non[…]

A small informal speed test I ran gave a median generation time of ~19s on GPTQ-for-LLaMa and ~4.8s with int4matmul_kernels (commit 610fdae of GPTQ-for-LLaMa, 2fde50a of webui; LLaMA-7B int3 g128, default sampler, 80 tokens with 1968 context). I haven't compared against triton yet (too lazy to set up WSL2), but I'd expect triton to be slightly faster assuming it properly specializes matvec mults. Note that this won't work with[…]

On a less practical note, reduced-kobold also supports int3 and group quant now. Interestingly, group quantization didn't benefit 4-bit sparse much. 4-bit sparse seems to still outperform int3 without group quantization (~7.85 ppl vs. 8.07) even with newer stuff like activation order. I also got better results with int3 g128 than I expected (~6.48 ppl vs. 6.61) for no apparent reason, so maybe it's just weird luck with the calibration data. Who knows.
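For a rough sense of what those medians mean in tokens per second (assuming both timings cover the same 80 generated tokens, which the comment implies but doesn't state outright):

```python
# Back-of-the-envelope throughput from the informal speed test above.
new_tokens = 80
t_gptq_for_llama = 19.0   # seconds, median generation time
t_int4matmul = 4.8        # seconds, median generation time
print(f"GPTQ-for-LLaMa : {new_tokens / t_gptq_for_llama:.1f} tokens/s")  # ~4.2
print(f"int4matmul     : {new_tokens / t_int4matmul:.1f} tokens/s")      # ~16.7
```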
Same story for me as triton: no Pascal support. These are really the cheapest 24GB cards right now and the performance isn't that bad. A 30B model replying in ~30s at pretty much full context for $200 if you get a P40....
How many tokens per second do you get with a P40 on a 30B model at 4-bit? What cheap card would you advise with a Ryzen 5 2400G and a B450M motherboard? P40, M40, MI25...? I want to use Vicuna 30B properly. Thanks.
M-series cards like the M40 can't do 4-bit. But here's a table for you:
Recommendation: P40
Hello, thank you very much, this is exactly what I have been looking for for weeks. Do you have a review of the Radeon MI25? That graphics card is not expensive either.
I created a new branch called old-cuda. It is a speedup over the existing old cuda branch. @oobabooga It provides approximately 25% faster speeds than the conventional branches.
@gandolfi974 be careful.. I have a B450 and with a P40 my board did not boot. I had a 1700X though, maybe something is different about that generation of processor and memory management. @qwopqwop200 Does this branch not support act-order with group size, as before? Or is act-order not supported at all?
This branch only lacks support for group size and act-order used together.
Thanks. So, on which hardware do you use your P40 card?
Yes, all that was enabled. Maybe it couldn't handle 2 24GB cards together with the P6000. I do not have onboard video so couldn't test that. I use my P40 on this. It feels slower than the P6000.
P6000 has a ~15% higher GPU clock. I have one and also notice it is slightly faster. Not worth paying 4x more though imho, especially when you can get a 3090 for the same price.
For me it was $400 for the P6000 or $6-700 for the 3090. Plus the 3090 wouldn't work in earlier Windows. Now I got one going in the server so the joke is on me.
Ebay prices for the P6000 went up in the past couple of months to $600-$700, while 3090s are down to $400-$600. The GPU market is strange.
I paid closer to $700 with taxes now for a 3090.. maybe if I had bought in March I would have paid $400. Now they are rising again. I think you can still get one if you bid. AI getting popular?
I bought a P40 on eBay ($199 USD).
Don't forget to get a cooling shroud if you don't have a server with blowers already. The P40 does not come with its own cooling. Look up "p40 fan kit" on eBay. Should be about $20, give or take $10.
Models will be slower to initially load if you don't have as much RAM as the models are large. For some operations (quantization, for example) you want as much RAM as the 16-bit version of the model. But you can get away with just having swap space, especially if your drive is NVMe. In short, 16GB is okay but suboptimal. Just make sure you have enough swap space. I use Arch, but just about any Linux distro should be fine. Ubuntu 22.04 LTS is a good option. Since it is a Long Term Support release, it's less likely to have updates that will break things. That's generally a good idea if you don't need bleeding-edge versions of things.
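A quick way to sanity-check that advice on your own machine (this sketch assumes the third-party psutil package is installed, and the ~32.5B parameter count for LLaMA-30B is an approximation):

```python
import psutil  # assumption: psutil is installed (pip install psutil)

gib = 1024 ** 3
vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"RAM : {vm.total / gib:.1f} GiB total, {vm.available / gib:.1f} GiB available")
print(f"Swap: {sm.total / gib:.1f} GiB total, {sm.free / gib:.1f} GiB free")

# Rule of thumb from the comment above: for quantization you want RAM + swap
# to cover the fp16 checkpoint, i.e. roughly 2 bytes per parameter.
params_30b = 32.5e9  # assumption: LLaMA-30B has ~32.5B parameters
print(f"fp16 LLaMA-30B needs roughly {params_30b * 2 / gib:.0f} GiB of RAM + swap")
```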
Would you advise a cheap server or a second-hand config to use the P40 for the 30B 4-bit version?
ML350 sounds good if the cards will fit. Mine is a https://www.supermicro.com/products/system/4U/4028/SYS-4028GR-TRT.cfm
These numbers, where did you get them from? My 4090's 4-bit GPTQ 30B is quicker at generating longer outputs, around 15-18 tokens per second. However, it appears to be limited by my Ryzen 5600 CPU, as a single core is always at 100% when producing the outputs.
These are figures I compiled based on word of mouth and personal experience. There have been ~30% speed improvements for newer hardware since then. There's also a ~20% difference between CUDA on Windows vs Triton on Linux, and other considerations (like single-core CPU performance). More recently, the "exllama" fork of Transformers gives a 150-200% speed-up on the 4090. https://github.com/turboderp/exllama/ You could be getting up to 45 tokens/second on 30B with full context (6x faster than a P40). Check it out. :)
Heh.. triton is slower no matter what on a 3090. The 3090 is closer to $700 used now, or more; keeps going up. I did 2x3090 vs 3090+P40 for llama 65b using the llama_offload method and speed went from 1.80 t/s to 2.30 t/s.. not as huge of a jump as I thought. Maybe it would be better with exllama, but I bet that uses accelerate to split, and that likes to OOM at 12-1500 context. From everyone's 4090 benches, it looks to be much faster than the 3090. As the 3090 price rises, it may make more sense to just buy that. As for pegging the CPU.. on all my computers it maxes a core at 100%; that is because python is single-threaded for this. Ryzen or 16-core Xeon, doesn't matter. Some, like the old RWKV, at least used all cores during loading.
GPTQ is currently the SOTA one shot quantization method for LLMs.
GPTQ supports amazingly low 3-bit and 4-bit weight quantization. And it can be applied to LLaMa.
I've actually confirmed that this works well in LLaMa 7b.
I haven't tested the memory usage (n-bit CUDA kernel), but I think it should work.
code: https://github.com/qwopqwop200/GPTQ-for-LLaMa
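For readers unfamiliar with weight quantization, here is a minimal PyTorch sketch of plain per-row round-to-nearest (RTN) quantization, just to show the basic scale/zero-point arithmetic that any k-bit weight quantizer shares. This is not GPTQ itself: GPTQ quantizes weights column by column and updates the remaining full-precision weights to compensate for the quantization error, using second-order information from calibration data, which is what lets it stay accurate at 3 and 4 bits.

```python
import torch

def quantize_rtn(weight, bits=4):
    """Toy per-row asymmetric round-to-nearest quantization (for intuition only)."""
    qmax = 2 ** bits - 1
    w_min = weight.min(dim=1, keepdim=True).values
    w_max = weight.max(dim=1, keepdim=True).values
    scale = (w_max - w_min).clamp(min=1e-8) / qmax   # one scale per output row
    zero = torch.round(-w_min / scale)               # integer zero point per row
    q = torch.clamp(torch.round(weight / scale) + zero, 0, qmax)
    return q.to(torch.uint8), scale, zero

def dequantize(q, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return (q.float() - zero) * scale

w = torch.randn(256, 256)
q, s, z = quantize_rtn(w, bits=4)
print((w - dequantize(q, s, z)).abs().mean())  # mean absolute quantization error
```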