At the moment of writing, there are two general-purpose post-training quantization methods with maintained libraries: GPTQ (AutoGPTQ) and AWQ (AutoAWQ).
It's a bit tricky to choose which approach to use, since different models might show different results.
For instance, in this comparison with Llama-2-7B (on a 3090), GPTQ shows much lower VRAM usage and faster token generation, albeit with slightly higher perplexity.
But in this test by Hugging Face (on an A100), AWQ with Zephyr-7B shows slightly lower VRAM usage (at small batch sizes) and higher throughput (at larger batch sizes), but higher latency.
I did a quick sanity check on an A10G with facebook/opt-125m, and AWQ was around 30% slower at generating 1k tokens.
So, overall, I'd say it's better to stick with AutoGPTQ for now.
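For reference, here is a minimal sketch of that kind of timing check, assuming GPTQ- and AWQ-quantized checkpoints of the same model are already on the Hub (the repo ids below are placeholders, not real checkpoints) and that the backends `transformers` needs to load them (optimum/auto-gptq, autoawq) are installed:

```python
# Time ~1k greedy tokens for two quantized checkpoints and compare wall-clock time.
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def time_generation(repo_id: str, max_new_tokens: int = 1024) -> float:
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto")
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    torch.cuda.synchronize()
    start = time.perf_counter()
    # min_new_tokens keeps both runs generating the same number of tokens.
    model.generate(**inputs, max_new_tokens=max_new_tokens, min_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    return time.perf_counter() - start


for repo in ("my-org/opt-125m-gptq", "my-org/opt-125m-awq"):  # placeholder repo ids
    print(repo, f"{time_generation(repo):.1f}s")
```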
The original implementation that we use is slow and hardcodes both the list of calibration datasets (https://github.com/Lightning-AI/lit-gpt/blob/e095ed300dd9ffbca89c6416eeb056b08869721f/quantize/gptq.py#L474-L481) and the layers to quantize (https://github.com/Lightning-AI/lit-gpt/blob/e095ed300dd9ffbca89c6416eeb056b08869721f/quantize/gptq.py#L498-L516).
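As a hypothetical illustration of the direction this could take (not existing lit-gpt code), the calibration data and the layers to quantize could simply become caller-supplied arguments instead of hardcoded lists:

```python
# Hypothetical sketch only: the function name and signature are placeholders,
# and the GPTQ update itself is elided.
from typing import Iterable, Sequence

import torch


def gptq_quantize(
    model: torch.nn.Module,
    calibration_batches: Sequence[torch.Tensor],
    layer_names: Iterable[str],
) -> torch.nn.Module:
    """Quantize only the requested submodules, calibrating on the given batches."""
    modules = dict(model.named_modules())
    for name in layer_names:
        layer = modules[name]
        assert isinstance(layer, torch.nn.Linear), f"{name} is not a Linear layer"
        # ... run the GPTQ weight update for `layer` using `calibration_batches` ...
    return model
```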
One example could be https://github.com/PanQiWei/AutoGPTQ. The library is specific to huggingface/transformers modules, but it could be forked to support regular torch.nn.Modules. Any other suggestions are appreciated.
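For context, quantizing a transformers model with AutoGPTQ looks roughly like the sketch below (based on the library's quickstart at the time; exact arguments may differ across versions, and the calibration text and output directory are arbitrary). It also shows why the entry point is tied to transformers model classes rather than plain torch.nn.Modules:

```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"  # arbitrary output directory

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# A real run would use a proper calibration set (e.g. a few hundred samples).
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# The wrapper builds on transformers' causal LM classes, which is why arbitrary
# torch.nn.Module models would need a fork of the library.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```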