Replace our GPTQ implementation with something better #583

Closed
carmocca opened this issue Sep 26, 2023 · 1 comment · Fixed by #889
Labels: enhancement (New feature or request), quantization

Comments

@carmocca
Contributor

The original implementation that we use is slow, and it hardcodes both the list of datasets to use (https://github.com/Lightning-AI/lit-gpt/blob/e095ed300dd9ffbca89c6416eeb056b08869721f/quantize/gptq.py#L474-L481) and the layers to quantize (https://github.com/Lightning-AI/lit-gpt/blob/e095ed300dd9ffbca89c6416eeb056b08869721f/quantize/gptq.py#L498-L516).

One example could be https://github.com/PanQiWei/AutoGPTQ. The library is specific to huggingface/transformers modules, but it could be forked to support regular torch.nn.Modules.
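
For reference, here's a rough sketch of how AutoGPTQ is typically driven (the checkpoint name, calibration text, and output directory below are just placeholders, not a proposal for our API):

```python
# Minimal sketch of AutoGPTQ post-training quantization (placeholder model/data).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data is passed in by the caller; nothing is hardcoded.
examples = [tokenizer("Example calibration text for GPTQ.", return_tensors="pt")]

quantize_config = BaseQuantizeConfig(
    bits=4,          # quantize weights to 4 bits
    group_size=128,  # one scale/zero-point per group of 128 weights
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)                    # run GPTQ against the calibration samples
model.save_quantized("opt-125m-4bit-gptq")  # placeholder output directory
```

Note the coupling to transformers: `from_pretrained` expects a Hugging Face checkpoint, which is the part that would need forking (or adapting) to work with plain torch.nn.Modules.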

Any other suggestions are appreciated.

@Andrei-Aksionov
Collaborator

At the time of writing, there are two general-purpose post-training quantization libraries to consider: AutoGPTQ and AutoAWQ.

It's a bit tricky to choose which approach to use, since different models can show different results.
For instance, in this comparison with Llama-2-7B (on a 3090), GPTQ shows much lower VRAM usage and faster token generation, albeit with slightly higher perplexity.
But in this test by Hugging Face (on an A100), AWQ with Zephyr-7B shows slightly lower VRAM usage (at small batch sizes) and higher throughput (at larger batch sizes), but higher latency.

I did a quick sanity check on an A10G with facebook/opt-125m, and AWQ was around 30% slower at generating 1k tokens.
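
(The exact script isn't included here; below is a minimal sketch of how such a timing comparison could be reproduced, assuming pre-quantized GPTQ and AWQ checkpoints of opt-125m are already available. The local checkpoint paths are hypothetical.)

```python
# Sketch of a GPTQ vs. AWQ generation-latency comparison.
# The quantized checkpoint paths are hypothetical placeholders.
import time
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM
from awq import AutoAWQForCausalLM

def time_generation(model, tokenizer, max_new_tokens=1024):
    inputs = tokenizer("Once upon a time", return_tensors="pt").to("cuda")
    torch.cuda.synchronize()
    start = time.perf_counter()
    # Force exactly max_new_tokens so both models do the same amount of work.
    model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        min_new_tokens=max_new_tokens,
        do_sample=False,
    )
    torch.cuda.synchronize()
    return time.perf_counter() - start

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

gptq_model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit-gptq", device="cuda:0")
awq_model = AutoAWQForCausalLM.from_quantized("opt-125m-4bit-awq")

print(f"GPTQ: {time_generation(gptq_model, tokenizer):.2f} s for 1k tokens")
print(f"AWQ:  {time_generation(awq_model, tokenizer):.2f} s for 1k tokens")
```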

So, overall, I'd say it's better to stick with AutoGPTQ for now.
