AutoGPTQ integration #924
Closed
Andrei-Aksionov wants to merge 41 commits into Lightning-AI:main from Andrei-Aksionov:autogptq_integration
Conversation
Andrei-Aksionov requested review from awaelchli, carmocca and lantiga as code owners on February 12, 2024 16:14
Andrei-Aksionov force-pushed the autogptq_integration branch from 0a8a0b1 to 3f69b53 (February 18, 2024 14:18)
Andrei-Aksionov force-pushed the autogptq_integration branch from 3f69b53 to 68d8d61 (February 18, 2024 14:36)
Andrei-Aksionov force-pushed the autogptq_integration branch from 58e4c74 to 3db21c0 (March 1, 2024 14:30)
Sadly, it won't be merged, so I'm closing it. I spent quite a lot of time on this and was proud (and still am) of how neatly I managed to integrate into LitGPT something that is so tied to HF Transformers.
Hi there 👋
This is a bit of a late response to #583
The task itself turned out to be quite large, so in order to speed up the process (and to simplify life for those who will review the PR) I decided to include only the basics: the code can quantize a model and run inference, and it supports all the AutoGPTQ kernels. The remaining AutoGPTQ functionality will be added in subsequent pull requests.
This PR doesn't include:
Benchmarks
Benchmarking was done on 1xA10G with the TinyLlama model (1.1B parameters). Quantization config:
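The config itself didn't survive extraction, so here is a hypothetical sketch of the knobs a GPTQ quantization config typically carries; the `bits` and `group_size` values below are illustrative, not the ones used in this PR:

```python
# Illustrative GPTQ-style quantization config; values are NOT from this PR.
GPTQ_CONFIG = {
    "bits": 4,          # weight precision: GPTQ commonly uses 2, 3, 4, or 8 bits
    "group_size": 128,  # number of weights sharing one scale/zero-point (-1 = per-column)
    "desc_act": False,  # activation-order quantization: slower, slightly more accurate
}


def validate_gptq_config(cfg: dict) -> None:
    """Reject obviously invalid GPTQ settings (sketch of a sanity check)."""
    if cfg["bits"] not in (2, 3, 4, 8):
        raise ValueError(f"unsupported bit width: {cfg['bits']}")
    if cfg["group_size"] != -1 and cfg["group_size"] <= 0:
        raise ValueError("group_size must be positive or -1 (per-column)")
```

Smaller `group_size` values spend more memory on scales but usually recover more accuracy; `-1` means one scale per weight column.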
There are two tables: one for the prefill stage and one for the new-token generation stage.
Prefill was simulated by feeding the first 1024 samples from the Alpaca dataset into the model, one sample at a time, and averaging the result across them.
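The averaging described above boils down to a simple timing loop; this is a hypothetical helper, not code from the PR, with `run` standing in for a model forward pass:

```python
import time


def mean_latency(run, samples):
    """Call `run` once per sample and return the mean wall-clock latency in seconds.

    `run` is a stand-in for a model forward pass; batch size is 1, matching
    the one-sample-at-a-time setup described above.
    """
    total = 0.0
    for sample in samples:
        start = time.perf_counter()
        run(sample)
        total += time.perf_counter() - start
    return total / len(samples)
```

In a real benchmark you would also add a warm-up pass and, on GPU, synchronize the device before reading the clock, since CUDA kernels launch asynchronously.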
New-token generation was measured by generating 100 new tokens, repeated 100 times. The default prompt from generate/base.py was used.
Prefill
New tokens generation
*Most likely these kernels are optimized for the A100, which might explain the unimpressive results and low utilization.
Here one can find benchmarks made by the HF team. They also show that the Marlin kernel turns out to be the fastest, though not as fast as expected.
Note
The Marlin kernel only supports graphics cards with compute capability >= 8.0. Here one can find a table of graphics cards and their compute capabilities.
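That gating can be expressed as a small check. With PyTorch the `(major, minor)` tuple would come from `torch.cuda.get_device_capability()`; the helper name here is illustrative:

```python
def supports_marlin(capability):
    """Return True if a GPU's compute capability (major, minor) can run Marlin.

    Marlin requires compute capability >= 8.0, i.e. Ampere or newer
    (A100 = (8, 0), A10G = (8, 6)); a T4 at (7, 5) does not qualify.
    """
    major, minor = capability
    return (major, minor) >= (8, 0)
```

Tuple comparison is lexicographic, so `(8, 6) >= (8, 0)` and `(9, 0) >= (8, 0)` both hold while `(7, 5)` does not. The A10G used for the benchmarks above is compute capability 8.6, so Marlin is usable there.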
Caveats: