Add AWQ 4bit inference support #2103

Merged 9 commits into lm-sys:main on Aug 1, 2023

Conversation

@ys-2020 (Contributor) commented on Jul 28, 2023

Why are these changes needed?

This PR introduces AWQ 4-bit inference to FastChat. AWQ provides efficient and accurate low-bit weight quantization for LLMs, which lets larger language models fit within device memory constraints and significantly accelerates token generation. For example, when running LLaMA-2-7B, AWQ delivers 2.3x and 1.4x speedups over the FP16 baseline on an RTX 4090 and a Jetson Orin, respectively.
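
For context, below is a minimal sketch of group-wise 4-bit weight-only quantization, the general mechanism that this kind of inference path builds on. It is not the PR's code: the function names (`quantize_4bit_groupwise`, `dequantize_4bit_groupwise`) and the `group_size=128` choice are illustrative assumptions, and AWQ's activation-aware scaling of salient weight channels is omitted here.

```python
# Illustrative sketch (not FastChat/AWQ source): group-wise 4-bit weight-only
# quantization with one scale and zero-point per group of input channels.
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2-D weight matrix to 4-bit integers, one scale/zero per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_groups = w.reshape(out_features, in_features // group_size, group_size)

    # 4 bits -> integer range [0, 15]; compute per-group scale and zero-point.
    w_max = w_groups.amax(dim=-1, keepdim=True)
    w_min = w_groups.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / 15.0
    zero = (-w_min / scale).round()

    q = (w_groups / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit_groupwise(q, scale, zero, shape):
    """Recover an approximate floating-point weight matrix for inference."""
    w_groups = (q.float() - zero) * scale
    return w_groups.reshape(shape)

# Example: quantize a LLaMA-sized projection and check reconstruction error.
w = torch.randn(4096, 4096, dtype=torch.float32)
q, scale, zero = quantize_4bit_groupwise(w, group_size=128)
w_hat = dequantize_4bit_groupwise(q, scale, zero, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

Storing only the 4-bit integers plus per-group scales and zero-points is what cuts the weight memory footprint to roughly a quarter of FP16 and reduces memory bandwidth during token generation.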

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

@merrymercy merged commit 6680b68 into lm-sys:main on Aug 1, 2023