Add AWQ 4bit inference support #2103

Merged 9 commits into lm-sys:main on Aug 1, 2023

Conversation

@ys-2020 (Contributor) commented on Jul 28, 2023

Why are these changes needed?

This PR introduces AWQ 4-bit inference to FastChat. AWQ provides efficient and accurate low-bit weight quantization for LLMs, which lets larger language models fit within device memory constraints and significantly accelerates token generation. For example, when running LLaMA-2-7B, AWQ delivers 2.3x and 1.4x speedups over the FP16 baseline on an RTX 4090 and a Jetson Orin, respectively.
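
For context, below is a minimal sketch of group-wise 4-bit weight-only quantization, the general mechanism that this kind of inference path builds on. It is not the PR's code: the function names (`quantize_4bit_groupwise`, `dequantize_4bit_groupwise`) and the `group_size=128` choice are illustrative assumptions, and AWQ's activation-aware scaling of salient weight channels is omitted here.

```python
# Illustrative sketch (not FastChat/AWQ source): group-wise 4-bit weight-only
# quantization with one scale and zero-point per group of input channels.
import torch

def quantize_4bit_groupwise(w: torch.Tensor, group_size: int = 128):
    """Quantize a 2-D weight matrix to 4-bit integers, one scale/zero per group."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    w_groups = w.reshape(out_features, in_features // group_size, group_size)

    # 4 bits -> integer range [0, 15]; compute per-group scale and zero-point.
    w_max = w_groups.amax(dim=-1, keepdim=True)
    w_min = w_groups.amin(dim=-1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-5) / 15.0
    zero = (-w_min / scale).round()

    q = (w_groups / scale + zero).round().clamp(0, 15).to(torch.uint8)
    return q, scale, zero

def dequantize_4bit_groupwise(q, scale, zero, shape):
    """Recover an approximate floating-point weight matrix for inference."""
    w_groups = (q.float() - zero) * scale
    return w_groups.reshape(shape)

# Example: quantize a LLaMA-sized projection and check reconstruction error.
w = torch.randn(4096, 4096, dtype=torch.float32)
q, scale, zero = quantize_4bit_groupwise(w, group_size=128)
w_hat = dequantize_4bit_groupwise(q, scale, zero, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

Storing only the 4-bit integers plus per-group scales and zero-points is what cuts the weight memory footprint to roughly a quarter of FP16 and reduces memory bandwidth during token generation.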

Checks

  • I've run format.sh to lint the changes in this PR.
  • I've included any doc changes needed.
  • I've made sure the relevant tests are passing (if applicable).

@merrymercy merged commit 6680b68 into lm-sys:main on Aug 1, 2023