
Conversation

@lengrongfu
Contributor

@lengrongfu lengrongfu commented May 29, 2025

FIX #18885

Test success:

#18885 (comment)

(screenshot of the successful test run)

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
@lengrongfu lengrongfu force-pushed the feat/new1-use-autoweights branch from 774b2d1 to 33ba0ad on May 30, 2025 02:49
@lengrongfu lengrongfu marked this pull request as ready for review May 30, 2025 02:51
@lengrongfu
Contributor Author

@DarkLight1337 please take a look, thanks ~

@DarkLight1337
Member

I'll leave the review to @mgoin who is more qualified

Member

@mgoin mgoin left a comment

Seems reasonable to me, thanks!

@mgoin mgoin added the quantization and ready (ONLY add when PR is ready to merge/full CI is needed) labels on May 30, 2025
@mgoin mgoin enabled auto-merge (squash) May 30, 2025 12:32
@mgoin mgoin merged commit 7f21e80 into vllm-project:main May 30, 2025
79 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: amit <amit.man@gmail.com>
@Rorschaaaach

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"

Here is my vLLM launch command:
CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384

And here is the API call I'm making:
curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }'
I’d like to ask if you encountered the same issue during your testing?
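
For readers who prefer Python over curl, an equivalent request through the OpenAI-compatible client might look like the sketch below; the base URL, model name, and prompt are placeholders rather than values from this thread.

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server (placeholder host/port).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MODEL_PATH",  # served model name or path (placeholder)
    messages=[{"role": "user", "content": "Who are you?"}],
    stream=False,
)
print(resp.choices[0].message.content)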

@lengrongfu
Contributor Author

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"

Here is my vLLM launch command: CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384

And here is the API call I'm making: curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }' I’d like to ask if you encountered the same issue during your testing?

(screenshot of the test output)

My test did not run into this issue; could you please provide more information on how I can reproduce your problem?

@Rorschaaaach

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"
Here is my vLLM launch command: CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384
And here is the API call I'm making: curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }' I’d like to ask if you encountered the same issue during your testing?

(screenshot of the test output)

My test did not run into this issue; could you please provide more information on how I can reproduce your problem?

I used AutoRound for quantization. Here is my code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    attn_implementation="flash_attention_2",
    device_map="cuda",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# `data` is the calibration dataset and `output_dir` the export directory (defined elsewhere).
autoround = AutoRound(
    model,
    tokenizer,
    dataset=data,
    seqlen=4096,
    nsamples=128,
    batch_size=16,
    low_gpu_mem_usage=True,
    bits=4,
    group_size=-1,
    sym=False,
)

autoround.quantize_and_save(output_dir, format="auto_awq")
Are you using AutoAWQ for quantization? Could you share your quantization command with me?
I'd like to try it out and see whether the issue is caused by the quantization framework.
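
For readers following along, a minimal offline sanity check of the exported checkpoint with vLLM could look like the sketch below; the checkpoint path, max_model_len, and prompt are placeholders, and vLLM should pick up the quantization method from the checkpoint's config.

from vllm import LLM, SamplingParams

# Load the directory written by quantize_and_save above (placeholder path).
llm = LLM(model="output_dir", max_model_len=16384)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Who are you?"], params)
# A healthy checkpoint should return coherent text rather than "!!!!".
print(outputs[0].outputs[0].text)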

@lengrongfu
Contributor Author

#18885 (comment)

@Rorschaaaach

#18885 (comment)

After following the content referenced there, the model answered correctly. However, I found that the speed of different models after quantization varies greatly.

This is the speed of the unquantized Qwen3-0.6B model:
(screenshot)
This is the speed of the group_size=-1 AWQ Qwen3-0.6B model:
(screenshot)
This is the speed of the unquantized Qwen2.5-32B model:
(screenshot)
This is the speed of the group_size=-1 AWQ Qwen2.5-32B model:
(screenshot)

Tested on an A800 80GB.
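
For anyone reproducing these numbers, one way to get a comparable output-tokens-per-second figure for each checkpoint is a small offline timing script like the sketch below; the model path, prompt, batch size, and lengths are illustrative assumptions rather than values from this thread.

import sys
import time

from vllm import LLM, SamplingParams

# Run once per model, e.g.:  python measure_tps.py Qwen/Qwen3-0.6B
model_path = sys.argv[1]
llm = LLM(model=model_path, max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=256)

# Fixed batch of identical prompts so each run does comparable work.
prompts = ["Summarize the history of the Internet."] * 64
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{model_path}: {generated / elapsed:.1f} output tokens/s")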

@lengrongfu
Contributor Author

@Rorschaaaach Is your problem that the model doesn't work properly or that the model performance is slow?

@Rorschaaaach

@Rorschaaaach Is your problem that the model doesn't work properly or that the model performance is slow?

My model didn't work properly at first, but after I re-quantized it according to your instructions, it worked properly.

Now that it works, I find that the model is very slow to answer.


Labels

quantization, ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Bug]: AWQ INT4 Model with group_size=-1 throws exception while gptq format is fine
