
Conversation

@lengrongfu
Contributor

@lengrongfu lengrongfu commented May 29, 2025

FIX #18885

Test success:

#18885 (comment)

(screenshot of the successful test run)

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
@lengrongfu lengrongfu force-pushed the feat/new1-use-autoweights branch from 774b2d1 to 33ba0ad on May 30, 2025 02:49
@lengrongfu lengrongfu marked this pull request as ready for review May 30, 2025 02:51
@lengrongfu
Contributor Author

@DarkLight1337 please take a look, thanks ~

@DarkLight1337
Member

I'll leave the review to @mgoin who is more qualified

Member

@mgoin mgoin left a comment

Seems reasonable to me, thanks!

@mgoin mgoin added the quantization and ready (ONLY add when PR is ready to merge/full CI is needed) labels on May 30, 2025
@mgoin mgoin enabled auto-merge (squash) May 30, 2025 12:32
@mgoin mgoin merged commit 7f21e80 into vllm-project:main May 30, 2025
79 checks passed
amitm02 pushed a commit to amitm02/vllm that referenced this pull request Jun 1, 2025
Signed-off-by: rongfu.leng <rongfu.leng@daocloud.io>
Signed-off-by: amit <amit.man@gmail.com>
@Rorschaaaach

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"

Here is my vLLM launch command:
CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384

And here is the API call I'm making:
curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }'
I’d like to ask if you encountered the same issue during your testing?
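
For readers who prefer Python over curl, an equivalent request through the OpenAI-compatible client might look like the sketch below; the base URL, model name, and prompt are placeholders rather than values from this thread.

from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server (placeholder host/port).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="MODEL_PATH",  # served model name or path (placeholder)
    messages=[{"role": "user", "content": "Who are you?"}],
    stream=False,
)
print(resp.choices[0].message.content)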

@lengrongfu
Contributor Author

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"

Here is my vLLM launch command: CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384

And here is the API call I'm making: curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }' I’d like to ask if you encountered the same issue during your testing?

(screenshot of the test output)

My test did not run into this issue; could you please provide more information on how I can reproduce your problem?

@Rorschaaaach

Hi, I modified my vLLM code based on this submission. The model seems to be deployed successfully, but when I try to use it, it only responds with "!!!!!!"
Here is my vLLM launch command: CUDA_VISIBLE_DEVICES=0 vllm serve MODEL_PATH --port xxxx --max-model-len 16384
And here is the API call I'm making: curl http://xxxx:xxx/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "XXXX", "messages": [{"role": "user", "content": "你是谁"}], "stop": null, "stream": false }' I’d like to ask if you encountered the same issue during your testing?

(screenshot of the test output)

My test did not run into this issue; could you please provide more information on how I can reproduce your problem?

I used AutoRound for quantization. Here is my code:

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-32B-Instruct",
    attn_implementation="flash_attention_2",
    device_map="cuda",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

# `data` is the calibration dataset and `output_dir` the export directory (defined elsewhere).
autoround = AutoRound(
    model,
    tokenizer,
    dataset=data,
    seqlen=4096,
    nsamples=128,
    batch_size=16,
    low_gpu_mem_usage=True,
    bits=4,
    group_size=-1,
    sym=False,
)

autoround.quantize_and_save(output_dir, format="auto_awq")
Are you using AutoAWQ for quantization? Could you share your quantization command with me?
I'd like to try it out and see whether the issue is caused by the quantization framework.
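
For readers following along, a minimal offline sanity check of the exported checkpoint with vLLM could look like the sketch below; the checkpoint path, max_model_len, and prompt are placeholders, and vLLM should pick up the quantization method from the checkpoint's config.

from vllm import LLM, SamplingParams

# Load the directory written by quantize_and_save above (placeholder path).
llm = LLM(model="output_dir", max_model_len=16384)
params = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["Who are you?"], params)
# A healthy checkpoint should return coherent text rather than "!!!!".
print(outputs[0].outputs[0].text)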

@lengrongfu
Contributor Author

#18885 (comment)

@Rorschaaaach

#18885 (comment)

After following the content referenced there, the model answered correctly. However, I found that the speed of different models after quantization varies greatly.

This is the speed of the unquantized Qwen3-0.6B model:
(screenshot)
This is the speed of the group_size=-1 AWQ Qwen3-0.6B model:
(screenshot)
This is the speed of the unquantized Qwen2.5-32B model:
(screenshot)
This is the speed of the group_size=-1 AWQ Qwen2.5-32B model:
(screenshot)

Tested on an A800 80GB.
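
For anyone reproducing these numbers, one way to get a comparable output-tokens-per-second figure for each checkpoint is a small offline timing script like the sketch below; the model path, prompt, batch size, and lengths are illustrative assumptions rather than values from this thread.

import sys
import time

from vllm import LLM, SamplingParams

# Run once per model, e.g.:  python measure_tps.py Qwen/Qwen3-0.6B
model_path = sys.argv[1]
llm = LLM(model=model_path, max_model_len=4096)
params = SamplingParams(temperature=0.0, max_tokens=256)

# Fixed batch of identical prompts so each run does comparable work.
prompts = ["Summarize the history of the Internet."] * 64
start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{model_path}: {generated / elapsed:.1f} output tokens/s")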

@lengrongfu
Contributor Author

@Rorschaaaach Is your problem that the model doesn't work properly or that the model performance is slow?

@Rorschaaaach

@Rorschaaaach Is your problem that the model doesn't work properly or that the model performance is slow?

My model didn't work properly at first, but after I re-quantized it according to your instructions, it worked properly.

Now that it works, I find that the model is very slow to answer.


Labels

quantization, ready (ONLY add when PR is ready to merge/full CI is needed)

Development

Successfully merging this pull request may close these issues.

[Bug]: AWQ INT4 Model with group_size=-1 throws exception while gptq format is fine
