
Conversation

ReinForce-II (Contributor) commented Oct 14, 2025

Purpose

Solves #26788: remove unused parameters to reduce unnecessary VRAM usage.

Test Plan

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve RedHatAI/Qwen3-30B-A3B-NVFP4
lm_eval --model local-completions \
  --model_args model=RedHatAI/Qwen3-30B-A3B-NVFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --batch_size auto

Test Result

Before

(EngineCore_DP0 pid=6016) INFO 10-14 08:10:58 [default_loader.py:267] Loading weights took 7.45 seconds
(EngineCore_DP0 pid=6016) INFO 10-14 08:10:59 [gpu_model_runner.py:2653] Model loading took 25.9087 GiB and 8.627268 seconds
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:08 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/00507ceb63/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:08 [backends.py:559] Dynamo bytecode transform time: 8.67 s
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:13 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:59 [backends.py:218] Compiling a graph for dynamic shape takes 5.96 s
(EngineCore_DP0 pid=6016) INFO 10-14 08:17:26 [monitor.py:34] torch.compile takes 9.64 s in total
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8795|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8741|±  |0.0091|

After

(EngineCore_DP0 pid=22452) INFO 10-14 08:38:37 [default_loader.py:267] Loading weights took 7.38 seconds
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:38 [gpu_model_runner.py:2653] Model loading took 16.9082 GiB and 8.022839 seconds
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:46 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9487058f29/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:46 [backends.py:559] Dynamo bytecode transform time: 8.51 s
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:50 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.216 s
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:51 [monitor.py:34] torch.compile takes 8.51 s in total
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8863|±  |0.0087|
|     |       |strict-match    |     5|exact_match|↑  |0.8832|±  |0.0088|


Signed-off-by: Reinforce-II <fate@eastal.com>
gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix to reduce VRAM usage by deleting unused model parameters after they have been processed. Specifically, in CompressedTensorsW4A4MoeMethod, the w13_weight_packed and w2_weight_packed parameters are deleted after their data has been used to initialize w13_weight and w2_weight respectively. This is a correct and important optimization, as it allows the memory for the original packed tensors to be reclaimed, especially when the weights are subsequently reordered or repacked, which creates new tensors. The provided test results clearly demonstrate a significant reduction in VRAM consumption, confirming the effectiveness of this change. The implementation is clean and directly addresses the issue of unnecessary memory retention.
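
To make the mechanism concrete, here is a minimal sketch of the pattern (not the actual vLLM code: the parameter names follow the review above, and a `.clone()` stands in for the real reorder/repack step):

```python
import torch
from torch import nn


def process_weights_after_loading(layer: nn.Module) -> None:
    """Sketch of the fix: repack loaded weights, then free the originals."""
    # Repacking produces *new* tensors; without the deletions below, the
    # original packed parameters would stay resident in VRAM for the
    # lifetime of the model.
    layer.w13_weight = nn.Parameter(
        layer.w13_weight_packed.data.clone(), requires_grad=False
    )
    layer.w2_weight = nn.Parameter(
        layer.w2_weight_packed.data.clone(), requires_grad=False
    )
    # The fix: delete the now-unused packed parameters so their memory
    # can be reclaimed (roughly 9 GiB for this model, per the logs above).
    del layer.w13_weight_packed
    del layer.w2_weight_packed
```

Since `nn.Module.__delattr__` unregisters a parameter, the deletion drops the last reference to each packed tensor, letting its VRAM be freed.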

yewentao256 (Member) left a comment

LGTM, do you know why we see an accuracy improvement?

ReinForce-II (Author) commented Oct 14, 2025

do you know why we see an accuracy improvement?

This is random run-to-run variation; repeated evaluations fluctuate by about 0.01.

yewentao256 (Member) left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 14, 2025
BenasdTW commented
This solves the issue! Thanks!

@yewentao256 yewentao256 enabled auto-merge (squash) October 19, 2025 17:51
@vllm-bot vllm-bot merged commit 980de31 into vllm-project:main Oct 22, 2025
52 of 54 checks passed
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
