
Conversation

ReinForce-II (Contributor) commented Oct 14, 2025

Purpose

Solves #26788: remove unused parameters to reduce unnecessary VRAM usage.

Test Plan

VLLM_USE_FLASHINFER_MOE_FP4=1 vllm serve RedHatAI/Qwen3-30B-A3B-NVFP4
lm_eval --model local-completions \
  --model_args model=RedHatAI/Qwen3-30B-A3B-NVFP4,base_url=http://localhost:8000/v1/completions,num_concurrent=64,max_retries=3,tokenized_requests=False \
  --tasks gsm8k --batch_size auto

Test Result

Before

(EngineCore_DP0 pid=6016) INFO 10-14 08:10:58 [default_loader.py:267] Loading weights took 7.45 seconds
(EngineCore_DP0 pid=6016) INFO 10-14 08:10:59 [gpu_model_runner.py:2653] Model loading took 25.9087 GiB and 8.627268 seconds
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:08 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/00507ceb63/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:08 [backends.py:559] Dynamo bytecode transform time: 8.67 s
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:13 [backends.py:197] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=6016) INFO 10-14 08:11:59 [backends.py:218] Compiling a graph for dynamic shape takes 5.96 s
(EngineCore_DP0 pid=6016) INFO 10-14 08:17:26 [monitor.py:34] torch.compile takes 9.64 s in total
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8795|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8741|±  |0.0091|

After

(EngineCore_DP0 pid=22452) INFO 10-14 08:38:37 [default_loader.py:267] Loading weights took 7.38 seconds
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:38 [gpu_model_runner.py:2653] Model loading took 16.9082 GiB and 8.022839 seconds
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:46 [backends.py:548] Using cache directory: /root/.cache/vllm/torch_compile_cache/9487058f29/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:46 [backends.py:559] Dynamo bytecode transform time: 8.51 s
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:50 [backends.py:164] Directly load the compiled graph(s) for dynamic shape from the cache, took 3.216 s
(EngineCore_DP0 pid=22452) INFO 10-14 08:38:51 [monitor.py:34] torch.compile takes 8.51 s in total
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8863|±  |0.0087|
|     |       |strict-match    |     5|exact_match|↑  |0.8832|±  |0.0088|


Signed-off-by: Reinforce-II <fate@eastal.com>
gemini-code-assist bot left a comment

Code Review

This pull request introduces a fix to reduce VRAM usage by deleting unused model parameters after they have been processed. Specifically, in CompressedTensorsW4A4MoeMethod, the w13_weight_packed and w2_weight_packed parameters are deleted after their data has been used to initialize w13_weight and w2_weight respectively. This is a correct and important optimization, as it allows the memory for the original packed tensors to be reclaimed, especially when the weights are subsequently reordered or repacked, which creates new tensors. The provided test results clearly demonstrate a significant reduction in VRAM consumption, confirming the effectiveness of this change. The implementation is clean and directly addresses the issue of unnecessary memory retention.
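
To make the mechanism concrete, here is a minimal sketch of the pattern (not the actual vLLM code: the parameter names follow the review above, and a `.clone()` stands in for the real reorder/repack step):

```python
import torch
from torch import nn


def process_weights_after_loading(layer: nn.Module) -> None:
    """Sketch of the fix: repack loaded weights, then free the originals."""
    # Repacking produces *new* tensors; without the deletions below, the
    # original packed parameters would stay resident in VRAM for the
    # lifetime of the model.
    layer.w13_weight = nn.Parameter(
        layer.w13_weight_packed.data.clone(), requires_grad=False
    )
    layer.w2_weight = nn.Parameter(
        layer.w2_weight_packed.data.clone(), requires_grad=False
    )
    # The fix: delete the now-unused packed parameters so their memory
    # can be reclaimed (roughly 9 GiB for this model, per the logs above).
    del layer.w13_weight_packed
    del layer.w2_weight_packed
```

Since `nn.Module.__delattr__` unregisters a parameter, the deletion drops the last reference to each packed tensor, letting its VRAM be freed.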

yewentao256 (Member) left a comment

LGTM, do you know why we see an accuracy improvement?

ReinForce-II (Author) commented Oct 14, 2025

do you know why we see an accuracy improvement?

This is random run-to-run variation; repeated evaluations fluctuate by about 0.01.

yewentao256 (Member) left a comment

LGTM, thanks for the work!

@yewentao256 yewentao256 added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 14, 2025
BenasdTW commented
This solves the issue! Thanks!

@yewentao256 yewentao256 enabled auto-merge (squash) October 19, 2025 17:51
@vllm-bot vllm-bot merged commit 980de31 into vllm-project:main Oct 22, 2025
52 of 54 checks passed
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
ilmarkov pushed a commit to neuralmagic/vllm that referenced this pull request Nov 7, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
rtourgeman pushed a commit to rtourgeman/vllm that referenced this pull request Nov 10, 2025
…llm-project#26789)

Signed-off-by: Reinforce-II <fate@eastal.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
