DeepSeek MTP model uses the lm_head and embedding from the main model #2790
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request correctly implements the sharing of lm_head and embed_tokens for the Deepseek MTP model, which will result in significant GPU memory savings. The changes in vllm_ascend/torchair/models/torchair_deepseek_mtp.py correctly modify the MTP model to not initialize its own lm_head and embed_tokens, and to skip loading their weights.
However, I've found a critical issue in vllm_ascend/spec_decode/mtp_proposer.py where the shared embed_tokens module is assigned incorrectly. This would lead to a runtime error. I've provided a detailed comment and a code suggestion to fix this. Once that is addressed, this PR should be good to go.
for layer_name, layer_module in self.model.model.layers.items():
    layer_module.embed_tokens = main_embed_tokens
    layer_module.shared_head.head = main_lm_head
There appears to be a bug in how the shared embed_tokens module is assigned. The code assigns main_embed_tokens to layer_module.embed_tokens for each layer in a loop. However, the embed_tokens module is an attribute of the parent TorchairDeepSeekMultiTokenPredictor (or CustomDeepSeekMultiTokenPredictor) module, not the individual layer modules. The forward method of the predictor calls self.embed_tokens, which will be None as it is never updated, leading to a TypeError.
The embed_tokens should be assigned to self.model.model.embed_tokens once, before the loop. The loop is only necessary for assigning the main_lm_head to each layer's shared_head.head. I've also simplified the loop to use .values() as the key is not used.
Suggested change:
- for layer_name, layer_module in self.model.model.layers.items():
-     layer_module.embed_tokens = main_embed_tokens
-     layer_module.shared_head.head = main_lm_head
+ self.model.model.embed_tokens = main_embed_tokens
+ for layer_module in self.model.model.layers.values():
+     layer_module.shared_head.head = main_lm_head
| if "rotary_emb.inv_freq" in name: | ||
| continue | ||
| # The mtp model uses the lm_head and embedding from the main model. | ||
| if "shared_head" in name: |
Added code to skip loading the lm_head and embedding weights; the only way to do this is to override load_weights.
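A minimal sketch of that skip logic, assuming a hypothetical filter_mtp_weights helper (the actual PR overrides load_weights on the MTP model itself, and the exact name markers may differ):

from typing import Iterable, Tuple
import torch

# Markers assumed for illustration; they mirror the diff above.
SHARED_WEIGHT_MARKERS = ("shared_head.head", "embed_tokens")

def filter_mtp_weights(
        weights: Iterable[Tuple[str, torch.Tensor]]
) -> Iterable[Tuple[str, torch.Tensor]]:
    """Yield only the checkpoint entries the MTP model should load itself."""
    for name, tensor in weights:
        if "rotary_emb.inv_freq" in name:
            continue
        # The MTP model reuses the main model's lm_head and embedding,
        # so their checkpoint entries are skipped instead of loaded.
        if any(marker in name for marker in SHARED_WEIGHT_MARKERS):
            continue
        yield name, tensor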
Force-pushed from a6f8814 to 04f1c3c.
Codecov Report

❌ Your patch check has failed because the patch coverage (37.31%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2790      +/-   ##
==========================================
- Coverage   72.99%   72.50%   -0.50%
==========================================
  Files         153      153
  Lines       21331    21340       +9
==========================================
- Hits        15571    15472      -99
- Misses       5760     5868     +108

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from 028548a to 4659fc7.
process_weights_after_loading(self.model, draft_model_config,
                              target_device)

main_embed_tokens = main_model.model.embed_tokens
Could the absence of the redundant lm_head invocation in the dummy run lead to deadlocks in DP load-imbalance scenarios?
Force-pushed from 4659fc7 to 8ab2239.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 8ab2239 to 4f2c0e6.
Force-pushed from 4f2c0e6 to 7808c35.
Signed-off-by: zzhxx <2783294813@qq.com>
@ApsarasX please add a ready-to-test label
process_weights_after_loading(self.model, draft_model_config,
                              target_device)

# use main model's embedding and LMhead
Let's wait for #3528 to merge first. Then the MTP change can go in the torchair module as well.
# use main model's embedding and LMhead
self.model.model.embed_tokens = main_model.model.embed_tokens
for layer_module in self.model.model.layers.values():
    layer_module.shared_head.head = main_model.lm_head
In the quantized DeepSeek weights provided by the Ascend community that address function call accuracy, the weight values of lm_head and shared_head differ numerically. When using these new quantized weights, this code line will result in precision degradation.
Please check whether there is an elegant way to distinguish these two weights. @wangxiyuan
Comparing lm_head.weight (file: quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors) vs model.layers.61.shared_head.head.weight (file: quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors):
max diff = 0.000000, percentage error = 0.0000%, within 1e-4 = True
Comparing model.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00001-of-00162.safetensors) vs model.layers.61.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00162-of-00162.safetensors):
max diff = 0.000000, percentage error = 0.0000%, within 1e-4 = True
I tested the DeepSeek model files myself and found that the weights of lm_head and shared_head are exactly the same, and the embedding weights are also identical.
It has also been tested without affecting the acceptance rate of the MTP layer.
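For reference, a rough sketch of the kind of offline comparison quoted above, assuming the safetensors shard and tensor names from the quantized DeepSeek checkpoint:

from safetensors.torch import load_file

def max_abs_diff(file_a: str, key_a: str, file_b: str, key_b: str) -> float:
    # Load both tensors from their safetensors shards and compare elementwise.
    a = load_file(file_a)[key_a].float()
    b = load_file(file_b)[key_b].float()
    return (a - b).abs().max().item()

diff = max_abs_diff(
    "quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors",
    "lm_head.weight",
    "quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors",
    "model.layers.61.shared_head.head.weight",
)
print(f"max diff = {diff:.6f}, within 1e-4 = {diff < 1e-4}")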
https://modelers.cn/models/Modelers_Park/DeepSeek-V3.1-w8a8-function_call
ModelSlim has added a new quantization option --rot to fix function-calling accuracy for quantized DeepSeek-like models. Re-using the lm_head and shared_head would lead to acceptance-rate degradation for deepseek_r1 (~2%, which can be ignored). The developers of ModelSlim do not recommend the re-use behaviour because of robustness issues. Could you please add a verification here to check the numerical values of the lm_head and shared_head weights?
As far as I know, MindIE adds a weight validation here to determine whether the lm_head of the non-MTP part can be re-used.
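One possible shape such a verification could take (a sketch, not the PR's code; it assumes the MTP copies of the weights are still available at load time so they can be compared before being discarded):

import torch

def can_share(main_weight: torch.Tensor, mtp_weight: torch.Tensor,
              atol: float = 1e-4) -> bool:
    # Only reuse the main model's weight when the MTP checkpoint value
    # matches it numerically; otherwise keep the MTP model's own copy.
    if main_weight.shape != mtp_weight.shape:
        return False
    return torch.allclose(main_weight.float(), mtp_weight.float(), atol=atol)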
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
The DeepSeek technical report mentions that the embedding and lm_head layers of the MTP layer are shared with the main model, but the current implementation loads its own complete copies of the embedding and lm_head. In the DeepSeek-R1 model, each of these weights has shape 129280 × 7168, which is about 1.72 GB in fp16.
This PR changes the MTP layer to reuse the lm_head and embedding of the main model, saving about 3.45 GB of GPU memory in the pure DP scenario.
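A quick back-of-the-envelope check of those numbers (fp16 = 2 bytes per element):

vocab_size, hidden_size = 129280, 7168
bytes_per_elem = 2  # fp16
per_weight_gib = vocab_size * hidden_size * bytes_per_elem / 1024**3
total_gib = 2 * per_weight_gib  # embedding + lm_head
print(f"{per_weight_gib:.2f} GiB per weight, {total_gib:.2f} GiB total")
# ~1.73 GiB per weight, ~3.45 GiB total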