
Conversation

@zzhx1
Contributor

@zzhx1 zzhx1 commented Sep 5, 2025

What this PR does / why we need it?

The DeepSeek technical report mentions that the embedding and lm_head layers of the MTP module are shared with the main model, but the current implementation independently loads its own complete copies of the embedding and lm_head. In the DeepSeek-R1 model, each of these weights has shape 129280×7168, which is about 1.72 GB in fp16.

This PR changes the MTP layer to reuse the main model's lm_head and embedding, saving about 3.45 GB of GPU memory in the pure DP scenario.
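For reference, a back-of-the-envelope check of those numbers (a minimal sketch; the vocab/hidden sizes and fp16 dtype are the ones quoted above):

    # Rough size of the duplicated weights: vocab 129280, hidden 7168, fp16 = 2 bytes/element.
    vocab_size, hidden_size, bytes_per_elem = 129280, 7168, 2
    one_matrix_gb = vocab_size * hidden_size * bytes_per_elem / 1024**3
    print(f"embedding or lm_head copy: {one_matrix_gb:.2f} GB")    # ≈1.73 GB
    print(f"both duplicated copies: {2 * one_matrix_gb:.2f} GB")   # ≈3.45 GB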

@github-actions

github-actions bot commented Sep 5, 2025

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly implements the sharing of lm_head and embed_tokens for the Deepseek MTP model, which will result in significant GPU memory savings. The changes in vllm_ascend/torchair/models/torchair_deepseek_mtp.py correctly modify the MTP model to not initialize its own lm_head and embed_tokens, and to skip loading their weights.

However, I've found a critical issue in vllm_ascend/spec_decode/mtp_proposer.py where the shared embed_tokens module is assigned incorrectly. This would lead to a runtime error. I've provided a detailed comment and a code suggestion to fix this. Once that is addressed, this PR should be good to go.

Comment on lines 105 to 107
for layer_name, layer_module in self.model.model.layers.items():
    layer_module.embed_tokens = main_embed_tokens
    layer_module.shared_head.head = main_lm_head
Contributor

critical

There appears to be a bug in how the shared embed_tokens module is assigned. The code assigns main_embed_tokens to layer_module.embed_tokens for each layer in a loop. However, the embed_tokens module is an attribute of the parent TorchairDeepSeekMultiTokenPredictor (or CustomDeepSeekMultiTokenPredictor) module, not the individual layer modules. The forward method of the predictor calls self.embed_tokens, which will be None as it is never updated, leading to a TypeError.

The embed_tokens should be assigned to self.model.model.embed_tokens once, before the loop. The loop is only necessary for assigning the main_lm_head to each layer's shared_head.head. I've also simplified the loop to use .values() as the key is not used.

Suggested change
# before
for layer_name, layer_module in self.model.model.layers.items():
    layer_module.embed_tokens = main_embed_tokens
    layer_module.shared_head.head = main_lm_head
# after
self.model.model.embed_tokens = main_embed_tokens
for layer_module in self.model.model.layers.values():
    layer_module.shared_head.head = main_lm_head

if "rotary_emb.inv_freq" in name:
continue
# The mtp model uses the lm_head and embedding from the main model.
if "shared_head" in name:
Contributor Author

Added code to skip loading of lm_head and embed_tokens; this can only be done by overriding load_weights.
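A minimal sketch of the skip logic described here, assuming a plain torch.nn.Module with standard named_parameters and leaving out vLLM's per-parameter weight_loader machinery (the name prefixes are taken from the diff context above, not the final implementation):

    def load_weights(self, weights):
        # Sketch: skip checkpoint tensors that the MTP model now borrows from the main model.
        params = dict(self.named_parameters())
        loaded = set()
        for name, loaded_weight in weights:
            if "rotary_emb.inv_freq" in name:
                continue
            # The MTP model uses the lm_head and embedding from the main model,
            # so these tensors are never materialized here.
            if "shared_head.head" in name or "embed_tokens" in name:
                continue
            if name in params:
                params[name].data.copy_(loaded_weight)
                loaded.add(name)
        return loaded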

@codecov

codecov bot commented Sep 6, 2025

Codecov Report

❌ Patch coverage is 37.31343% with 42 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.50%. Comparing base (4c90fa7) to head (4659fc7).
⚠️ Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
...lm_ascend/torchair/models/torchair_deepseek_mtp.py | 33.92% | 37 Missing ⚠️
vllm_ascend/spec_decode/mtp_proposer.py | 16.66% | 5 Missing ⚠️

❌ Your patch check has failed because the patch coverage (37.31%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2790      +/-   ##
==========================================
- Coverage   72.99%   72.50%   -0.50%     
==========================================
  Files         153      153              
  Lines       21331    21340       +9     
==========================================
- Hits        15571    15472      -99     
- Misses       5760     5868     +108     
Flag | Coverage Δ
unittests | 72.50% <37.31%> (-0.50%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

process_weights_after_loading(self.model, draft_model_config,
                              target_device)

main_embed_tokens = main_model.model.embed_tokens
Collaborator

Could the absence of the redundant lm_head invocations in the dummy run lead to deadlocks in DP load-imbalance scenarios?

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zzhx1 added 2 commits October 17, 2025 15:56
Signed-off-by: zzhxx <2783294813@qq.com>
Signed-off-by: zzhxx <2783294813@qq.com>
@zzhx1
Contributor Author

zzhx1 commented Oct 20, 2025

@ApsarasX please add a ready-for-test label

@ApsarasX added the ready-for-test (start test by label for PR) label Oct 20, 2025
Signed-off-by: zzhxx <2783294813@qq.com>
process_weights_after_loading(self.model, draft_model_config,
                              target_device)

# use main model's embedding and LMhead
Collaborator

Let's wait for #3528 to merge first. Then the MTP change can be in the torchair module as well.

# use main model's embedding and LMhead
self.model.model.embed_tokens = main_model.model.embed_tokens
for layer_module in self.model.model.layers.values():
    layer_module.shared_head.head = main_model.lm_head
Collaborator

In the quantized DeepSeek weights provided by the Ascend community to address function-call accuracy, the weight values of lm_head and shared_head differ numerically. When using these new quantized weights, this line will cause precision degradation.

Please check whether there is an elegant way to recognize whether these two weights differ. @wangxiyuan

Contributor Author

Comparing lm_head.weight (file: quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors) vs model.layers.61.shared_head.head.weight (file: quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors):
    max difference = 0.000000, percentage error = 0.0000%, within 1e-4 tolerance = True
Comparing model.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00001-of-00162.safetensors) vs model.layers.61.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00162-of-00162.safetensors):
    max difference = 0.000000, percentage error = 0.0000%, within 1e-4 tolerance = True

I tested the DeepSeek model files myself and found that the weights of lm_head and shared_head are exactly the same, and the embedding weights are also identical. This has also been tested and does not affect the acceptance rate of the MTP layer.
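A minimal sketch of this kind of comparison, assuming the shard file names and tensor names from the output above and using the safetensors Python API (an illustration, not the exact script used):

    import torch
    from safetensors import safe_open

    def load_tensor(path, name):
        # Read a single tensor from a safetensors shard.
        with safe_open(path, framework="pt") as f:
            return f.get_tensor(name)

    lm_head = load_tensor("quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors",
                          "lm_head.weight")
    shared_head = load_tensor("quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors",
                              "model.layers.61.shared_head.head.weight")

    diff = (lm_head.float() - shared_head.float()).abs().max().item()
    print(f"max difference = {diff:.6f}, within 1e-4 = {diff <= 1e-4}")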

Collaborator

https://modelers.cn/models/Modelers_Park/DeepSeek-V3.1-w8a8-function_call
ModelSlim has added a new quantization option, --rot, to fix function-calling accuracy for quantized DeepSeek-like models, and re-using the lm_head as the shared_head would lead to an acceptance-rate degradation for DeepSeek-R1 (~2%, which can be ignored). The ModelSlim developers do not recommend this re-use behaviour because of robustness issues. Could you please add a verification here that checks the numerical values of the lm_head and shared_head weights?
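A minimal sketch of such a verification, assuming the MTP shared_head weights are still loaded at this point so they can be compared before being replaced (the helper name, tolerance, and attribute paths mirror the snippet under review but are assumptions, not the final implementation):

    import torch

    def can_share_lm_head(main_model, mtp_model, atol=1e-4):
        # Hypothetical check: only re-use the main model's lm_head if every MTP
        # layer's shared_head weight matches it numerically.
        main_weight = main_model.lm_head.weight.data.float()
        for layer_module in mtp_model.model.layers.values():
            mtp_weight = layer_module.shared_head.head.weight.data.float()
            if not torch.allclose(main_weight, mtp_weight, atol=atol):
                return False
        return True

    # Usage sketch: only rebind when the weights are numerically identical.
    # if can_share_lm_head(main_model, self.model):
    #     for layer_module in self.model.model.layers.values():
    #         layer_module.shared_head.head = main_model.lm_head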

Collaborator

@linfeng-yuan linfeng-yuan Oct 24, 2025

As far as I know, MindIE adds a weight validation here to determine whether the lm_head of the non-MTP part can be re-used.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


Labels

merge-conflicts, ready-for-test (start test by label for PR)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants