DeepSeek MTP model uses the lm_head and embedding from the main model #2790
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request correctly implements the sharing of lm_head and embed_tokens for the Deepseek MTP model, which will result in significant GPU memory savings. The changes in vllm_ascend/torchair/models/torchair_deepseek_mtp.py correctly modify the MTP model to not initialize its own lm_head and embed_tokens, and to skip loading their weights.
However, I've found a critical issue in vllm_ascend/spec_decode/mtp_proposer.py where the shared embed_tokens module is assigned incorrectly. This would lead to a runtime error. I've provided a detailed comment and a code suggestion to fix this. Once that is addressed, this PR should be good to go.
for layer_name, layer_module in self.model.model.layers.items():
    layer_module.embed_tokens = main_embed_tokens
    layer_module.shared_head.head = main_lm_head
There appears to be a bug in how the shared embed_tokens module is assigned. The code assigns main_embed_tokens to layer_module.embed_tokens for each layer in a loop. However, the embed_tokens module is an attribute of the parent TorchairDeepSeekMultiTokenPredictor (or CustomDeepSeekMultiTokenPredictor) module, not the individual layer modules. The forward method of the predictor calls self.embed_tokens, which will be None as it is never updated, leading to a TypeError.
The embed_tokens should be assigned to self.model.model.embed_tokens once, before the loop. The loop is only necessary for assigning the main_lm_head to each layer's shared_head.head. I've also simplified the loop to use .values() as the key is not used.
Suggested change:
- for layer_name, layer_module in self.model.model.layers.items():
-     layer_module.embed_tokens = main_embed_tokens
-     layer_module.shared_head.head = main_lm_head
+ self.model.model.embed_tokens = main_embed_tokens
+ for layer_module in self.model.model.layers.values():
+     layer_module.shared_head.head = main_lm_head
| if "rotary_emb.inv_freq" in name: | ||
| continue | ||
| # The mtp model uses the lm_head and embedding from the main model. | ||
| if "shared_head" in name: |
Added code to skip loading the lm_head and embedding weights; the only way to do this is to override load_weights.
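A minimal sketch of that skip logic, assuming a hypothetical filter_mtp_weights helper (the actual PR overrides load_weights on the MTP model itself, and the exact name markers may differ):

from typing import Iterable, Tuple
import torch

# Markers assumed for illustration; they mirror the diff above.
SHARED_WEIGHT_MARKERS = ("shared_head.head", "embed_tokens")

def filter_mtp_weights(
        weights: Iterable[Tuple[str, torch.Tensor]]
) -> Iterable[Tuple[str, torch.Tensor]]:
    """Yield only the checkpoint entries the MTP model should load itself."""
    for name, tensor in weights:
        if "rotary_emb.inv_freq" in name:
            continue
        # The MTP model reuses the main model's lm_head and embedding,
        # so their checkpoint entries are skipped instead of loaded.
        if any(marker in name for marker in SHARED_WEIGHT_MARKERS):
            continue
        yield name, tensor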
Force-pushed from a6f8814 to 04f1c3c.
Codecov Report

❌ Your patch check has failed because the patch coverage (37.31%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #2790      +/-   ##
==========================================
- Coverage   72.99%   72.50%   -0.50%
==========================================
  Files         153      153
  Lines       21331    21340       +9
==========================================
- Hits        15571    15472      -99
- Misses       5760     5868     +108

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Force-pushed from 028548a to 4659fc7.
process_weights_after_loading(self.model, draft_model_config,
                              target_device)

main_embed_tokens = main_model.model.embed_tokens
Could the absence of the redundant lm_head invocation in the dummy run lead to deadlocks in DP load-imbalance scenarios?
Force-pushed from 4659fc7 to 8ab2239.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Force-pushed from 8ab2239 to 4f2c0e6.
Force-pushed from 4f2c0e6 to 7808c35.
Signed-off-by: zzhxx <2783294813@qq.com>
@ApsarasX please add a ready-to-test label
process_weights_after_loading(self.model, draft_model_config,
                              target_device)

# use main model's embedding and LMhead
Let's wait for #3528 to merge first. Then the MTP change can go in the torchair module as well.
# use main model's embedding and LMhead
self.model.model.embed_tokens = main_model.model.embed_tokens
for layer_module in self.model.model.layers.values():
    layer_module.shared_head.head = main_model.lm_head
In the quantized DeepSeek weights provided by the Ascend community that address function call accuracy, the weight values of lm_head and shared_head differ numerically. When using these new quantized weights, this code line will result in precision degradation.
Please check whether there is an elegant way to distinguish these two weights. @wangxiyuan
Comparing lm_head.weight (file: quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors) vs model.layers.61.shared_head.head.weight (file: quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors):
max diff = 0.000000, percentage error = 0.0000%, within 1e-4 = True
Comparing model.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00001-of-00162.safetensors) vs model.layers.61.embed_tokens.weight (file: quant_model_weight_w8a8_dynamic-00162-of-00162.safetensors):
max diff = 0.000000, percentage error = 0.0000%, within 1e-4 = True
I tested the DeepSeek model files myself and found that the weights of lm_head and shared_head are exactly the same, and the embedding weights are also identical.
It has also been tested without affecting the acceptance rate of the MTP layer.
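For reference, a rough sketch of the kind of offline comparison quoted above, assuming the safetensors shard and tensor names from the quantized DeepSeek checkpoint:

from safetensors.torch import load_file

def max_abs_diff(file_a: str, key_a: str, file_b: str, key_b: str) -> float:
    # Load both tensors from their safetensors shards and compare elementwise.
    a = load_file(file_a)[key_a].float()
    b = load_file(file_b)[key_b].float()
    return (a - b).abs().max().item()

diff = max_abs_diff(
    "quant_model_weight_w8a8_dynamic-00158-of-00162.safetensors",
    "lm_head.weight",
    "quant_model_weight_w8a8_dynamic-00161-of-00162.safetensors",
    "model.layers.61.shared_head.head.weight",
)
print(f"max diff = {diff:.6f}, within 1e-4 = {diff < 1e-4}")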
https://modelers.cn/models/Modelers_Park/DeepSeek-V3.1-w8a8-function_call
ModelSlim has added a new quantization option --rot to fix function-calling accuracy for quantized DeepSeek-like models. Re-using the lm_head and shared_head would lead to acceptance-rate degradation for deepseek_r1 (~2%, which can be ignored). The developers of ModelSlim do not recommend the re-use behaviour because of robustness issues. Could you please add a verification here to check the numerical values of the lm_head and shared_head weights?
As far as I know, MindIE adds a weight validation here to determine whether the lm_head of the non-MTP part can be re-used.
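One possible shape such a verification could take (a sketch, not the PR's code; it assumes the MTP copies of the weights are still available at load time so they can be compared before being discarded):

import torch

def can_share(main_weight: torch.Tensor, mtp_weight: torch.Tensor,
              atol: float = 1e-4) -> bool:
    # Only reuse the main model's weight when the MTP checkpoint value
    # matches it numerically; otherwise keep the MTP model's own copy.
    if main_weight.shape != mtp_weight.shape:
        return False
    return torch.allclose(main_weight.float(), mtp_weight.float(), atol=atol)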
This pull request has conflicts, please resolve those before we can evaluate the pull request.
What this PR does / why we need it?
The DeepSeek technical report mentions that the embedding and lm_head layers of the MTP layer are shared with the main model, but the current implementation loads its own complete copies of the embedding and lm_head. In the DeepSeek-R1 model, each of these weights has shape 129280 × 7168, which is about 1.72 GB in fp16.
This PR changes the MTP layer to reuse the lm_head and embedding of the main model, saving about 3.45 GB of GPU memory in the pure DP scenario.
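A quick back-of-the-envelope check of those numbers (fp16 = 2 bytes per element):

vocab_size, hidden_size = 129280, 7168
bytes_per_elem = 2  # fp16
per_weight_gib = vocab_size * hidden_size * bytes_per_elem / 1024**3
total_gib = 2 * per_weight_gib  # embedding + lm_head
print(f"{per_weight_gib:.2f} GiB per weight, {total_gib:.2f} GiB total")
# ~1.73 GiB per weight, ~3.45 GiB total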