[Feat]: Add custom lmhead tensor model parallel #2309
Conversation
Summary of Changes
Hello @lidenghui1110, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've introduced a custom implementation for LMhead tensor model parallelism, primarily targeting Deepseek models on Ascend NPUs. The main purpose of this change is to significantly reduce memory consumption and improve performance, specifically Tokens Per Second Output Throughput (TPOT), in pure Data Parallel (DP) scenarios. This feature supports both eager and graph modes, providing flexibility in deployment.
Highlights
- **New Configuration Parameter**: A new configuration parameter, `lmhead_tensor_parallel_size`, has been added to `additional_config` in `vllm_ascend/ascend_config.py`, allowing users to specify the degree of LMhead tensor parallelism.
- **Dedicated Parallel Group**: A dedicated `_LMTP` (LMhead Tensor Parallel) group coordinator has been introduced in `vllm_ascend/distributed/parallel_state.py` to manage the communication for LMhead tensor parallelism (see the sketch after this list).
- **Custom LMHead and Logits Processing**: Custom implementations of `ParallelLMHead` and `LogitsProcessor`, named `CustomParallelLMHead` and `CustomLogitsProcessor`, have been added in `vllm_ascend/ops/vocab_parallel_embedding.py`. These classes handle the distributed computation of logits, including gathering hidden states and scattering logits across the LMhead tensor parallel group.
- **Integration with vLLM Config**: The `lmhead_tensor_parallel_size` is now propagated from `AscendConfig` to `VllmConfig`'s `ParallelConfig` in `vllm_ascend/platform.py`, ensuring the configuration is accessible throughout the system.
- **Runtime Adjustments**: Adjustments have been made in `vllm_ascend/worker/model_runner_v1.py` to correctly handle sample indices and logits during model execution and dummy runs when LMhead tensor parallelism is enabled, ensuring proper data alignment across parallel ranks.
- **Constraint Enforcement**: Assertions have been added to enforce that `lmhead_tensor_parallel_size` is only enabled in pure Data Parallel (DP) scenarios (where `tensor_parallel_size == 1`) and when graph mode is enabled, ensuring correct usage and preventing misconfigurations.
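The summary above does not include the actual `parallel_state.py` code. As a rough, hedged sketch of how such a dedicated group could be formed (the names `init_lmhead_tp_group` and the consecutive-rank partitioning are illustrative assumptions, not the real `_LMTP` implementation):

```python
# Hedged sketch of a dedicated LMhead tensor-parallel group, loosely modeled on
# the _LMTP coordinator described above. Names are illustrative assumptions,
# not the actual vllm_ascend/distributed/parallel_state.py API.
import torch.distributed as dist

_LMTP = None  # process group used only for LMhead tensor parallelism


def init_lmhead_tp_group(lmhead_tp_size: int):
    """Partition all ranks into consecutive groups of size lmhead_tp_size.

    Assumes torch.distributed is already initialized via init_process_group.
    """
    global _LMTP
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % lmhead_tp_size == 0, (
        "world size must be divisible by lmhead_tensor_parallel_size")
    for start in range(0, world_size, lmhead_tp_size):
        ranks = list(range(start, start + lmhead_tp_size))
        # new_group must be called by every rank for every subgroup.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            _LMTP = group
    return _LMTP
```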
Code Review
The pull request introduces LM head tensor parallelism for pure DP scenarios on Ascend NPUs. The changes look good overall, but I've found a critical bug related to an undefined variable and some opportunities for code improvement, including removing duplicated code and adhering to Python styling conventions.
```python
if _enable_lmhead_tp(): #
    if not with_prefill:
        max_num_reqs_across_dp = padded_num_tokens_across_dp
    else:
        max_num_reqs_across_dp = self.max_num_reqs
    sample_indices = nn.functional.pad(
        sample_indices,
        (0, max_num_reqs_across_dp - sample_indices.shape[0]))
```
The variable sample_indices is used here but it has not been defined anywhere in this function, which will cause a NameError at runtime. Based on the context and the function's return values, it seems this should be logits_indices.
Additionally, the comment on line 1338 has a trailing # which appears to be a typo.
Suggested change:

```diff
-if _enable_lmhead_tp(): #
-    if not with_prefill:
-        max_num_reqs_across_dp = padded_num_tokens_across_dp
-    else:
-        max_num_reqs_across_dp = self.max_num_reqs
-    sample_indices = nn.functional.pad(
-        sample_indices,
-        (0, max_num_reqs_across_dp - sample_indices.shape[0]))
+if _enable_lmhead_tp():
+    if not with_prefill:
+        max_num_reqs_across_dp = padded_num_tokens_across_dp
+    else:
+        max_num_reqs_across_dp = self.max_num_reqs
+    logits_indices = nn.functional.pad(
+        logits_indices,
+        (0, max_num_reqs_across_dp - logits_indices.shape[0]))
```
vllm_ascend/ascend_config.py (outdated)
```python
assert(
    vllm_config.parallel_config.tensor_parallel_size == 1
),"lmhead_tensor_parallel_size is only supported in the pure DP scenario"
assert(
    self.torchair_graph_config.enabled == True
), "lmhead_tensor_parallel_size is only supported in graph mode"
```
The assert statements can be simplified for better readability and to follow standard Python conventions. The parentheses around the assertion condition are unnecessary, and == True can be omitted for boolean checks.
Suggested change:

```diff
-assert(
-    vllm_config.parallel_config.tensor_parallel_size == 1
-),"lmhead_tensor_parallel_size is only supported in the pure DP scenario"
-assert(
-    self.torchair_graph_config.enabled == True
-), "lmhead_tensor_parallel_size is only supported in graph mode"
+assert vllm_config.parallel_config.tensor_parallel_size == 1, (
+    "lmhead_tensor_parallel_size is only supported in the pure DP scenario"
+)
+assert self.torchair_graph_config.enabled, (
+    "lmhead_tensor_parallel_size is only supported in graph mode"
+)
```
vllm_ascend/utils.py (outdated)
```python
def _enable_lmhead_tp() -> bool:
    if get_ascend_config().lmhead_tensor_parallel_size is not None:
        return True
    return False
```
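As a side note, this helper could be written more concisely; a sketch with the same behavior:

```python
def _enable_lmhead_tp() -> bool:
    # Same behavior as above: the feature is on iff the config value is set.
    return get_ascend_config().lmhead_tensor_parallel_size is not None
```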
```python
if not self.in_profile_run and _enable_lmhead_tp():
    if not with_prefill:
        max_num_reqs_across_dp = num_reqs
    else:
        max_num_reqs_across_dp = max_num_reqs
    dummy_indices = torch.zeros(max_num_reqs_across_dp,
                                device=hidden_states.device,
                                dtype=torch.int32)
    model.compute_logits(hidden_states[dummy_indices], None)
```
This block of code, which performs a dummy compute_logits call, is a duplicate of the block at lines 2002-2010. The earlier block is always executed when _enable_lmhead_tp() is true, making this second block inside the deepseek_mtp condition redundant. Removing this duplication will improve code maintainability.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2309      +/-   ##
==========================================
+ Coverage   72.40%   72.61%   +0.20%
==========================================
  Files         146      147       +1
  Lines       21598    21805     +207
==========================================
+ Hits        15639    15833     +194
- Misses       5959     5972      +13
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
ce812a9 to 3b313b7
### What this PR does / why we need it?

In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In this PR we customize the parallelism of LMHead, enabling a separate TP for LMHead. The computation flow is as follows:

```
get_lmhead_group().all_gather      # [num_tokens, hid_dim] --> [num_tokens * lmhead_tp, hid_dim]
--> lmhead matmul                  # [num_tokens * lmhead_tp, hid_dim] --> [num_tokens * lmhead_tp, vocab_size // lmhead_tp]
--> get_lmhead_group().all_to_all  # [num_tokens * lmhead_tp, vocab_size // lmhead_tp] --> [num_tokens, vocab_size]
```

This can save 0.5~1 ms for DeepSeek with 28 BS on a single die with MTP.

In addition, this PR also fixes a bug introduced by LMHead quantization: the op `npu_quant_matmul` only accepts dim < 65536, while `vocab_size` is > 65536 if using TP 1. We can set the lmhead TP size > 1 to avoid this bug.

Main version of this PR: #2309 .

### Does this PR introduce _any_ user-facing change?

Yes. We introduced another configurable option, `lmhead_tp_size`, in ascend_config. For example:

```
additional_config={
    "lmhead_tp_size": 16,
}
```

The default value is -1, and `lmhead_tp_size` is automatically set to `tensor_parallel_size` in this case. It is suggested to use it when running full DP to avoid the additional communication introduced by TP. Accordingly, the parallel size of the `lmhead` group is also changed to `tensor_parallel_size` if TP > 1, so as to fall back to the normal TP+DP case.

### How was this patch tested?

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
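To make the gather → matmul → all_to_all flow above concrete, here is a hedged, standalone sketch using plain torch.distributed collectives on a generic `lmhead_group`; the real implementation lives in `CustomParallelLMHead` / `CustomLogitsProcessor` and additionally handles padding, quantization, and graph mode, so treat this only as an illustration:

```python
import torch
import torch.distributed as dist


def lmhead_tp_logits(hidden: torch.Tensor,         # [num_tokens, hid_dim], same num_tokens on every rank (padded)
                     lm_head_shard: torch.Tensor,  # [vocab_size // lmhead_tp, hid_dim], this rank's vocab shard
                     lmhead_group) -> torch.Tensor:
    tp = dist.get_world_size(group=lmhead_group)

    # 1) all_gather: collect the hidden states of every rank in the group.
    #    [num_tokens, hid_dim] -> [num_tokens * tp, hid_dim]
    gathered = [torch.empty_like(hidden) for _ in range(tp)]
    dist.all_gather(gathered, hidden, group=lmhead_group)
    gathered = torch.cat(gathered, dim=0)

    # 2) Local matmul against this rank's vocab shard:
    #    [num_tokens * tp, hid_dim] -> [num_tokens * tp, vocab_size // tp]
    partial_logits = gathered @ lm_head_shard.t()

    # 3) all_to_all: trade the token axis for the vocab axis, so each rank
    #    ends up with full-vocab logits for its own tokens:
    #    [num_tokens * tp, vocab_size // tp] -> [num_tokens, vocab_size]
    in_chunks = list(partial_logits.chunk(tp, dim=0))   # chunk i = rank i's tokens
    out_chunks = [torch.empty_like(c) for c in in_chunks]
    dist.all_to_all(out_chunks, in_chunks, group=lmhead_group)
    return torch.cat(out_chunks, dim=1)
```

Note that the all_to_all step assumes every rank in the group contributes the same number of (padded) tokens, which is why the runner pads `logits_indices` up to `max_num_reqs_across_dp` and issues dummy `compute_logits` calls on idle ranks.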
This pull request has conflicts, please resolve those before we can evaluate the pull request.
047ef04 to b52fda7
49b047a to 778b40a
```python
if self.use_aux_hidden_state_outputs:
    hidden_states, aux_hidden_states = hidden_states

if lmhead_tp_enable(): #
```
If compute_logits is not executed for real requests but is executed in dummy_run, will this cause the requests to wait without making progress?
This pull request has conflicts, please resolve those before we can evaluate the pull request.
a32dba3 to 02b5d4b
…mode scenarios. Signed-off-by: zzhx1 <zzh_201018@outlook.com>
The test failure is not related to this PR.
What this PR does / why we need it?
This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT performance. It supports both eager mode and graph mode.
In a DeepSeek R1 W8A8 PD-disaggregated Decode instance, using pure DP with lmhead_tensor_parallel_size = 8, we observed a 1 ms TPOT improvement and saved 1.48 GB of NPU memory per rank.
performance data:

Does this PR introduce any user-facing change?
This PR introduces one new config in `additional_config`.

| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces | No | int | Default value is None; once this value is set, the feature is enabled. vocab_size must be divisible by this value. |

example: `--additional_config={"lmhead_tensor_parallel_size": 8}`

How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@de533ab
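For completeness, a hedged offline-inference sketch of the user-facing change (assuming the `LLM` entry point forwards `additional_config` the same way the `--additional_config` CLI flag does; the model name and sizes are placeholders, not from this PR):

```python
from vllm import LLM

# Hedged usage sketch on a vllm-ascend deployment; assumes LLM() accepts
# additional_config like the --additional_config CLI flag.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",       # placeholder model name
    tensor_parallel_size=1,                # feature targets the pure-DP case (TP == 1)
    additional_config={
        "lmhead_tensor_parallel_size": 8,  # vocab_size must be divisible by 8
    },
)
print(llm.generate("Hello")[0].outputs[0].text)
```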