
Conversation

@lidenghui1110 (Contributor) commented Aug 11, 2025

What this PR does / why we need it?

This PR introduces LM head tensor model parallelism to reduce memory consumption and improve TPOT. It supports both eager mode and graph mode.

On a DeepSeek R1 W8A8 PD-disaggregated decode instance running pure DP with lmhead_tensor_parallel_size = 8, we observed about 1 ms of TPOT improvement and saved 1.48 GB of NPU memory per rank.

performance data:
[benchmark results image]

Does this PR introduce any user-facing change?

This PR introduces one new config in additional_config.

| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column (vocab_size) dimension into lmhead_tensor_parallel_size pieces | No | int | Default is None; once a value is set, the feature is enabled. vocab_size must be divisible by this value. |

example

--additional_config={"lmhead_tensor_parallel_size": 8}
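
For illustration, a minimal offline-inference sketch is shown below. The model name and other engine arguments are placeholders chosen for the example; only `additional_config={"lmhead_tensor_parallel_size": ...}` is the option introduced by this PR.

```python
from vllm import LLM, SamplingParams

# Hypothetical setup: pure-DP decode with the LM head split across 8 ranks.
# Only additional_config / lmhead_tensor_parallel_size comes from this PR;
# the model path and other arguments are placeholders.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-W8A8",   # placeholder model path
    tensor_parallel_size=1,                  # the feature targets the pure DP scenario
    additional_config={"lmhead_tensor_parallel_size": 8},
)

outputs = llm.generate(["Hello, my name is"],
                       SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```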

How was this patch tested?


@gemini-code-assist bot left a comment


Summary of Changes

Hello @lidenghui1110, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've introduced a custom implementation for LMhead tensor model parallelism, primarily targeting Deepseek models on Ascend NPUs. The main purpose of this change is to significantly reduce memory consumption and improve performance, specifically time per output token (TPOT), in pure Data Parallel (DP) scenarios. This feature supports both eager and graph modes, providing flexibility in deployment.

Highlights

  • New Configuration Parameter: A new configuration parameter, lmhead_tensor_parallel_size, has been added to additional_config in vllm_ascend/ascend_config.py, allowing users to specify the degree of LMhead tensor parallelism.
  • Dedicated Parallel Group: A dedicated _LMTP (LMhead Tensor Parallel) group coordinator has been introduced in vllm_ascend/distributed/parallel_state.py to manage the communication for LMhead tensor parallelism.
  • Custom LMHead and Logits Processing: Custom implementations of ParallelLMHead and LogitsProcessor, named CustomParallelLMHead and CustomLogitsProcessor, have been added in vllm_ascend/ops/vocab_parallel_embedding.py. These classes handle the distributed computation of logits, including gathering hidden states and scattering logits across the LMhead tensor parallel group (a rough sketch of this flow follows the list).
  • Integration with VLLM Config: The lmhead_tensor_parallel_size is now propagated from AscendConfig to VllmConfig's ParallelConfig in vllm_ascend/platform.py, ensuring the configuration is accessible throughout the system.
  • Runtime Adjustments: Adjustments have been made in vllm_ascend/worker/model_runner_v1.py to correctly handle sample indices and logits during model execution and dummy runs when LMhead tensor parallelism is enabled, ensuring proper data alignment across parallel ranks.
  • Constraint Enforcement: Assertions have been added to enforce that lmhead_tensor_parallel_size is only enabled in pure Data Parallel (DP) scenarios (where tensor_parallel_size == 1) and when graph mode is enabled, ensuring correct usage and preventing misconfigurations.
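
To make the logits path concrete, here is a rough PyTorch-style sketch of the gather/compute/scatter flow inside the LM head TP group. The function and argument names are illustrative assumptions (the actual implementation lives in CustomParallelLMHead / CustomLogitsProcessor), and it assumes every rank contributes the same padded number of tokens, which is what the sample-index padding mentioned above provides.

```python
import torch
import torch.distributed as dist


def lmhead_tp_logits(hidden_states: torch.Tensor,
                     vocab_shard_weight: torch.Tensor,
                     lmhead_group: dist.ProcessGroup,
                     lmhead_tp: int) -> torch.Tensor:
    """Illustrative flow only: gather hidden states across the LM head TP
    group, multiply by this rank's vocab shard, then all-to-all so each
    rank ends up with full-vocab logits for its own tokens."""
    num_tokens, hid_dim = hidden_states.shape

    # all_gather: [num_tokens, hid_dim] -> [num_tokens * lmhead_tp, hid_dim]
    gathered = torch.empty(num_tokens * lmhead_tp, hid_dim,
                           dtype=hidden_states.dtype,
                           device=hidden_states.device)
    dist.all_gather_into_tensor(gathered, hidden_states, group=lmhead_group)

    # column-parallel matmul against this rank's lm_head shard:
    # [num_tokens * lmhead_tp, hid_dim] @ [hid_dim, vocab_size // lmhead_tp]
    partial_logits = gathered @ vocab_shard_weight

    # all_to_all: exchange partial logits so this rank keeps only its own
    # tokens but receives every vocab shard for them:
    # [num_tokens * lmhead_tp, vocab_size // lmhead_tp] -> [num_tokens, vocab_size]
    exchanged = torch.empty_like(partial_logits)
    dist.all_to_all_single(exchanged, partial_logits, group=lmhead_group)
    return (exchanged.reshape(lmhead_tp, num_tokens, -1)
            .transpose(0, 1)
            .reshape(num_tokens, -1))
```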


@gemini-code-assist bot left a comment


Code Review

The pull request introduces LM head tensor parallelism for pure DP scenarios on Ascend NPUs. The changes look good overall, but I've found a critical bug related to an undefined variable and some opportunities for code improvement, including removing duplicated code and adhering to Python styling conventions.

Comment on lines 1338 to 1378:

if _enable_lmhead_tp(): #
    if not with_prefill:
        max_num_reqs_across_dp = padded_num_tokens_across_dp
    else:
        max_num_reqs_across_dp = self.max_num_reqs
    sample_indices = nn.functional.pad(
        sample_indices,
        (0, max_num_reqs_across_dp - sample_indices.shape[0]))

critical

The variable sample_indices is used here but it has not been defined anywhere in this function, which will cause a NameError at runtime. Based on the context and the function's return values, it seems this should be logits_indices.

Additionally, the comment on line 1338 has a trailing # which appears to be a typo.

Suggested change (before):

if _enable_lmhead_tp(): #
    if not with_prefill:
        max_num_reqs_across_dp = padded_num_tokens_across_dp
    else:
        max_num_reqs_across_dp = self.max_num_reqs
    sample_indices = nn.functional.pad(
        sample_indices,
        (0, max_num_reqs_across_dp - sample_indices.shape[0]))

Suggested change (after):

if _enable_lmhead_tp():
    if not with_prefill:
        max_num_reqs_across_dp = padded_num_tokens_across_dp
    else:
        max_num_reqs_across_dp = self.max_num_reqs
    logits_indices = nn.functional.pad(
        logits_indices,
        (0, max_num_reqs_across_dp - logits_indices.shape[0]))

Comment on lines 54 to 64:

assert(
    vllm_config.parallel_config.tensor_parallel_size == 1
), "lmhead_tensor_parallel_size is only supported in the pure DP scenario"
assert(
    self.torchair_graph_config.enabled == True
), "lmhead_tensor_parallel_size is only supported in graph mode"

medium

The assert statements can be simplified for better readability and to follow standard Python conventions. The parentheses around the assertion condition are unnecessary, and == True can be omitted for boolean checks.

Suggested change (before):

assert(
    vllm_config.parallel_config.tensor_parallel_size == 1
), "lmhead_tensor_parallel_size is only supported in the pure DP scenario"
assert(
    self.torchair_graph_config.enabled == True
), "lmhead_tensor_parallel_size is only supported in graph mode"

Suggested change (after):

assert vllm_config.parallel_config.tensor_parallel_size == 1, (
    "lmhead_tensor_parallel_size is only supported in the pure DP scenario"
)
assert self.torchair_graph_config.enabled, (
    "lmhead_tensor_parallel_size is only supported in graph mode"
)

Comment on lines 510 to 513:

def _enable_lmhead_tp() -> bool:
    if get_ascend_config().lmhead_tensor_parallel_size is not None:
        return True
    return False

medium

The function _enable_lmhead_tp can be simplified to a single return statement for conciseness.

def _enable_lmhead_tp() -> bool:
    return get_ascend_config().lmhead_tensor_parallel_size is not None

Comment on lines 2020 to 2027:

if not self.in_profile_run and _enable_lmhead_tp():
    if not with_prefill:
        max_num_reqs_across_dp = num_reqs
    else:
        max_num_reqs_across_dp = max_num_reqs
    dummy_indices = torch.zeros(max_num_reqs_across_dp,
                                device=hidden_states.device,
                                dtype=torch.int32)
    model.compute_logits(hidden_states[dummy_indices], None)

medium

This block of code, which performs a dummy compute_logits call, is a duplicate of the block at lines 2002-2010. The earlier block is always executed when _enable_lmhead_tp() is true, making this second block inside the deepseek_mtp condition redundant. Removing this duplication will improve code maintainability.

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run the linting and testing checks locally according to Contributing and Testing.

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@codecov

codecov bot commented Aug 12, 2025

Codecov Report

❌ Patch coverage is 84.09091% with 35 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.61%. Comparing base (dfc7eb3) to head (8c80738).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| :--- | :--- | :--- |
| vllm_ascend/worker/model_runner_v1.py | 0.00% | 17 Missing ⚠️ |
| vllm_ascend/ops/vocab_parallel_embedding.py | 85.91% | 10 Missing ⚠️ |
| vllm_ascend/worker/mtp_proposer_v1.py | 11.11% | 8 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2309      +/-   ##
==========================================
+ Coverage   72.40%   72.61%   +0.20%     
==========================================
  Files         146      147       +1     
  Lines       21598    21805     +207     
==========================================
+ Hits        15639    15833     +194     
- Misses       5959     5972      +13     
| Flag | Coverage Δ |
| :--- | :--- |
| unittests | 72.61% <84.09%> (+0.20%) ⬆️ |


zzhx1 force-pushed the lmhead-tp branch 2 times, most recently from ce812a9 to 3b313b7 (August 12, 2025 11:23)
ganyi1996ppo pushed a commit that referenced this pull request Aug 14, 2025
### What this PR does / why we need it?
In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In
this PR we customize the parallelism of LMHead, enabling a separate TP
group for LMHead. The computation flow is as follows:

```
get_lmhead_group().all_gather  # [num_tokens, hid_dim] -->  [num_tokens * lmhead_tp, hid_dim]
--> lmhead matmul  # [num_tokens * lmhead_tp, hid_dim] -->  [num_tokens * lmhead_tp, vocab_size //  lmhead_tp]
--> get_lmhead_group().all_to_all  # [num_tokens * lmhead_tp, vocab_size //  lmhead_tp] --> [num_tokens, vocab_size]
```

This can reduce latency by 0.5~1 ms for DeepSeek with batch size 28 on a single die and with MTP.

In addition, this PR also fixes a bug introduced by LMHead
quantization. The op `npu_quant_matmul` only accepts dims < 65536, while
`vocab_size` is > 65536 when using TP 1. Setting lmhead tp size > 1
avoids this bug.

Main version of this PR: #2309 .

### Does this PR introduce _any_ user-facing change?
Yes. We introduced another configurable options `lmhead_tp_size` in
ascend_config. For example:
```
additional_config={
        "lmhead_tp_size": 16,
}
```
The default value is -1, in which case `lmhead_tp_size` is automatically set to
`tensor_parallel_size`. It is recommended to use this option when running full DP,
to avoid the additional communication introduced by TP. Accordingly, if TP > 1,
the parallel size of the `lmhead` group is also set to `tensor_parallel_size`,
falling back to the normal TP+DP case.
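
As a reading aid, the fallback rule above can be summarized by a small sketch (the helper name is an assumption for illustration, not the actual code):

```python
def resolve_lmhead_tp_size(lmhead_tp_size: int, tensor_parallel_size: int) -> int:
    # Default -1, or running with TP > 1, falls back to the regular
    # tensor_parallel_size so the lmhead group matches the normal TP+DP case.
    if lmhead_tp_size == -1 or tensor_parallel_size > 1:
        return tensor_parallel_size
    return lmhead_tp_size
```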

### How was this patch tested?


---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zzhx1 force-pushed the lmhead-tp branch 2 times, most recently from 047ef04 to b52fda7 (August 18, 2025 13:34)
zzhx1 force-pushed the lmhead-tp branch 3 times, most recently from 49b047a to 778b40a (August 19, 2025 02:41)
if self.use_aux_hidden_state_outputs:
    hidden_states, aux_hidden_states = hidden_states

if lmhead_tp_enable(): #
Collaborator

If compute_logits is not executed for real requests but is executed in dummy_run, will this cause those requests to wait without making progress?

@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

zzhx1 added 10 commits August 29, 2025 10:22
…mode scenarios.

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
@wangxiyuan
Collaborator

The test failure is not related to this PR.

@wangxiyuan wangxiyuan merged commit 600b08f into vllm-project:main Aug 29, 2025
25 of 27 checks passed
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Aug 29, 2025
weijinqian0 pushed a commit to weijinqian0/vllm-ascend that referenced this pull request Aug 29, 2025
anon189Ty added a commit to anon189Ty/vllm-ascend that referenced this pull request Aug 29, 2025
Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>

add cann version judgment

update ut

correct spelling errors

Update ut

Support v0.10.1 (vllm-project#2584)

This patch also supports v0.10.1

No

- CI passed
- test 0.10.1: vllm-project#2583
- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@321938e

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>

[Fix] Fix DP-related padding logic (vllm-project#2582)

The determination of attention state, padding, and other forward
metadata has been moved to an earlier stage within the input preparation
process. This change enables us to utilize a single all-reduce
operation, maximizing synchronization efficiency as early as possible.

The logic for synchronizing metadata—such as the number of tokens,
prefill status, and DBO status—across data parallel (DP) ranks has now
been unified and simplified.

For performance improvements, the all-reduce operation has been switched
from the `gloo` backend to the `npu` backend, which results in a
reduction of several milliseconds per step (**approximately 10%
performance gain for TPOT!**).

Additionally, the multi-DP server hang issue has been resolved, ensuring
no more hangs occur when `num_requests < dp_size`. At last, a relief.

Finally, the miscalculated memory usage issue has been addressed by
removing the unnecessary `DummyCommImpl`, allowing the system to use the
real communication method when determining available memory.

None.

Maybe we should add an test case for multi-DP online server?
@MengqingCao

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@c5d004a

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

[CI] Add e2e ci test for A3 (vllm-project#2573)

Add e2e ci test for A3

- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@11a7faf

Signed-off-by: hfadzxy <starmoon_zhang@163.com>

[Feat]: Add custom lmhead tensor model parallel (vllm-project#2309)


Fix import bug

Remove whitespace
pillumina pushed a commit to pillumina/vllm-ascend that referenced this pull request Sep 4, 2025
pillumina pushed a commit to pillumina/vllm-ascend that referenced this pull request Sep 4, 2025
wenba0 pushed a commit to wenba0/vllm-ascend that referenced this pull request Sep 5, 2025
wangxiaoteng888 pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Sep 25, 2025
chopper0126 pushed a commit to chopper0126/vllm-ascend that referenced this pull request Sep 26, 2025
Angazenn pushed a commit to Angazenn/vllm-ascend that referenced this pull request Oct 21, 2025

Labels

documentation, module:core, module:ops, module:tests
