[Feat]: Add custom lmhead tensor model parallel #2309
Conversation
Summary of Changes
Hello @lidenghui1110, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've introduced a custom implementation for LMhead tensor model parallelism, primarily targeting Deepseek models on Ascend NPUs. The main purpose of this change is to significantly reduce memory consumption and improve performance, specifically Tokens Per Second Output Throughput (TPOT), in pure Data Parallel (DP) scenarios. This feature supports both eager and graph modes, providing flexibility in deployment.
Highlights
- **New Configuration Parameter**: A new configuration parameter, `lmhead_tensor_parallel_size`, has been added to `additional_config` in `vllm_ascend/ascend_config.py`, allowing users to specify the degree of LMhead tensor parallelism.
- **Dedicated Parallel Group**: A dedicated `_LMTP` (LMhead Tensor Parallel) group coordinator has been introduced in `vllm_ascend/distributed/parallel_state.py` to manage the communication for LMhead tensor parallelism (see the sketch after this list).
- **Custom LMHead and Logits Processing**: Custom implementations of `ParallelLMHead` and `LogitsProcessor`, named `CustomParallelLMHead` and `CustomLogitsProcessor`, have been added in `vllm_ascend/ops/vocab_parallel_embedding.py`. These classes handle the distributed computation of logits, including gathering hidden states and scattering logits across the LMhead tensor parallel group.
- **Integration with vLLM Config**: The `lmhead_tensor_parallel_size` is now propagated from `AscendConfig` to `VllmConfig`'s `ParallelConfig` in `vllm_ascend/platform.py`, ensuring the configuration is accessible throughout the system.
- **Runtime Adjustments**: Adjustments have been made in `vllm_ascend/worker/model_runner_v1.py` to correctly handle sample indices and logits during model execution and dummy runs when LMhead tensor parallelism is enabled, ensuring proper data alignment across parallel ranks.
- **Constraint Enforcement**: Assertions have been added to enforce that `lmhead_tensor_parallel_size` is only enabled in pure Data Parallel (DP) scenarios (where `tensor_parallel_size == 1`) and when graph mode is enabled, ensuring correct usage and preventing misconfigurations.
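The summary above does not include the actual `parallel_state.py` code. As a rough, hedged sketch of how such a dedicated group could be formed (the names `init_lmhead_tp_group` and the consecutive-rank partitioning are illustrative assumptions, not the real `_LMTP` implementation):

```python
# Hedged sketch of a dedicated LMhead tensor-parallel group, loosely modeled on
# the _LMTP coordinator described above. Names are illustrative assumptions,
# not the actual vllm_ascend/distributed/parallel_state.py API.
import torch.distributed as dist

_LMTP = None  # process group used only for LMhead tensor parallelism


def init_lmhead_tp_group(lmhead_tp_size: int):
    """Partition all ranks into consecutive groups of size lmhead_tp_size.

    Assumes torch.distributed is already initialized via init_process_group.
    """
    global _LMTP
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    assert world_size % lmhead_tp_size == 0, (
        "world size must be divisible by lmhead_tensor_parallel_size")
    for start in range(0, world_size, lmhead_tp_size):
        ranks = list(range(start, start + lmhead_tp_size))
        # new_group must be called by every rank for every subgroup.
        group = dist.new_group(ranks=ranks)
        if rank in ranks:
            _LMTP = group
    return _LMTP
```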
Code Review
The pull request introduces LM head tensor parallelism for pure DP scenarios on Ascend NPUs. The changes look good overall, but I've found a critical bug related to an undefined variable and some opportunities for code improvement, including removing duplicated code and adhering to Python styling conventions.
```python
if _enable_lmhead_tp(): #
    if not with_prefill:
        max_num_reqs_across_dp = padded_num_tokens_across_dp
    else:
        max_num_reqs_across_dp = self.max_num_reqs
    sample_indices = nn.functional.pad(
        sample_indices,
        (0, max_num_reqs_across_dp - sample_indices.shape[0]))
```
The variable sample_indices is used here but it has not been defined anywhere in this function, which will cause a NameError at runtime. Based on the context and the function's return values, it seems this should be logits_indices.
Additionally, the comment on line 1338 has a trailing # which appears to be a typo.
Suggested change:

```diff
-if _enable_lmhead_tp(): #
-    if not with_prefill:
-        max_num_reqs_across_dp = padded_num_tokens_across_dp
-    else:
-        max_num_reqs_across_dp = self.max_num_reqs
-    sample_indices = nn.functional.pad(
-        sample_indices,
-        (0, max_num_reqs_across_dp - sample_indices.shape[0]))
+if _enable_lmhead_tp():
+    if not with_prefill:
+        max_num_reqs_across_dp = padded_num_tokens_across_dp
+    else:
+        max_num_reqs_across_dp = self.max_num_reqs
+    logits_indices = nn.functional.pad(
+        logits_indices,
+        (0, max_num_reqs_across_dp - logits_indices.shape[0]))
```
vllm_ascend/ascend_config.py (outdated)
```python
assert(
    vllm_config.parallel_config.tensor_parallel_size == 1
),"lmhead_tensor_parallel_size is only supported in the pure DP scenario"
assert(
    self.torchair_graph_config.enabled == True
), "lmhead_tensor_parallel_size is only supported in graph mode"
```
The assert statements can be simplified for better readability and to follow standard Python conventions. The parentheses around the assertion condition are unnecessary, and == True can be omitted for boolean checks.
Suggested change:

```diff
-assert(
-    vllm_config.parallel_config.tensor_parallel_size == 1
-),"lmhead_tensor_parallel_size is only supported in the pure DP scenario"
-assert(
-    self.torchair_graph_config.enabled == True
-), "lmhead_tensor_parallel_size is only supported in graph mode"
+assert vllm_config.parallel_config.tensor_parallel_size == 1, (
+    "lmhead_tensor_parallel_size is only supported in the pure DP scenario"
+)
+assert self.torchair_graph_config.enabled, (
+    "lmhead_tensor_parallel_size is only supported in graph mode"
+)
```
vllm_ascend/utils.py (outdated)
```python
def _enable_lmhead_tp() -> bool:
    if get_ascend_config().lmhead_tensor_parallel_size is not None:
        return True
    return False
```
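As a side note, this helper could be written more concisely; a sketch with the same behavior:

```python
def _enable_lmhead_tp() -> bool:
    # Same behavior as above: the feature is on iff the config value is set.
    return get_ascend_config().lmhead_tensor_parallel_size is not None
```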
```python
if not self.in_profile_run and _enable_lmhead_tp():
    if not with_prefill:
        max_num_reqs_across_dp = num_reqs
    else:
        max_num_reqs_across_dp = max_num_reqs
    dummy_indices = torch.zeros(max_num_reqs_across_dp,
                                device=hidden_states.device,
                                dtype=torch.int32)
    model.compute_logits(hidden_states[dummy_indices], None)
```
This block of code, which performs a dummy compute_logits call, is a duplicate of the block at lines 2002-2010. The earlier block is always executed when _enable_lmhead_tp() is true, making this second block inside the deepseek_mtp condition redundant. Removing this duplication will improve code maintainability.
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2309      +/-   ##
==========================================
+ Coverage   72.40%   72.61%   +0.20%
==========================================
  Files         146      147       +1
  Lines       21598    21805     +207
==========================================
+ Hits        15639    15833     +194
- Misses       5959     5972      +13
```

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
ce812a9 to 3b313b7
### What this PR does / why we need it?

In pure DP scenarios (such as DP32), LMHead computation takes 1~2 ms. In this PR we customize the parallelism of LMHead, enabling a separate TP for LMHead. The computation flow is as follows:

```
get_lmhead_group().all_gather      # [num_tokens, hid_dim] --> [num_tokens * lmhead_tp, hid_dim]
--> lmhead matmul                  # [num_tokens * lmhead_tp, hid_dim] --> [num_tokens * lmhead_tp, vocab_size // lmhead_tp]
--> get_lmhead_group().all_to_all  # [num_tokens * lmhead_tp, vocab_size // lmhead_tp] --> [num_tokens, vocab_size]
```

This can save 0.5~1 ms for DeepSeek with 28 BS on a single die with MTP.

In addition, this PR also fixes a bug introduced by LMHead quantization: the op `npu_quant_matmul` only accepts dim < 65536, while `vocab_size` is > 65536 if using TP 1. We can set the lmhead TP size > 1 to avoid this bug.

Main version of this PR: #2309 .

### Does this PR introduce _any_ user-facing change?

Yes. We introduced another configurable option, `lmhead_tp_size`, in ascend_config. For example:

```
additional_config={
    "lmhead_tp_size": 16,
}
```

The default value is -1, and `lmhead_tp_size` is automatically set to `tensor_parallel_size` in this case. It is suggested to use it when running full DP to avoid the additional communication introduced by TP. Accordingly, the parallel size of the `lmhead` group is also changed to `tensor_parallel_size` if TP > 1, so as to fall back to the normal TP+DP case.

### How was this patch tested?

---------

Signed-off-by: angazenn <zengyanjia@huawei.com>
Signed-off-by: zengyanjia <z00883269@china.huawei.com>
Co-authored-by: angazenn <zengyanjia@huawei.com>
Co-authored-by: zengyanjia <z00883269@china.huawei.com>
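To make the gather → matmul → all_to_all flow above concrete, here is a hedged, standalone sketch using plain torch.distributed collectives on a generic `lmhead_group`; the real implementation lives in `CustomParallelLMHead` / `CustomLogitsProcessor` and additionally handles padding, quantization, and graph mode, so treat this only as an illustration:

```python
import torch
import torch.distributed as dist


def lmhead_tp_logits(hidden: torch.Tensor,         # [num_tokens, hid_dim], same num_tokens on every rank (padded)
                     lm_head_shard: torch.Tensor,  # [vocab_size // lmhead_tp, hid_dim], this rank's vocab shard
                     lmhead_group) -> torch.Tensor:
    tp = dist.get_world_size(group=lmhead_group)

    # 1) all_gather: collect the hidden states of every rank in the group.
    #    [num_tokens, hid_dim] -> [num_tokens * tp, hid_dim]
    gathered = [torch.empty_like(hidden) for _ in range(tp)]
    dist.all_gather(gathered, hidden, group=lmhead_group)
    gathered = torch.cat(gathered, dim=0)

    # 2) Local matmul against this rank's vocab shard:
    #    [num_tokens * tp, hid_dim] -> [num_tokens * tp, vocab_size // tp]
    partial_logits = gathered @ lm_head_shard.t()

    # 3) all_to_all: trade the token axis for the vocab axis, so each rank
    #    ends up with full-vocab logits for its own tokens:
    #    [num_tokens * tp, vocab_size // tp] -> [num_tokens, vocab_size]
    in_chunks = list(partial_logits.chunk(tp, dim=0))   # chunk i = rank i's tokens
    out_chunks = [torch.empty_like(c) for c in in_chunks]
    dist.all_to_all(out_chunks, in_chunks, group=lmhead_group)
    return torch.cat(out_chunks, dim=1)
```

Note that the all_to_all step assumes every rank in the group contributes the same number of (padded) tokens, which is why the runner pads `logits_indices` up to `max_num_reqs_across_dp` and issues dummy `compute_logits` calls on idle ranks.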
This pull request has conflicts, please resolve those before we can evaluate the pull request.
047ef04 to b52fda7
49b047a to 778b40a
```python
if self.use_aux_hidden_state_outputs:
    hidden_states, aux_hidden_states = hidden_states

if lmhead_tp_enable(): #
```
If compute_logits is not executed for real requests but is executed in dummy_run, will this cause the requests to wait without making progress?
This pull request has conflicts, please resolve those before we can evaluate the pull request.
a32dba3 to 02b5d4b
…mode scenarios. Signed-off-by: zzhx1 <zzh_201018@outlook.com>
The test failure is not related to this PR.
What this PR does / why we need it?
This PR introduces LMhead tensor model parallelism to reduce memory consumption and improve TPOT performance. It supports both eager mode and graph mode.
In a DeepSeek R1 W8A8 PD-disaggregated Decode instance, using pure DP with lmhead_tensor_parallel_size = 8, we observed a 1 ms TPOT improvement and saved 1.48 GB of NPU memory per rank.
performance data:

Does this PR introduce any user-facing change?
This PR introduces one new config in `additional_config`.

| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| lmhead_tensor_parallel_size | Split the lm_head matrix along the column dimension (vocab_size) into lmhead_tensor_parallel_size pieces | No | int | Default value is None; once this value is set, the feature is enabled. vocab_size must be divisible by this value. |

example: `--additional_config={"lmhead_tensor_parallel_size": 8}`

How was this patch tested?

- vLLM version: v0.10.1.1
- vLLM main: vllm-project/vllm@de533ab
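For completeness, a hedged offline-inference sketch of the user-facing change (assuming the `LLM` entry point forwards `additional_config` the same way the `--additional_config` CLI flag does; the model name and sizes are placeholders, not from this PR):

```python
from vllm import LLM

# Hedged usage sketch on a vllm-ascend deployment; assumes LLM() accepts
# additional_config like the --additional_config CLI flag.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",       # placeholder model name
    tensor_parallel_size=1,                # feature targets the pure-DP case (TP == 1)
    additional_config={
        "lmhead_tensor_parallel_size": 8,  # vocab_size must be divisible by 8
    },
)
print(llm.generate("Hello")[0].outputs[0].text)
```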