### What this PR does / why we need it?
This PR introduces LMHead tensor model parallelism to reduce memory
consumption and improve TPOT. It supports both eager mode and graph mode.
On a DeepSeek R1 W8A8 PD-disaggregated decode instance using pure DP, with
`lmhead_tensor_parallel_size = 8`, we observed a 1 ms TPOT improvement and
saved 1.48 GB of NPU memory per rank.
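This is not the PR's actual code, but a minimal PyTorch sketch of the underlying idea (the function and argument names are illustrative): each rank in the lmhead TP group holds a `vocab_size / lmhead_tensor_parallel_size` slice of the lm_head weight, computes its logits shard, and the shards are all-gathered to recover the full logits.

```python
import torch
import torch.distributed as dist

def lmhead_tp_logits(hidden: torch.Tensor,
                     weight_shard: torch.Tensor,
                     group: dist.ProcessGroup) -> torch.Tensor:
    """Sketch: full logits from a vocab-sharded lm_head weight.

    hidden:       [num_tokens, hidden_size]
    weight_shard: [vocab_size // tp, hidden_size], this rank's rows
    """
    # Each rank computes only its slice of the vocabulary logits.
    local_logits = hidden @ weight_shard.t()      # [num_tokens, vocab/tp]
    tp = dist.get_world_size(group=group)
    shards = [torch.empty_like(local_logits) for _ in range(tp)]
    # Reassemble the full vocabulary dimension across the lmhead TP group.
    dist.all_gather(shards, local_logits, group=group)
    return torch.cat(shards, dim=-1)              # [num_tokens, vocab_size]
```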
Performance data:

<img width="1444" height="438" alt="image" src="https://github.com/user-attachments/assets/3c5ef0d3-a7c7-46fd-9797-4de728eb0cb0" />
### Does this PR introduce _any_ user-facing change?
This PR introduces one new configuration option in `additional_config`.
| Name | Effect | Required | Type | Constraints |
| :--- | :--- | :--- | :--- | :--- |
| `lmhead_tensor_parallel_size` | Splits the lm_head matrix along the column dimension (`vocab_size`) into `lmhead_tensor_parallel_size` pieces | No | int | Default is `None`; setting a value enables the feature. `vocab_size` must be divisible by this value. |
Example:
`--additional_config={"lmhead_tensor_parallel_size": 8}`
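For offline inference, a hedged usage sketch (assuming vLLM's `additional_config` engine argument is passed through to the `LLM` entry point as in recent releases; the model name is only an example):

```python
from vllm import LLM, SamplingParams

# Assumption: the chosen model's vocab_size is divisible by 8.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1",  # example model
    additional_config={"lmhead_tensor_parallel_size": 8},
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```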
### How was this patch tested?
- vLLM version: v0.10.1.1
- vLLM main:
vllm-project/vllm@de533ab
---------
Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: zhangzihang <zzh_201018@outlook.com>
`docs/source/user_guide/configuration/additional_config.md` (1 addition, 0 deletions):
```diff
@@ -34,6 +34,7 @@ The following table lists the additional configuration options available in vLLM
 |`enable_prefetch`| bool |`False`| Whether to enable weight prefetch. |
 |`kv_cache_dtype`| str |`None`| When using the kv cache quantization method, kv cache dtype needs to be set, currently only int8 is supported. |
 |`enable_shared_expert_dp`| bool |`False`| When the shared expert in DP, it has better performance but consumes more memory. Currently only DeepSeek series models are supported to use. |
+|`lmhead_tensor_parallel_size`| int |`None`| The custom tensor parallel size of lmhead. |
```