[Perf][MoE] Improve MoE multistream parallel performance. #1891
Conversation
17b02d0 to c58d29b
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@           Coverage Diff            @@
##             main    #1891      +/- ##
=========================================
- Coverage   73.85%   73.72%   -0.13%
=========================================
  Files         103      103
  Lines       11425    11450      +25
=========================================
+ Hits         8438     8442       +4
- Misses       2987     3008      +21
dispose_tensor, get_fused_moe_state)
...
def apply_mlp_decode(hidden_states_wrapper: List[torch.Tensor],
For the apply_mlp_decode function, the first parameter no longer needs to be an array. You can refer to existing apply_mlp implementations.
Thanks, I will change it.
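For readers following the thread, here is a minimal sketch of the two signatures under discussion. The bodies and function names (apply_mlp_decode_wrapped, apply_mlp_decode_plain) are hypothetical stand-ins, not the vllm-ascend implementation; the list wrapper is presumably there so the callee can pop the tensor and let the caller's activation be freed early, and the plain-tensor form drops that indirection.

```python
from typing import List

import torch


def apply_mlp_decode_wrapped(hidden_states_wrapper: List[torch.Tensor]) -> torch.Tensor:
    # Wrapper-list form: popping the tensor empties the caller's container,
    # so the input activation can be released as soon as it is consumed here.
    hidden_states = hidden_states_wrapper.pop()
    return hidden_states + 1  # stand-in for the real quantized MLP


def apply_mlp_decode_plain(hidden_states: torch.Tensor) -> torch.Tensor:
    # Plain-tensor form, matching the existing apply_mlp signature the
    # reviewer points to.
    return hidden_states + 1  # stand-in for the real quantized MLP
```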
Why was a new apply_mlp_decode function added? It seems essentially identical to the existing apply_mlp function, except for the split_item and output_dtype fields.
This function is added in order to introduce the fused op npu_dequant_swiglu_quant, which is only used in the decode phase.
We previously used npu_dequant_swiglu_quant in apply_mlp, triggered under certain conditions. @realliujiaxu knows the specific details; maybe discuss it with him?
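For context on what such a fused kernel buys, here is an unfused reference of the dequantize -> SwiGLU -> requantize pipeline in plain PyTorch. This is only an illustration with assumed shapes and per-token dynamic int8 quantization; it does not reflect the actual signature or numerics of torch_npu's npu_dequant_swiglu_quant.

```python
import torch
import torch.nn.functional as F


def dequant_swiglu_quant_reference(gemm_out_int32: torch.Tensor,
                                   weight_scale: torch.Tensor,
                                   per_token_scale: torch.Tensor):
    # 1) Dequantize the int32 GEMM accumulator using an assumed per-channel
    #    weight scale and per-token activation scale.
    x = gemm_out_int32.to(torch.float32) * weight_scale * per_token_scale

    # 2) SwiGLU: the last dimension is assumed to hold [gate, up] halves.
    gate, up = x.chunk(2, dim=-1)
    activated = F.silu(gate) * up

    # 3) Per-token dynamic re-quantization to int8 for the down projection.
    scale = activated.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 127.0
    quantized = torch.clamp(torch.round(activated / scale), -128, 127).to(torch.int8)
    return quantized, scale
```

Fusing these steps into one kernel avoids materializing the intermediate float tensor between the up and down projections, which is presumably where the decode-phase speedup comes from.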
log2phy: torch.Tensor = None,
global_redundant_expert_num: int = 0,
shared_experts: Optional[Any] = None,
quantized_x_for_share: Optional[Any] = None,
This comment was marked as resolved.
global_redundant_expert_num=global_redundant_expert_num,
shared_experts=shared_experts)
shared_experts=shared_experts,
quantized_x_for_share=shared_gate_up,
This comment was marked as resolved.
8441109 to c47b811
Better to rebase after #1653.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
Signed-off-by: whx-sjtu <2952154980@qq.com>
…t#1891) This PR designs the shared expert multi-stream parallelism of w8a8-dynamic-quantized MoE stage in more detail to achieve better performance. - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@2cc5711 Signed-off-by: whx-sjtu <2952154980@qq.com>
…t#1891) This PR designs the shared expert multi-stream parallelism of w8a8-dynamic-quantized MoE stage in more detail to achieve better performance. - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@2cc5711 Signed-off-by: whx-sjtu <2952154980@qq.com> Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
This PR refines the design of the shared expert multi-stream parallelism in the w8a8-dynamic-quantized MoE stage to achieve better performance.



The current multi-stream parallelism for shared experts is shown in the following picture (figure omitted).
Performance change:
Before: (figure omitted)
After: (figure omitted)
This PR ports PR #1561 of v0.9.1-dev to main.
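To make the overlap pattern described above concrete, here is a minimal sketch of running the shared expert on a secondary stream while the routed experts run on the main stream. routed_experts_fn and shared_expert_fn are hypothetical placeholders, and CUDA streams are used as a stand-in for the NPU streams actually used in vllm-ascend.

```python
import torch


def fused_moe_with_shared_expert(hidden_states: torch.Tensor,
                                 routed_experts_fn,
                                 shared_expert_fn) -> torch.Tensor:
    side_stream = torch.cuda.Stream()
    # The side stream must first wait for the producer of hidden_states.
    side_stream.wait_stream(torch.cuda.current_stream())

    # Shared expert runs on the side stream while the main stream handles
    # token dispatch, the routed expert GEMMs, and the combine step.
    with torch.cuda.stream(side_stream):
        shared_out = shared_expert_fn(hidden_states)

    routed_out = routed_experts_fn(hidden_states)

    # Join the streams before mixing the two partial results.
    torch.cuda.current_stream().wait_stream(side_stream)
    return routed_out + shared_out
```

The intent, per the description, is to hide the shared expert's computation behind the routed experts' dispatch, GEMM, and combine phases instead of leaving it on the critical path.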