
Conversation

@whx-sjtu (Collaborator) commented on Jul 1, 2025:

This PR refines the shared-expert multi-stream parallelism of the w8a8-dynamic-quantized MoE stage to achieve better performance.
The current multi-stream parallel schedule for shared experts is shown in the following picture:
[image: multi-stream parallel schedule for shared experts]
Performance change:
Before: [image]
After: [image]
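
For readers unfamiliar with the pattern, here is a minimal sketch of overlapping the shared-expert MLP with the routed-expert computation on a side stream. It assumes the CUDA-like stream API that torch_npu exposes under `torch.npu`; the function name and the callables `routed_experts` / `shared_expert` are hypothetical stand-ins, and the real kernel splits the overlap at a much finer granularity than shown here.

```python
import torch
import torch_npu  # noqa: F401  # patches in the CUDA-like torch.npu stream API

_side_stream = torch.npu.Stream()

def moe_forward_with_shared_overlap(hidden_states, routed_experts, shared_expert):
    """Sketch: run the shared expert on a side stream while the routed
    (w8a8-dynamic-quantized) experts run on the default stream.
    `routed_experts` and `shared_expert` are hypothetical callables."""
    # The side stream must observe all prior writes to hidden_states.
    _side_stream.wait_stream(torch.npu.current_stream())
    with torch.npu.stream(_side_stream):
        shared_out = shared_expert(hidden_states)
    # Routed-expert grouped GEMMs proceed concurrently on the default stream.
    routed_out = routed_experts(hidden_states)
    # Rejoin before the two partial outputs are combined.
    torch.npu.current_stream().wait_stream(_side_stream)
    return routed_out + shared_out
```

The win comes from the shared expert being a dense MLP that would otherwise serialize behind the routed experts' dispatch and grouped-GEMM phases.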

Signed-off-by: whx-sjtu <2952154980@qq.com>
ganyi1996ppo merged commit 65909b2 into vllm-project:v0.9.1-dev on Jul 3, 2025 (16 checks passed).
Yikun added the no-main label on Jul 14, 2025.
A review thread was opened on the following weight post-processing snippet from the diff:

```python
# (Previous line truncated in the excerpt; it closes a format-cast call
#  converting w2_weight to the NZ layout.)
                layer.w2_weight.data, ACL_FORMAT_FRACTAL_NZ)
# Flatten the w13 scale to two dimensions and keep a pre-cast fp32 copy.
layer.w13_weight_scale.data = layer.w13_weight_scale.data.view(
    layer.w13_weight_scale.data.shape[0], -1)
layer.w13_weight_scale_fp32 = layer.w13_weight_scale.data.to(torch.float32)
```
A contributor asked:
Why does w13_weight_scale need to be converted to fp32?
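
The thread records no answer, but one plausible reason (an assumption, not confirmed here) is that the per-channel scale is applied to an int32 GEMM accumulator during dequantization, and pre-casting it once at weight-load time keeps full precision while avoiding a repeated dtype conversion in the hot path. A hypothetical sketch of that dequant step:

```python
import torch

def dequant_w8a8_dynamic(acc_int32: torch.Tensor,
                         w_scale_fp32: torch.Tensor,
                         x_scale_fp32: torch.Tensor) -> torch.Tensor:
    """Hypothetical dequant of an int8 x int8 GEMM output.
    acc_int32:    [tokens, out_dim] int32 accumulator
    w_scale_fp32: [out_dim] per-channel weight scale (pre-cast to fp32)
    x_scale_fp32: [tokens, 1] per-token activation scale from dynamic quant
    """
    # fp32 scales preserve precision when rescaling the int32 accumulator.
    return acc_int32.to(torch.float32) * w_scale_fp32 * x_scale_fp32
```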

whx-sjtu deleted the moe_ms_091 branch on October 20, 2025 at 11:50.
