Conversation

@Levi-JQ (Contributor) commented Nov 28, 2025

What this PR does / why we need it?

In CANN 8.3, the npu_moe_gating_top_k operator supports 384 experts, so Kimi can use this operator for better performance.

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: Levi-JQ <yujinqi2@huawei.com>
@github-actions
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by fulfilling the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an optimization for the 'kimi' model on CANN 8.3 by enabling a fused MoE gating kernel. The changes are applied across several files related to MoE, including different quantization paths. My review includes a critical comment about an inconsistency in one of the quantization files that could lead to incorrect behavior, and a high-severity comment about code duplication that impacts maintainability. Addressing these points will improve the robustness and clarity of the implementation.

Comment on lines +326 to +327
if global_num_experts == 256 or (global_num_experts == 384 and
torch.version.cann.startswith("8.3")):

critical

There's an inconsistency in how the model type is determined here. This file checks global_num_experts directly, while other files in this PR (e.g., torchair_w8a8_dynamic.py) check against the effective number of experts (global_num_experts - global_redundant_expert_num). This could lead to incorrect kernel selection if global_redundant_expert_num is non-zero, which would be a bug. The logic should be consistent across all files. The original logic if global_num_experts == 256: was also likely incorrect for the same reason.

Suggested change
if global_num_experts == 256 or (global_num_experts == 384 and
torch.version.cann.startswith("8.3")):
if (global_num_experts - global_redundant_expert_num == 256) or \
((global_num_experts - global_redundant_expert_num == 384) and torch.version.cann.startswith("8.3")):
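To make the inconsistency concrete, here is a minimal sketch contrasting the two checks; the helper names are illustrative, not from the codebase:

```python
# Illustrative sketch of the inconsistency the reviewer points out: with
# redundant experts configured, the raw count and the effective count
# diverge, so the two checks select different kernels.

def raw_check(global_num_experts: int, cann_version: str) -> bool:
    # Logic as written in this file: ignores redundant experts.
    return global_num_experts == 256 or (
        global_num_experts == 384 and cann_version.startswith("8.3"))

def effective_check(global_num_experts: int,
                    global_redundant_expert_num: int,
                    cann_version: str) -> bool:
    # Logic used elsewhere in the PR (e.g. torchair_w8a8_dynamic.py):
    # subtract the redundant experts before comparing.
    effective = global_num_experts - global_redundant_expert_num
    return effective == 256 or (
        effective == 384 and cann_version.startswith("8.3"))

# A hypothetical Kimi-style deployment with 16 redundant experts (384 + 16):
print(raw_check(400, "8.3"))            # False: raw count is 400
print(effective_check(400, 16, "8.3"))  # True: effective count is 384
```

With zero redundant experts the two checks agree, which is why the bug would only surface in deployments that configure expert redundancy.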

Comment on lines 182 to +185
is_deepseek_v3_r1 = global_num_experts - global_redundant_expert_num == 256
if is_deepseek_v3_r1:
is_kimi = global_num_experts - global_redundant_expert_num == 384
# NOTE: now npu_moe_gating_top_k can support `group_count=256` pattern, and `group_count=384` pattern in cann8.3
if is_deepseek_v3_r1 or (is_kimi and torch.version.cann.startswith("8.3")):

high

The logic to identify deepseek_v3_r1 and kimi models using magic numbers (256, 384), and the check for CANN version 8.3, is duplicated across multiple files (experts_selector.py, torchair_fused_moe.py, torchair_w8a8_dynamic.py, and torchair_w4a8_dynamic.py). This makes the code harder to maintain and increases the risk of inconsistencies when adding support for new models or CANN versions. Consider centralizing this logic into a helper function or a configuration object for better maintainability and readability.
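One possible shape for such a centralized helper, as a hedged sketch; the function name, signature, and comments are assumptions, not taken from the repository:

```python
# Sketch of a shared helper consolidating the magic-number checks that are
# currently duplicated across the MoE files. Names are illustrative; the
# real codebase may organize this differently.

def supports_fused_gating(global_num_experts: int,
                          global_redundant_expert_num: int,
                          cann_version: str) -> bool:
    """Return True if npu_moe_gating_top_k can serve this configuration."""
    effective = global_num_experts - global_redundant_expert_num
    is_deepseek_v3_r1 = effective == 256   # supported on all CANN versions
    is_kimi = effective == 384             # supported starting with CANN 8.3
    return is_deepseek_v3_r1 or (is_kimi and cann_version.startswith("8.3"))
```

Call sites in experts_selector.py, torchair_fused_moe.py, torchair_w8a8_dynamic.py, and torchair_w4a8_dynamic.py could then pass torch.version.cann, so a new model or CANN version only needs one edit.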
