use npu_moe_gating_top_k_softmax #1355
Conversation
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Codecov Report

@@            Coverage Diff             @@
##             main    #1355      +/-   ##
===========================================
+ Coverage   27.39%   54.54%   +27.15%
===========================================
  Files          56       80       +24
  Lines        6191     9980     +3789
===========================================
+ Hits         1696     5444     +3748
- Misses       4495     4536       +41

View full report in Codecov by Sentry.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
@ganyi1996ppo Please help review.
Your screenshot seems to contain only host time; can you paste the device time of this kernel too?
# value to False to disable the optimized model.
"USE_OPTIMIZED_MODEL":
lambda: bool(int(os.getenv('USE_OPTIMIZED_MODEL', '1'))),
"SELECT_GATING_TOPK_SOTFMAX_EXPERTS":
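For context, the env-var pattern in this diff parses a `'0'`/`'1'` string into a boolean. A minimal standalone sketch of that pattern (the helper function is illustrative, not the project's actual config module):

```python
import os

def env_flag(name: str, default: str = "0") -> bool:
    """Parse a '0'/'1' environment variable into a bool.

    Mirrors the lambda pattern in the diff above; the variable name used
    below comes from the diff, but this helper itself is hypothetical.
    """
    return bool(int(os.getenv(name, default)))

os.environ["SELECT_GATING_TOPK_SOTFMAX_EXPERTS"] = "1"
print(env_flag("SELECT_GATING_TOPK_SOTFMAX_EXPERTS"))  # → True
```

Note that `bool(int(...))` raises `ValueError` on non-numeric values, so only `'0'`/`'1'` (and other integers) are accepted.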
I don't think we should add more config. Instead, how about checking the model type to decide which function will be called?
select_gating_top_k_softmax_experts is theoretically better than select_experts in the non-quantized case.
As discussed offline, we'll remove this env var once the function is stable enough.
Signed-off-by: ttanzhiqiang <389825161@qq.com>
Force-pushed from 2314772 to 77d1b16.
This pull request has conflicts, please resolve those before we can evaluate the pull request.
topk_weights, topk_ids, row_idx = torch_npu.npu_moe_gating_top_k_softmax(
    router_logits, None, k=top_k)

# # Required by npu_moe_init_routing
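For readers without NPU hardware, here is a pure-Python reference of the softmax-then-top-k semantics the fused kernel implements. This is a sketch under simplifying assumptions: `router_logits` is a list of per-token logit lists, and the kernel's additional `row_idx` output is omitted.

```python
import math

def gating_top_k_softmax(router_logits, k):
    """Reference semantics for fused MoE gating: per-token softmax over
    expert logits, then select the k largest probabilities.

    Returns (topk_weights, topk_ids) per token; a hypothetical stand-in
    for what torch_npu.npu_moe_gating_top_k_softmax computes on device.
    """
    topk_weights, topk_ids = [], []
    for logits in router_logits:
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        order = sorted(range(len(probs)), key=lambda i: probs[i],
                       reverse=True)[:k]
        topk_ids.append(order)
        topk_weights.append([probs[i] for i in order])
    return topk_weights, topk_ids

weights, ids = gating_top_k_softmax([[1.0, 3.0, 2.0, 0.5]], k=2)
print(ids)  # → [[1, 2]]
```

The NPU kernel fuses these steps into one launch, which is where the latency win reported in this PR comes from.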
Please remove the commented-out code.
…LECT_GATING_TOPK_SOTFMAX_EXPERTS (#2112) backport of v0.9.1-dev: #1902 origin main npu_moe_gating_top_k_softmax: #1355 - vLLM version: v0.10.0 - vLLM main: vllm-project/vllm@055bd39 Signed-off-by: huangxialu <huangxialu1@huawei.com>
### What this PR does / why we need it? For non-deepseek models, select_experts is optimized by replacing the separate softmax+topk+to ops with the fused gating_topk_softmax kernel, reducing the kernel time from 37us to 14us on bf16/fp16 for qwen3-235b. - vLLM version: v0.9.2 - vLLM main: vllm-project/vllm@1a4f35e --------- Signed-off-by: ttanzhiqiang <389825161@qq.com>


What this PR does / why we need it?
For non-deepseek models, select_experts is optimized by replacing the separate softmax+topk+to ops with the fused gating_topk_softmax kernel, reducing the kernel time from 37us to 14us on bf16/fp16 for qwen3-235b.
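Why the fused swap is safe can be illustrated in plain Python (a sketch only; the real kernels operate on NPU tensors): because softmax is strictly monotonic within a row, ranking the post-softmax probabilities selects the same experts as ranking the raw logits, so a fused kernel is free to reorder the work.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def top_k_ids(values, k):
    # Indices of the k largest values, in descending order.
    return sorted(range(len(values)), key=lambda i: values[i],
                  reverse=True)[:k]

logits = [0.2, 1.7, -0.4, 0.9]
# Separate-op pipeline: full softmax, then top-k on the probabilities.
separate = top_k_ids(softmax(logits), k=2)
# A fused kernel may rank the raw logits directly and normalize only
# the selected entries, since softmax preserves per-row ordering.
fused = top_k_ids(logits, k=2)
print(separate == fused)  # → True
```

The latency gain then comes from avoiding intermediate tensors and extra kernel launches, not from changing which experts are selected.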


Does this PR introduce any user-facing change?
How was this patch tested?