Fix CUDA permute/unpermute for use with DeepGemm Moe #17934
Conversation
Force-pushed from db26b2d to d71322b
Force-pushed from d71322b to 78b2480
Benchmark setup:
`python3 benchmarks/kernels/benchmark_moe.py --dtype=fp8_w8a8 -tp ${TP} --model deepseek-ai/DeepSeek-V3 --trust-remote-code --use-deep-gemm`
How to switch python permute/unpermute implementation
Performance in H100
Re-bench with triton kernel (unit: us)
Triton 3.1 -> Triton 3.3 significantly speeds up the triton fused moe kernel. I think deepgemm moe is slower than triton when M is small because deepgemm group gemm does not support skipping the invalid (padding) blocks introduced by vllm/vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py, lines 50 to 52 in d74e5f3.
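A rough back-of-the-envelope sketch of that effect (the block size, expert count, and even-routing assumption below are illustrative, not vLLM's actual values): when each expert's row count is padded up to the group-GEMM block size, at small M almost every row the grouped GEMM computes is padding that a kernel able to skip invalid blocks would never touch.

```python
import math

def padded_ratio(num_tokens: int, topk: int = 8, num_experts: int = 256,
                 block_m: int = 128) -> float:
    """Fraction of group-GEMM rows that are padding, assuming even routing."""
    rows_per_expert = num_tokens * topk / num_experts
    padded_rows = num_experts * block_m * math.ceil(rows_per_expert / block_m)
    return 1.0 - (num_tokens * topk) / padded_rows

# At small M nearly every row is padding; at large M the overhead vanishes.
print(padded_ratio(16))    # ~0.996
print(padded_ratio(4096))  # 0.0
```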
Force-pushed from 78b2480 to 4dc1a00
Force-pushed from e20416d to 3f967f5
Re-bench after #15956 (unit: us)
I only integrate the customized permute kernel into the modularized fused moe, because the customized unpermute kernel needs the extra parameter inv_perm. That would break the modularized interface in
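For context, a minimal sketch of what inv_perm is for (tensor names, shapes, and the pure-PyTorch implementation are illustrative, not the actual vLLM kernel signatures): permute gathers token copies into expert-contiguous order for the grouped GEMM, and unpermute needs the inverse mapping to scatter results back before the top-k weighted reduction, which is why it can't share the simpler interface.

```python
import torch

def permute(hidden: torch.Tensor, topk_ids: torch.Tensor):
    # hidden: [num_tokens, hidden_dim], topk_ids: [num_tokens, topk]
    num_tokens, topk = topk_ids.shape
    # Sort flattened (token, k) slots by expert id so tokens routed to the
    # same expert become contiguous rows for the grouped GEMM.
    sorted_idx = torch.argsort(topk_ids.flatten(), stable=True)
    permuted = hidden.repeat_interleave(topk, dim=0)[sorted_idx]
    # inv_perm maps each original flattened slot to its row in `permuted`.
    inv_perm = torch.argsort(sorted_idx)
    return permuted, inv_perm

def unpermute(permuted_out: torch.Tensor, inv_perm: torch.Tensor,
              topk_weights: torch.Tensor):
    num_tokens, topk = topk_weights.shape
    # Restore original (token, k) order, then apply the top-k weighted sum.
    out = permuted_out[inv_perm].view(num_tokens, topk, -1)
    return (out * topk_weights.unsqueeze(-1)).sum(dim=1)
```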
Force-pushed from 22a9f55 to 5b24c2c
Can moe_permute get topk internally so that it doesn't need to be passed in as a separate argument?
OK, I'll get topk from topk_ids.
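For reference, a minimal sketch of what deriving topk from topk_ids means (the tensor values are illustrative): the routing width is just the second dimension of the top-k id tensor, so it doesn't need to be passed as a separate argument.

```python
import torch

# topk_ids has shape [num_tokens, topk]; the wrapper can recover topk
# from the shape instead of taking it as an extra parameter.
topk_ids = torch.randint(0, 256, (32, 8), dtype=torch.int32)
topk = topk_ids.size(1)  # == 8
```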
Can the scale permutation also be done by the CUDA kernel?
Then it seems permuted_idx would not be required?
The scale permutation has a small cost, so there is no significant performance gain from fusing it into the CUDA kernel. I'd rather keep it in Python for flexibility, and remove permuted_idx from the return value.
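A minimal sketch of keeping the scale permutation in Python (the names a1q_scale and sorted_src_rows are illustrative assumptions, not the PR's actual variables): the per-token quantization scales are tiny compared to the activations, so reordering them with a simple gather outside the CUDA kernel is cheap.

```python
import torch

num_tokens, topk, scale_groups = 32, 8, 1
a1q_scale = torch.rand(num_tokens, scale_groups)      # per-token fp8 scales
sorted_src_rows = torch.randperm(num_tokens * topk)   # row mapping from the permute step

# Gather scales into the same expert-contiguous order as the permuted tokens;
# each permuted row originated from token (flat_index // topk).
permuted_scale = a1q_scale[sorted_src_rows // topk]
```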
LGTM. Just had a few minor comments.
Force-pushed from 5b24c2c to 759cd54
@bnellnm I updated the code per your review. I noticed the latest …
@CalebDu what's the difference between the … Is it basically …
LGTM! Thanks @CalebDu
Perf differences come from …
Force-pushed from 759cd54 to 62b4abb
Force-pushed from 54de5d3 to 8e17c51
CI has 3 failures: Distributed Tests (4 GPUs), TPU V1 Test, and Entrypoints Test (API Server). They seem unrelated to this PR according to the CI logs.
@CalebDu I have triggered a retry on the failed tests. What is the commit on …
I verified that commit …
What should I update in the PR description and title? Describing that …
Since this PR is not doing any integration, …
Done
[1] Refactor the permute/unpermute kernel: remove the token_expert_indices input dependence and align the output to the triton kernel.
[2] Integrate the permute/unpermute kernel into deepgemm moe.
[3] Fix the clamp that causes a wrong argsort result, cc @bnellnm
vllm/vllm/model_executor/layers/fused_moe/deep_gemm_moe.py, lines 84 to 86 in 7042cc9
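A hedged illustration of the failure mode [3] refers to (the actual deep_gemm_moe.py code differs; the -1 sentinel and the remedy below are assumptions about the mechanism, not the PR's exact fix): if out-of-range expert ids are clamped into the valid range before the argsort that builds the permutation, those slots get sorted in with a real expert and corrupt both the ordering and the per-expert row counts.

```python
import torch

num_experts = 4
# -1 marks slots that should be ignored (e.g. expert not handled on this rank).
topk_ids = torch.tensor([2, -1, 0, 3, -1, 1])

# Clamping before sorting folds the invalid slots into expert 0, so the
# permutation and per-expert counts come out wrong.
bad_order = torch.argsort(topk_ids.clamp(min=0))
bad_counts = torch.bincount(topk_ids.clamp(min=0), minlength=num_experts)

# Mapping invalid slots past the last real expert keeps them out of the way:
# they sort to the end and never pollute a real expert's rows.
safe_ids = torch.where(topk_ids < 0, torch.full_like(topk_ids, num_experts), topk_ids)
good_order = torch.argsort(safe_ids)
good_counts = torch.bincount(safe_ids, minlength=num_experts + 1)[:num_experts]
```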
I will attach performance data in H100 later.