Fix Llama4 FlashInfer FP4 MoE issues #22511
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This pull request introduces necessary fixes to enable Llama4 models with FlashInfer FP4 MoE. The changes correctly remove assertions that previously blocked the use of apply_router_weight_on_input, which is required for Llama4. Additionally, the pull request robustly handles cases where expert grouping is not used by providing a default value of 0 for group-related parameters passed to the FlashInfer kernel, preventing potential runtime errors. The changes are well-targeted and appear correct.
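To illustrate the group-parameter fix, here is a minimal sketch, assuming a helper of this shape (the function and argument names are hypothetical, not the actual vLLM internals): when expert grouping is disabled these values arrive as None, and converting them to 0 keeps the kernel's integer arguments valid.

```python
from typing import Optional, Tuple

def _group_args_for_kernel(
    num_expert_group: Optional[int],
    topk_group: Optional[int],
) -> Tuple[int, int]:
    """Return integer group parameters for the FlashInfer kernel.

    When expert grouping is unused, the config carries None here;
    passing None into an integer kernel argument would fail, so
    default both values to 0 instead.
    """
    return (num_expert_group or 0, topk_group or 0)

# Example: Llama4 does not use expert grouping.
assert _group_args_for_kernel(None, None) == (0, 0)
```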
FlashInfer cutlass FP4 MoE already supports Llama4, so remove unnecessary asserts and set group to 0 when not used. Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Purpose
FlashInfer cutlass FP4 MoE already supports Llama4, so remove unnecessary asserts and set group to 0 when not used.
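For context on the removed asserts, here is a minimal sketch of what apply_router_weight_on_input means, assuming top-1 routing as in Llama4 (illustrative only, not the vLLM kernel code): the router weight is folded into the activations before the expert GEMMs rather than scaling the expert outputs afterwards, which is only equivalent when each token is routed to a single expert.

```python
import torch

def scale_input_by_router_weight(
    hidden_states: torch.Tensor,  # [num_tokens, hidden_dim]
    topk_weights: torch.Tensor,   # [num_tokens, top_k]
) -> torch.Tensor:
    # Folding the router weight into the input is only valid for top-1
    # routing: with top_k > 1 each expert would need a different scale.
    assert topk_weights.shape[1] == 1
    return hidden_states * topk_weights.to(hidden_states.dtype)

x = torch.randn(4, 8)
w = torch.rand(4, 1)
assert torch.allclose(scale_input_by_router_weight(x, w), x * w)
```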
Test Plan
Llama4 Scout FP4, TP2, concurrency 128, gsm8k accuracy test; a hedged reproduction sketch follows.
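A reproduction along these lines might look like the following, using lm-eval's vLLM backend (the model id and engine settings are assumptions for illustration, not taken from the PR):

```python
from lm_eval import simple_evaluate

# Assumed checkpoint id and settings: TP=2 and max_num_seqs=128 to
# mirror the stated "TP2 concurrency 128" configuration.
results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,"
        "tensor_parallel_size=2,"
        "max_num_seqs=128"
    ),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```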
Test Result
(Optional) Documentation Update