
Conversation

@nvpohanh (Contributor) commented Aug 8, 2025

FlashInfer cutlass FP4 MoE already supports Llama4, so remove the unnecessary asserts and set group to 0 when not used.
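For context, a minimal sketch of the routing-weight handling this change unblocks. Names and shapes here are illustrative assumptions, not the actual vLLM code: Llama4 applies the router weight to the MoE input rather than to the expert outputs, so a hard assert against apply_router_weight_on_input made this path unusable.

```python
import torch

def moe_prologue(
    hidden_states: torch.Tensor,  # [num_tokens, hidden_dim]
    topk_weights: torch.Tensor,   # [num_tokens, top_k]
    apply_router_weight_on_input: bool = False,
) -> torch.Tensor:
    # Previously this path asserted `not apply_router_weight_on_input`,
    # which rejected Llama4. Instead, fold the router weight into the input.
    if apply_router_weight_on_input:
        # Scaling the input only makes sense when each token is routed to a
        # single expert; otherwise per-expert weights would differ.
        assert topk_weights.shape[1] == 1, "requires top_k == 1"
        hidden_states = hidden_states * topk_weights.to(hidden_states.dtype)
    return hidden_states

# Tiny smoke test: top_k == 1, as in Llama4's router.
x = torch.randn(4, 8)
w = torch.rand(4, 1)
assert moe_prologue(x, w, apply_router_weight_on_input=True).shape == x.shape
```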

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Enable Llama4 with the FlashInfer cutlass FP4 MoE path: remove the unnecessary asserts and default the unused group parameters to 0.

Test Plan

GSM8K accuracy test with Llama4 Scout FP4, TP=2, at concurrency 128.
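A hedged sketch of one way to reproduce this run with the lm-evaluation-harness Python API, using the local-completions model args reported under Test Result below. It assumes a vLLM server is already serving the FP4 checkpoint on port 8080; simple_evaluate is lm-eval's public entry point, though argument defaults can vary across versions.

```python
import lm_eval

# Mirrors the local-completions configuration shown in the results header.
results = lm_eval.simple_evaluate(
    model="local-completions",
    model_args=(
        "base_url=http://0.0.0.0:8080/v1/completions,"
        "model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,"
        "tokenized_requests=False,tokenizer_backend=None,"
        "num_concurrent=128,timeout=120,max_retries=5"
    ),
    tasks=["gsm8k"],  # gsm8k defaults to 5-shot, matching the table below
    batch_size=1,
)
print(results["results"]["gsm8k"])  # exact_match, flexible and strict extraction
```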

Test Result

local-completions (base_url=http://0.0.0.0:8080/v1/completions,model=nvidia/Llama-4-Scout-17B-16E-Instruct-FP4,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9060|±  |0.0080|
|     |       |strict-match    |     5|exact_match|↑  |0.8901|±  |0.0086|

(Optional) Documentation Update


github-actions bot commented Aug 8, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which runs a small, essential subset of CI tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify bot added the llama (Related to Llama models) label Aug 8, 2025
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces necessary fixes to enable Llama4 models with FlashInfer FP4 MoE. The changes correctly remove assertions that previously blocked the use of apply_router_weight_on_input, which is required for Llama4. Additionally, the pull request robustly handles cases where expert grouping is not used by providing a default value of 0 for group-related parameters passed to the FlashInfer kernel, preventing potential runtime errors. The changes are well-targeted and appear correct.
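As an illustration of the second point, a small sketch of the None-to-0 defaulting pattern the review describes; the wrapper name and signature here are hypothetical, not FlashInfer's actual API:

```python
from typing import Optional

def prepare_group_args(num_expert_group: Optional[int],
                       topk_group: Optional[int]) -> tuple[int, int]:
    # The FP4 MoE kernel expects plain integers. When expert grouping is
    # unused the caller passes None, so default to 0 instead of asserting
    # that both values are set.
    return (num_expert_group if num_expert_group is not None else 0,
            topk_group if topk_group is not None else 0)

assert prepare_group_args(None, None) == (0, 0)  # grouping not used
assert prepare_group_args(8, 4) == (8, 4)        # grouped expert routing
```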

@nvpohanh force-pushed the dev/nvpohanh/llama4-fp4-moe-fix branch from acbae8e to 33991a8 on August 8, 2025 09:07
@nvpohanh force-pushed the dev/nvpohanh/llama4-fp4-moe-fix branch from 33991a8 to 53de639 on August 11, 2025 02:54
@nvpohanh mentioned this pull request Aug 11, 2025
@mgoin enabled auto-merge (squash) August 11, 2025 16:28
github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 11, 2025
FlashInfer cutlass FP4 MoE already supports Llama4, so remove unnecessary
asserts and set group to 0 when not used.

Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
auto-merge was automatically disabled August 12, 2025 06:25

Head branch was pushed to by a user without write access

@nvpohanh force-pushed the dev/nvpohanh/llama4-fp4-moe-fix branch from 53de639 to 194bc87 on August 12, 2025 06:25
vllm-bot merged commit 67c153b into vllm-project:main Aug 12, 2025
45 of 49 checks passed
paulpak58 pushed a commit to paulpak58/vllm that referenced this pull request Aug 13, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Paul Pak <paulpak58@gmail.com>
diegocastanibm pushed a commit to diegocastanibm/vllm that referenced this pull request Aug 15, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Diego-Castan <diego.castan@ibm.com>
yiliu30 pushed a commit to yiliu30/vllm-fork that referenced this pull request Aug 19, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
epwalsh pushed a commit to epwalsh/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
xiao-llm pushed a commit to xiao-llm/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
Signed-off-by: Xiao Yu <xiao.yu@amd.com>
zhewenl pushed a commit to zhewenl/vllm that referenced this pull request Aug 28, 2025
Signed-off-by: Po-Han Huang <pohanh@nvidia.com>
