
Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Oct 7, 2025

Purpose

Bump Flashinfer to v0.4.0.

Test Plan

Test Result



Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

I have reviewed the changes for bumping Flashinfer to v0.4.0rc4. The updates across the Dockerfiles and setup.py are consistent with the version bump. The necessary API adaptation in vllm/v1/attention/backends/flashinfer.py to support the new version of FlashInfer appears correct. I did not find any issues of high or critical severity. The changes look good to me.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

vllm/docker/Dockerfile

Lines 395 to 399 in 7cf8f24

if [[ "${CUDA_VERSION}" == 12.8.* ]] && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
# NOTE: To make new precompiled wheels, see tools/flashinfer-build.sh
echo "🏗️ Installing FlashInfer from pre-compiled wheel"
uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

P1: Update precompiled FlashInfer wheel version

The Docker build path for CUDA 12.8 on amd64 still installs flashinfer_python-0.3.1 while the rest of this commit bumps the dependency to v0.4.0rc4. Building an image through this branch will ship the old 0.3.1 wheel but the Python code now calls the 0.4 APIs (e.g. additional plan arguments), so the container will crash at runtime once those symbols are invoked. The precompiled wheel URL needs to be updated to the new version to keep the binary compatible with the codebase.


@elvischenv
Contributor Author

elvischenv commented Oct 7, 2025

@mgoin Do we have a pre-compiled Flashinfer 0.4.0rc4 wheel?

vllm/docker/Dockerfile

Lines 395 to 399 in c50901f

if [[ "${CUDA_VERSION}" == 12.8.* ]] && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
# NOTE: To make new precompiled wheels, see tools/flashinfer-build.sh
echo "🏗️ Installing FlashInfer from pre-compiled wheel"
uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

@mgoin
Member

mgoin commented Oct 7, 2025

@elvischenv not right now, I can make one tomorrow. For now just update the precompiled wheel condition to check for v0.3.1 as well, or skip

@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch 2 times, most recently from c87bd4c to 781d7b3 on October 7, 2025 01:50
@elvischenv
Contributor Author

@elvischenv not right now, I can make one tomorrow. For now just update the precompiled wheel condition to check for v0.3.1 as well, or skip

@mgoin Thanks! Added check for v0.3.1 for now.
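
A minimal sketch of what that kind of guard can look like is below; it assumes the pinned version is exposed as a variable (called FLASHINFER_VERSION here purely for illustration), and the exact change pushed to this branch may differ:

# Sketch only: reuse the pre-compiled wheel only when the pinned version is 0.3.1,
# otherwise skip this path and fall back to the regular FlashInfer install.
if [[ "${FLASHINFER_VERSION}" == "0.3.1" ]] \
        && [[ "${CUDA_VERSION}" == 12.8.* ]] \
        && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
    echo "🏗️ Installing FlashInfer from pre-compiled wheel"
    uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
        --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
else
    # No pre-compiled wheel exists for newer versions yet (e.g. 0.4.0rc4).
    echo "🏗️ Skipping pre-compiled FlashInfer wheel for ${FLASHINFER_VERSION}"
fi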

@mgoin mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 7, 2025
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from 781d7b3 to aa79a76 on October 7, 2025 05:51
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from aa79a76 to e314004 on October 7, 2025 07:34
    model,
    server_args,
-   max_wait_seconds=1000,  # Due to FlashInfer compile
+   max_wait_seconds=1500,  # Due to FlashInfer compile

False, # causal
window_left,
-1,
False,
Collaborator

Nit: Please add a comment for what -1 and False stand for.

Member

@mgoin mgoin left a comment

LGTM! Just pushed a fix for the renamed mxfp4 moe test

@mgoin
Member

mgoin commented Oct 7, 2025

@elvischenv It looks like the failure in lm-eval for deepseek-coder-v2-lite is related, PTAL https://buildkite.com/vllm/ci/builds/33858/steps/canvas?sid=0199bf8f-5df9-47f7-bf22-76ff5669357c

FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[DeepSeek-V2-Lite-Instruct-FP8-tp1] - AssertionError: Accuracy too low: 0.000 < 0.720 - 0.080
assert 0.0 >= (0.72 - 0.08)

Confirmed that this test was green last night https://buildkite.com/vllm/ci/builds/33777/steps/canvas?sid=0199bcd4-8cfb-452f-9053-9fba4c858dae

Ran the eval locally: it runs fine with 0.3.1 and crashes with 0.4.0rc4, consistently due to torch.OutOfMemoryError: CUDA out of memory.
Command:

vllm serve RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 --max-model-len 4096 --enforce-eager 
python tests/evals/gsm8k/gsm8k_eval.py

My crash was during MLA:

(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/attention/layer.py", line 695, in unified_attention_with_output
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     self.impl.forward(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1769, in forward
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     output[num_decode_tokens:] = self._forward_prefill(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]                                  ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1657, in _forward_prefill
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     context_output, context_lse = self._compute_prefill_context(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1483, in _compute_prefill_context
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     k = torch.cat((k_nope, k_pe.expand((*k_nope.shape[:-1], -1))), dim=-1)
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB. GPU 0 has a total capacity of 178.35 GiB of which 209.94 MiB is free. Including non-PyTorch memory, this process has 178.13 GiB memory in use. Of the allocated memory 177.09 GiB is allocated by PyTorch, and 227.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
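
A quick local mitigation, taken directly from the allocator message above (it only works around fragmentation and is not a fix for the 0.4.0rc4 regression itself):

# Fragmentation workaround suggested by the PyTorch OOM message, not a root-cause fix.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 --max-model-len 4096 --enforce-eager
# In a second shell, run the eval against the running server:
python tests/evals/gsm8k/gsm8k_eval.py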

@elvischenv
Contributor Author

@elvischenv It looks like the failure in lm-eval for deepseek-coder-v2-lite is related, PTAL https://buildkite.com/vllm/ci/builds/33858/steps/canvas?sid=0199bf8f-5df9-47f7-bf22-76ff5669357c

Hi @mgoin, this issue has been fixed on flashinfer main ToT. Let's wait for the next flashinfer release.

@yewentao256
Member

Hi @mgoin, this issue has been fixed on flashinfer main ToT. Let's wait for the next flashinfer release.

Hi @elvischenv, do you know when it will be released?

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from 109a01a to a05c432 on October 9, 2025 02:09
@elvischenv elvischenv changed the title from "Bump Flashinfer to v0.4.0rc4" to "Bump Flashinfer to v0.4.0" on Oct 9, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from a05c432 to 17e62c6 on October 9, 2025 02:10
@vllm-bot vllm-bot merged commit 5e49c3e into vllm-project:main Oct 9, 2025
83 of 86 checks passed
@elvischenv elvischenv deleted the elvischenv/update-flashinfer branch October 9, 2025 07:31
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
@bbrowning
Contributor

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

@elvischenv
Contributor Author

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

The results of gpt_oss.evals are still good after updating flashinfer to 0.4.0.
Do you have any repro steps? If it is actually caused by flashinfer 0.4.0, I could take a look.

@bbrowning
Contributor

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

The results of gpt_oss.evals are still good after updating flashinfer to 0.4.0. Do you have any repro steps? If it is actually caused by flashinfer 0.4.0, I could take a look.

I have not been able to reproduce this myself yet, as I don't use vLLM in containers or with flashinfer. It's not clear that flashinfer is the issue in the linked report, but it was at least suspected as the culprit because disabling it fixed an infinite generation loop with gpt-oss models, at least for that user.

@Steven0236

On the topic of weirdness after upgrading to flashinfer 0.4.0, I just reported a Qwen3-Next precision-loss issue with 0.4.0 (flashinfer-ai/flashinfer#1931). I presume the problem is within flashinfer. As for gpt-oss, it seems to work fine with tool calling on my setup.

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1


7 participants