
Conversation

@elvischenv
Contributor

@elvischenv elvischenv commented Oct 7, 2025

Purpose

Bump Flashinfer to v0.4.0.

Test Plan

Test Result



Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

I have reviewed the changes for bumping Flashinfer to v0.4.0rc4. The updates across the Dockerfiles and setup.py are consistent with the version bump. The necessary API adaptation in vllm/v1/attention/backends/flashinfer.py to support the new version of FlashInfer appears correct. I did not find any issues of high or critical severity. The changes look good to me.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

vllm/docker/Dockerfile

Lines 395 to 399 in 7cf8f24

if [[ "${CUDA_VERSION}" == 12.8.* ]] && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
# NOTE: To make new precompiled wheels, see tools/flashinfer-build.sh
echo "🏗️ Installing FlashInfer from pre-compiled wheel"
uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

P1: Update precompiled FlashInfer wheel version

The Docker build path for CUDA 12.8 on amd64 still installs flashinfer_python-0.3.1 while the rest of this commit bumps the dependency to v0.4.0rc4. Building an image through this branch will ship the old 0.3.1 wheel but the Python code now calls the 0.4 APIs (e.g. additional plan arguments), so the container will crash at runtime once those symbols are invoked. The precompiled wheel URL needs to be updated to the new version to keep the binary compatible with the codebase.


@elvischenv
Contributor Author

elvischenv commented Oct 7, 2025

@mgoin Do we have a pre-compiled Flashinfer 0.4.0rc4 wheel?

vllm/docker/Dockerfile

Lines 395 to 399 in c50901f

if [[ "${CUDA_VERSION}" == 12.8.* ]] && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
# NOTE: To make new precompiled wheels, see tools/flashinfer-build.sh
echo "🏗️ Installing FlashInfer from pre-compiled wheel"
uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
--extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')

@mgoin
Member

mgoin commented Oct 7, 2025

@elvischenv not right now, I can make one tomorrow. For now just update the precompiled wheel condition to check for v0.3.1 as well, or skip

@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch 2 times, most recently from c87bd4c to 781d7b3 on October 7, 2025 01:50
@elvischenv
Contributor Author

@elvischenv not right now, I can make one tomorrow. For now just update the precompiled wheel condition to check for v0.3.1 as well, or skip

@mgoin Thanks! Added check for v0.3.1 for now.
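
A minimal sketch of what that kind of guard can look like is below; it assumes the pinned version is exposed as a variable (called FLASHINFER_VERSION here purely for illustration), and the exact change pushed to this branch may differ:

# Sketch only: reuse the pre-compiled wheel only when the pinned version is 0.3.1,
# otherwise skip this path and fall back to the regular FlashInfer install.
if [[ "${FLASHINFER_VERSION}" == "0.3.1" ]] \
        && [[ "${CUDA_VERSION}" == 12.8.* ]] \
        && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
    echo "🏗️ Installing FlashInfer from pre-compiled wheel"
    uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
        --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
else
    # No pre-compiled wheel exists for newer versions yet (e.g. 0.4.0rc4).
    echo "🏗️ Skipping pre-compiled FlashInfer wheel for ${FLASHINFER_VERSION}"
fi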

@mgoin mgoin added the "ready" label (ONLY add when PR is ready to merge/full CI is needed) on Oct 7, 2025
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from 781d7b3 to aa79a76 on October 7, 2025 05:51
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from aa79a76 to e314004 on October 7, 2025 07:34
    model,
    server_args,
-   max_wait_seconds=1000,  # Due to FlashInfer compile
+   max_wait_seconds=1500,  # Due to FlashInfer compile

False, # causal
window_left,
-1,
False,
Collaborator

Nit: Please add a comment for what -1 and False stand for.

Member

@mgoin mgoin left a comment

LGTM! Just pushed a fix for the renamed mxfp4 moe test

@mgoin
Member

mgoin commented Oct 7, 2025

@elvischenv It looks like the failure in lm-eval for deepseek-coder-v2-lite is related, PTAL https://buildkite.com/vllm/ci/builds/33858/steps/canvas?sid=0199bf8f-5df9-47f7-bf22-76ff5669357c

FAILED evals/gsm8k/test_gsm8k_correctness.py::test_gsm8k_correctness_param[DeepSeek-V2-Lite-Instruct-FP8-tp1] - AssertionError: Accuracy too low: 0.000 < 0.720 - 0.080
assert 0.0 >= (0.72 - 0.08)

Confirmed that this test was green last night https://buildkite.com/vllm/ci/builds/33777/steps/canvas?sid=0199bcd4-8cfb-452f-9053-9fba4c858dae

Ran the eval locally: it runs fine with 0.3.1 and crashes with 0.4.0rc4, consistently due to torch.OutOfMemoryError: CUDA out of memory.
Command:

vllm serve RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 --max-model-len 4096 --enforce-eager 
python tests/evals/gsm8k/gsm8k_eval.py

My crash was during MLA:

(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/attention/layer.py", line 695, in unified_attention_with_output
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     self.impl.forward(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1769, in forward
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     output[num_decode_tokens:] = self._forward_prefill(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]                                  ^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1657, in _forward_prefill
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     context_output, context_lse = self._compute_prefill_context(
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]   File "/home/mgoin/code/vllm/vllm/v1/attention/backends/mla/common.py", line 1483, in _compute_prefill_context
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]     k = torch.cat((k_nope, k_pe.expand((*k_nope.shape[:-1], -1))), dim=-1)
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=3473358) ERROR 10-07 16:34:21 [core.py:780] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 294.00 MiB. GPU 0 has a total capacity of 178.35 GiB of which 209.94 MiB is free. Including non-PyTorch memory, this process has 178.13 GiB memory in use. Of the allocated memory 177.09 GiB is allocated by PyTorch, and 227.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
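
A quick local mitigation, taken directly from the allocator message above (it only works around fragmentation and is not a fix for the 0.4.0rc4 regression itself):

# Fragmentation workaround suggested by the PyTorch OOM message, not a root-cause fix.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
vllm serve RedHatAI/DeepSeek-Coder-V2-Lite-Instruct-FP8 --max-model-len 4096 --enforce-eager
# In a second shell, run the eval against the running server:
python tests/evals/gsm8k/gsm8k_eval.py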

@elvischenv
Contributor Author

@elvischenv It looks like the failure in lm-eval for deepseek-coder-v2-lite is related, PTAL https://buildkite.com/vllm/ci/builds/33858/steps/canvas?sid=0199bf8f-5df9-47f7-bf22-76ff5669357c

Hi @mgoin, this issue has been fixed on flashinfer main ToT. Let's wait for the next flashinfer release.

@yewentao256
Member

Hi @mgoin, this issue has been fixed on flashinfer main ToT. Let's wait for the next flashinfer release.

Hi @elvischenv, do you know when it will be released?

Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from 109a01a to a05c432 on October 9, 2025 02:09
@elvischenv elvischenv changed the title from "Bump Flashinfer to v0.4.0rc4" to "Bump Flashinfer to v0.4.0" on Oct 9, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
@elvischenv elvischenv force-pushed the elvischenv/update-flashinfer branch from a05c432 to 17e62c6 on October 9, 2025 02:10
@vllm-bot vllm-bot merged commit 5e49c3e into vllm-project:main Oct 9, 2025
83 of 86 checks passed
@elvischenv elvischenv deleted the elvischenv/update-flashinfer branch October 9, 2025 07:31
845473182 pushed a commit to dsxsteven/vllm_splitPR that referenced this pull request Oct 10, 2025
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 10, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
@bbrowning
Contributor

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

@elvischenv
Contributor Author

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

The results of gpt_oss.evals are still good after updating flashinfer to 0.4.0.
Do you have any repro steps? If it is actually caused by flashinfer 0.4.0, I could take a look.

@bbrowning
Contributor

Anecdotally, we're starting to see user reports of gpt-oss weirdness with flashinfer 0.4.0. An example is at #24954 (comment), linked here in case there is a pattern others see.

The results of gpt_oss.evals are still good after updating flashinfer to 0.4.0. Do you have any repro steps? If it is actually caused by flashinfer 0.4.0, I could take a look.

I have not been able to reproduce this myself yet, as I don't use vLLM in containers or with flashinfer. It's not clear that flashinfer is the issue in the linked report, but it was at least suspected as the culprit because disabling it fixed an infinite generation loop with gpt-oss models, at least for that user.

@Steven0236

On the topic of weirdness after upgrading to flashinfer 0.4.0, I just reported a Qwen3-Next precision-loss issue with 0.4.0 (flashinfer-ai/flashinfer#1931). I presume the problem is within flashinfer. As for gpt-oss, it seems to work fine with tool calling on my setup.

Dhruvilbhatt pushed a commit to Dhruvilbhatt/vllm that referenced this pull request Oct 14, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
lywa1998 pushed a commit to lywa1998/vllm that referenced this pull request Oct 20, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
alhridoy pushed a commit to alhridoy/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Oct 24, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: xuebwang-amd <xuebwang@amd.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
Signed-off-by: elvischenv <219235043+elvischenv@users.noreply.github.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

Labels

ci/build, ready (ONLY add when PR is ready to merge/full CI is needed), v1


7 participants