Bump Flashinfer to v0.4.0 #26326
Conversation
Code Review
I have reviewed the changes for bumping Flashinfer to v0.4.0rc4. The updates across the Dockerfiles and setup.py are consistent with the version bump. The necessary API adaptation in vllm/v1/attention/backends/flashinfer.py to support the new version of FlashInfer appears correct. I did not find any issues of high or critical severity. The changes look good to me.
💡 Codex Review
Lines 395 to 399 in 7cf8f24
if [[ "${CUDA_VERSION}" == 12.8.* ]] && [ "$TARGETPLATFORM" = "linux/amd64" ]; then
    # NOTE: To make new precompiled wheels, see tools/flashinfer-build.sh
    echo "🏗️ Installing FlashInfer from pre-compiled wheel"
    uv pip install --system https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl \
        --extra-index-url ${PYTORCH_CUDA_INDEX_BASE_URL}/cu$(echo $CUDA_VERSION | cut -d. -f1,2 | tr -d '.')
The Docker build path for CUDA 12.8 on amd64 still installs flashinfer_python-0.3.1, while the rest of this commit bumps the dependency to v0.4.0rc4. An image built from this branch will ship the old 0.3.1 wheel, but the Python code now calls the 0.4 APIs (e.g. additional plan arguments), so the container will crash at runtime once those symbols are invoked. The precompiled wheel URL needs to be updated to the new version to keep the binary compatible with the codebase.
@mgoin Do we have a pre-compiled FlashInfer 0.4.0rc4 wheel? (Lines 395 to 399 in c50901f)
@elvischenv Not right now; I can make one tomorrow. For now, just update the precompiled-wheel condition to check for v0.3.1 as well, or skip it.
@mgoin Thanks! Added a check for v0.3.1 for now.
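For reference, a minimal sketch of what such a guard could look like in the Dockerfile install step. This is an illustration only, not the exact change in this PR: the FLASHINFER_VERSION variable name and the source-build fallback are assumptions.

    # Sketch only: take the precompiled-wheel fast path only when the pinned
    # FlashInfer version still matches the wheel that was built (0.3.1), so the
    # old wheel is never shipped alongside code that expects the 0.4.x APIs.
    if [[ "${CUDA_VERSION}" == 12.8.* ]] \
        && [ "$TARGETPLATFORM" = "linux/amd64" ] \
        && [[ "${FLASHINFER_VERSION:-}" == 0.3.1* ]]; then
        echo "Installing FlashInfer from pre-compiled wheel"
        uv pip install --system \
            https://wheels.vllm.ai/flashinfer-python/flashinfer_python-0.3.1-cp39-abi3-manylinux1_x86_64.whl
    else
        echo "Building FlashInfer from source"
        # fall back to the source build so newer FlashInfer APIs are available
    fi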
    model,
    server_args,
-   max_wait_seconds=1000,  # Due to FlashInfer compile
+   max_wait_seconds=1500,  # Due to FlashInfer compile
The nightly run took about 13 min to pass:
https://buildkite.com/vllm/ci/builds/33777/steps/canvas?sid=0199bcd4-8cfa-440e-82e2-568249d7e5f5#0199bcd4-8ddc-4310-b617-e51d72fa6264/87-1271
With the new FlashInfer (this PR), it failed after 16 min while loading cubins:
https://buildkite.com/vllm/ci/builds/33787/steps/canvas?sid=0199bd3a-b391-4c5a-8bae-e778a6a71916#0199bd3a-b4ab-4dd6-b108-d0e3659e9a33/90-1150
Try extending the timeout from 1000 s (~16 min) to 1500 s (25 min).
After extending the timeout, it seems it took 17 min to complete:
https://buildkite.com/vllm/ci/builds/33799/steps/canvas?sid=0199bd99-315d-4e83-9c07-a0866e8dce1d#0199bd99-3382-426a-9967-c454fce447c4/100-1208
    False,  # causal
    window_left,
    -1,
    False,
Nit: Please add a comment for what -1 and False stand for.
LGTM! Just pushed a fix for the renamed mxfp4 moe test
@elvischenv It looks like the failure in lm-eval for deepseek-coder-v2-lite is related, PTAL:
https://buildkite.com/vllm/ci/builds/33858/steps/canvas?sid=0199bf8f-5df9-47f7-bf22-76ff5669357c
Confirmed that this test was green last night:
https://buildkite.com/vllm/ci/builds/33777/steps/canvas?sid=0199bcd4-8cfb-452f-9053-9fba4c858dae
Ran the eval locally fine with 0.3.1 and crashing with 0.4.0rc4. It seems consistently to be due to
My crash was during mla
Hi @mgoin, this issue has been fixed on FlashInfer main ToT. Let's wait for the next FlashInfer release.
Hi @elvischenv, do you know when it will be released?
Anecdotally, we're starting to see some reports of gpt-oss weirdness with FlashInfer 0.4.0 from users. One example is #24954 (comment), which I'll link here in case others see a pattern.
I have not been able to reproduce this myself yet, as I don't use vLLM within containers or with FlashInfer. It's not clear that FlashInfer is the issue in the linked report, but it was at least suspected as the culprit because disabling it fixed an infinite generation loop with gpt-oss models, at least for that user.
On the topic of weirdness after upgrading to FlashInfer 0.4.0: I just reported an issue with Qwen3-Next precision loss (flashinfer-ai/flashinfer#1931). I presume the problem is within FlashInfer. As for gpt-oss, it seems to work fine with tool calling on my setup.
Purpose
Bump Flashinfer to v0.4.0.
Test Plan
Test Result
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.