[UX] Add FlashInfer as default CUDA dependency #26443
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request aims to make FlashInfer a default CUDA dependency, which is a great step for user experience. My main concern is with the change in `vllm/utils/flashinfer.py`, which could negatively impact users in environments without the full CUDA toolkit. Please see my detailed comment.
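For context, the nvcc gate under discussion behaves roughly like the sketch below. This is a minimal approximation, not the PR's literal diff: `has_flashinfer()` is the real helper named in the PR description, while `has_nvcc` and the `shutil.which` probe are assumptions.

```python
# Minimal sketch of nvcc-gated FlashInfer detection (assumed shape, not the
# literal diff). has_flashinfer is named in the PR; has_nvcc is hypothetical.
import importlib.util
import shutil

def has_nvcc() -> bool:
    # True only when the CUDA compiler driver is on PATH.
    return shutil.which("nvcc") is not None

def has_flashinfer() -> bool:
    # Conservative: require both the package and a JIT-capable toolchain,
    # even though an AOT-compiled wheel would not actually need nvcc.
    return importlib.util.find_spec("flashinfer") is not None and has_nvcc()
```

In an environment that ships only precompiled kernels and no CUDA toolkit, this conservative condition returns False, which is the regression the review is flagging.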
💡 Codex Review
Here are some automated review suggestions for this pull request.
Do we still need the FlashInfer compilation at all? FlashInfer as of today releases a nightly wheel with full precompiled cubins.
@bbartels We will start using the precompiled wheels (
That's fair; it might not be a good idea to require them for the vllm wheel. But I could see it being worth including them by default in the Dockerfile, since that would unblock moving back to the substantially smaller NVIDIA base image.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256 left a comment:
LGTM, thanks for the work!
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
… (vllm-project#27990) Summary: vllm-project#26443 adds a check for the availability of `nvcc` as a condition for enabling FlashInfer MoE. On devgpus we may have `nvcc`, so there is no issue; but in tw jobs there is no `nvcc`, so FlashInfer MoE is disabled. Differential Revision: D86104899 Signed-off-by: Xiaozhu <mxz297@gmail.com>
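A plausible shape for that follow-up fix, sketched under assumptions (the helper name and the AOT module names are hypothetical; only the nvcc-gating behavior comes from the summary above): require `nvcc` only for the JIT path, and accept AOT artifacts without it.

```python
# Hypothetical relaxation (not the actual vllm-project#27990 diff): only demand
# nvcc when kernels must be JIT-compiled; accept AOT artifacts without it.
import importlib.util
import shutil

def has_flashinfer_moe() -> bool:
    if importlib.util.find_spec("flashinfer") is None:
        return False
    # AOT kernels (assumed module names for the cubin/jit-cache wheels) need no nvcc.
    aot_present = any(
        importlib.util.find_spec(mod) is not None
        for mod in ("flashinfer_cubin", "flashinfer_jit_cache")
    )
    return aot_present or shutil.which("nvcc") is not None
```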
Purpose
It seems that FlashInfer does not require
nvccto installed from source anymore sinceflashinfer-python>=0.2.9, so we can move it to be a default dependency!Obviously to have it JIT compile kernels it needs to have
nvccavailable, so we will currently add that condition tohas_flashinfer(). This is more conservative than it needs to be, as if we have an AOT compiled wheel installed there is likely no need fornvcc, but we can improve this later.Also with flashinfer-python==0.4.0, we can now move to using flashinfer-cubin and flashinfer-jit-cache wheels instead of pre-compiling ourselves in the docker!
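As a quick illustration, a probe like the following reports which of the three distributions are installed. The distribution names come from this description; the snippet itself is only a convenience sketch, not part of the PR.

```python
# Convenience probe (not part of the PR): report which FlashInfer wheels are
# installed. Distribution names are taken from the PR description above.
from importlib.metadata import PackageNotFoundError, version

for dist in ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```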
Test Plan
Test Result
I used a uv Python image to simulate installing `flashinfer-python` without any CUDA toolkit available.
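For instance, inside such an image one could run a probe like this after installing `flashinfer-python` (an assumed reproduction, not the PR's exact test commands):

```python
# Assumed probe for the toolkit-free environment; not the PR's exact test script.
import shutil

# No CUDA toolkit in the image, so the compiler driver should be absent.
print("nvcc on PATH:", shutil.which("nvcc"))  # expected: None

# Since flashinfer-python>=0.2.9 no longer needs nvcc at install/import time,
# this import should still succeed.
import flashinfer
print("flashinfer imported OK:", flashinfer.__name__)
```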