[UX] Add FlashInfer as default CUDA dependency #26443
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
Code Review
This pull request aims to make FlashInfer a default CUDA dependency, which is a great step for user experience. My main concern is with the change in `vllm/utils/flashinfer.py`, which could negatively impact users in environments without the full CUDA toolkit. Please see my detailed comment.
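For context, the nvcc gate under discussion behaves roughly like the sketch below. This is a minimal approximation, not the PR's literal diff: `has_flashinfer()` is the real helper named in the PR description, while `has_nvcc` and the `shutil.which` probe are assumptions.

```python
# Minimal sketch of nvcc-gated FlashInfer detection (assumed shape, not the
# literal diff). has_flashinfer is named in the PR; has_nvcc is hypothetical.
import importlib.util
import shutil

def has_nvcc() -> bool:
    # True only when the CUDA compiler driver is on PATH.
    return shutil.which("nvcc") is not None

def has_flashinfer() -> bool:
    # Conservative: require both the package and a JIT-capable toolchain,
    # even though an AOT-compiled wheel would not actually need nvcc.
    return importlib.util.find_spec("flashinfer") is not None and has_nvcc()
```

In an environment that ships only precompiled kernels and no CUDA toolkit, this conservative condition returns False, which is the regression the review is flagging.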
💡 Codex Review
Here are some automated review suggestions for this pull request.
Do we still need the FlashInfer compilation at all? FlashInfer as of today releases a nightly wheel with full precompiled cubins.
@bbartels We will start using the precompiled wheels (
That's fair; it might not be a good idea to require them for the vllm wheel. But I could see it being worth including them by default in the Dockerfile, since that would unblock moving back to the substantially smaller NVIDIA base image.
This pull request has merge conflicts that must be resolved before it can be merged.
yewentao256 left a comment:
LGTM, thanks for the work!
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: xuebwang-amd <xuebwang@amd.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Dhruvil Bhatt <bhattdbh@amazon.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: mgoin <mgoin64@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
… (vllm-project#27990) Summary: vllm-project#26443 adds a check for the availability of `nvcc` as a condition for enabling FlashInfer MoE. On devgpus we may have `nvcc`, so there is no issue; but in tw jobs there is no `nvcc`, so FlashInfer MoE is disabled. Differential Revision: D86104899 Signed-off-by: Xiaozhu <mxz297@gmail.com>
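A plausible shape for that follow-up fix, sketched under assumptions (the helper name and the AOT module names are hypothetical; only the nvcc-gating behavior comes from the summary above): require `nvcc` only for the JIT path, and accept AOT artifacts without it.

```python
# Hypothetical relaxation (not the actual vllm-project#27990 diff): only demand
# nvcc when kernels must be JIT-compiled; accept AOT artifacts without it.
import importlib.util
import shutil

def has_flashinfer_moe() -> bool:
    if importlib.util.find_spec("flashinfer") is None:
        return False
    # AOT kernels (assumed module names for the cubin/jit-cache wheels) need no nvcc.
    aot_present = any(
        importlib.util.find_spec(mod) is not None
        for mod in ("flashinfer_cubin", "flashinfer_jit_cache")
    )
    return aot_present or shutil.which("nvcc") is not None
```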
Purpose
It seems that FlashInfer does not require
nvccto installed from source anymore sinceflashinfer-python>=0.2.9, so we can move it to be a default dependency!Obviously to have it JIT compile kernels it needs to have
nvccavailable, so we will currently add that condition tohas_flashinfer(). This is more conservative than it needs to be, as if we have an AOT compiled wheel installed there is likely no need fornvcc, but we can improve this later.Also with flashinfer-python==0.4.0, we can now move to using flashinfer-cubin and flashinfer-jit-cache wheels instead of pre-compiling ourselves in the docker!
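As a quick illustration, a probe like the following reports which of the three distributions are installed. The distribution names come from this description; the snippet itself is only a convenience sketch, not part of the PR.

```python
# Convenience probe (not part of the PR): report which FlashInfer wheels are
# installed. Distribution names are taken from the PR description above.
from importlib.metadata import PackageNotFoundError, version

for dist in ("flashinfer-python", "flashinfer-cubin", "flashinfer-jit-cache"):
    try:
        print(f"{dist}=={version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed")
```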
Test Plan
Test Result
I used a uv Python image to simulate installing `flashinfer-python` without any CUDA toolkit available.
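For instance, inside such an image one could run a probe like this after installing `flashinfer-python` (an assumed reproduction, not the PR's exact test commands):

```python
# Assumed probe for the toolkit-free environment; not the PR's exact test script.
import shutil

# No CUDA toolkit in the image, so the compiler driver should be absent.
print("nvcc on PATH:", shutil.which("nvcc"))  # expected: None

# Since flashinfer-python>=0.2.9 no longer needs nvcc at install/import time,
# this import should still succeed.
import flashinfer
print("flashinfer imported OK:", flashinfer.__name__)
```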