[V1] Use FlashInfer by default on Blackwell GPUs #19118
Conversation
Signed-off-by: mgoin <mgoin64@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Hello @mgoin, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
Summary of Changes
As gemini-code-assist, I've reviewed this pull request. The primary goal is to enhance performance on NVIDIA Blackwell GPUs by making FlashInfer the default attention backend when it's available. The author notes that FlashInfer has a specific, more performant backend for Blackwell compared to FlashAttention2 (FA3 is unsupported on V1). This change modifies the backend selection logic within vllm/platforms/cuda.py to prioritize FlashInfer for Blackwell architectures (compute capability 10.x or 100), falling back to other available backends if FlashInfer is not installed.
Highlights
- Prioritize FlashInfer on Blackwell: FlashInfer is now the preferred attention backend for NVIDIA Blackwell GPUs (compute capability 10.x / 100) in both V1 and V2 engine paths.
- Conditional Selection: The FlashInfer backend is only selected if the `flashinfer` library is successfully imported. If the import fails, the system falls back to the existing backend selection logic (e.g., Flash Attention).
- Backend Selection Logic Update: The `get_attn_backend_cls` function in `vllm/platforms/cuda.py` is updated to check for Blackwell capability before other architectures like Ampere+.
Changelog
- `vllm/platforms/cuda.py` (a rough sketch of the resulting selection flow follows this changelog)
  - Modified the `get_attn_backend_cls` function to add checks for NVIDIA Blackwell GPUs (`DeviceCapability(10, 0)` for the V1 engine and `100` for the V2 engine).
  - Introduced logic to attempt importing the `flashinfer` library when a Blackwell GPU is detected.
  - If `flashinfer` is successfully imported on a Blackwell GPU, the `FlashInferBackend` is returned.
  - Included exception handling for the `flashinfer` import to ensure graceful fallback to other backends if FlashInfer is not installed.
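To make the selection flow above concrete, here is a minimal illustrative sketch in Python. It is not the actual diff: the backend class path strings and the standalone helper name `select_attn_backend_cls` are assumptions for illustration, while the capability checks mirror the changelog entries and `DeviceCapability` is assumed to come from `vllm/platforms/interface.py`.

```python
# Illustrative sketch only -- not the actual vLLM diff. The backend class
# path strings and this helper's name are assumptions; the capability
# checks mirror the changelog above.
from vllm.platforms.interface import DeviceCapability


def select_attn_backend_cls(platform_cls, use_v1: bool) -> str:
    # Blackwell check: exact SM 10.0 on the V1 path, capability 100 on the
    # V2 path (as described in the changelog).
    on_blackwell = (
        platform_cls.get_device_capability() == DeviceCapability(10, 0)
        if use_v1 else platform_cls.has_device_capability(100))
    if on_blackwell:
        try:
            import flashinfer  # noqa: F401
            return ("vllm.v1.attention.backends.flashinfer.FlashInferBackend"
                    if use_v1 else
                    "vllm.attention.backends.flashinfer.FlashInferBackend")
        except Exception:
            # flashinfer is not installed (or failed to import), so fall
            # through to the existing backend selection logic.
            pass
    # ... existing selection logic (e.g. FlashAttention) continues here ...
    raise NotImplementedError("non-Blackwell selection elided in this sketch")
```

The key behavior is that a failed `flashinfer` import never aborts startup; selection simply falls back to the pre-existing logic.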
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Blackwell's power shines,
FlashInfer speeds up the lines,
Fast attention flies.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Code Review
This pull request effectively enables FlashInfer by default for Blackwell GPUs, which is a good performance enhancement. The changes are well-targeted within vllm/platforms/cuda.py.
I have a couple of suggestions primarily around error handling and logging to improve debuggability, particularly if flashinfer fails to import or initialize.
Summary of Findings
- Logging for FlashInfer Import Failure (V1 Engine): The V1 engine path currently fails over silently if the `flashinfer` import raises an exception. Adding a log message would improve debuggability.
- Exception Specificity for FlashInfer Import: Both V1 and V2 paths use a broad `except Exception`. Consider using `except ImportError:` if that's the primary expected failure, or adjusting log messages if other import-time exceptions from `flashinfer` itself are meant to be caught. This enhances clarity and helps in diagnosing issues (a sketch of this pattern follows the list).
- Consistency in GPU Capability Check: The V1 engine path uses `cls.get_device_capability() == DeviceCapability(10, 0)` while the V2 path uses `cls.has_device_capability(100)` to check for Blackwell GPUs. Both are functionally equivalent. Using `DeviceCapability(10, 0)` is arguably more explicit, but the current approach is also acceptable. This was not commented on directly due to review settings.
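As a hedged illustration of the first two findings, the suggested pattern could look roughly like the sketch below; the logger setup, the helper name `try_flashinfer_backend_cls`, and the returned class path string are assumptions, not code from this PR.

```python
# Hedged sketch of the reviewer's suggestion, not the committed change:
# catch ImportError specifically and log the fallback so a missing
# flashinfer install is easy to diagnose.
import logging
from typing import Optional

logger = logging.getLogger(__name__)


def try_flashinfer_backend_cls() -> Optional[str]:
    try:
        import flashinfer  # noqa: F401
    except ImportError:
        logger.warning(
            "Blackwell GPU detected, but flashinfer is not installed; "
            "falling back to the default attention backend.")
        return None
    return "vllm.v1.attention.backends.flashinfer.FlashInferBackend"
```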
Merge Readiness
The pull request is well-focused and addresses the goal of enabling FlashInfer for Blackwell GPUs. However, there are a few medium-severity suggestions regarding logging and exception handling that would improve the robustness and maintainability of the code. I recommend addressing these points before merging. As an AI, I am not authorized to approve pull requests; please ensure further review and approval from the maintainers.
Signed-off-by: mgoin <mgoin64@gmail.com>
…ralmagic/vllm into blackwell-default-flashinfer
Signed-off-by: mgoin <mgoin64@gmail.com>
Why a new function rather than using `get_device_capability() == 100`?
@youkaichao
Purpose
FlashInfer has a dedicated backend for NVIDIA Blackwell, so it is much more performant than the FlashAttention2 default in V1 (FA3 is unsupported). If a user has it installed, we should choose it by default, which this PR achieves. See this comment for benchmarks: #18095 (comment)
This PR also adds `is_device_capability` to the platform interface for exact compute capability checking, since we only want this behavior for SM 10.0 specifically (a rough sketch of the intended semantics is below).
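For illustration, here is a rough sketch of how the exact check could differ from the existing minimum-capability check. It assumes the `DeviceCapability` helper and its `to_int()` method from `vllm/platforms/interface.py`; the method bodies are an assumption rather than the PR's actual diff.

```python
# Rough sketch, assuming the existing DeviceCapability helper in
# vllm/platforms/interface.py; the method bodies in the actual PR may differ.
from typing import Optional

from vllm.platforms.interface import DeviceCapability


class Platform:

    @classmethod
    def get_device_capability(cls,
                              device_id: int = 0
                              ) -> Optional[DeviceCapability]:
        # Returns e.g. DeviceCapability(major=10, minor=0) on a B200.
        raise NotImplementedError

    @classmethod
    def has_device_capability(cls, capability: int,
                              device_id: int = 0) -> bool:
        # Minimum check: True for SM 10.0 *and* anything newer.
        current = cls.get_device_capability(device_id)
        return current is not None and current.to_int() >= capability

    @classmethod
    def is_device_capability(cls, capability: int,
                             device_id: int = 0) -> bool:
        # Exact check: True only when the device is exactly SM 10.0
        # (capability == 100), which is what the Blackwell path wants.
        current = cls.get_device_capability(device_id)
        return current is not None and current.to_int() == capability
```

The distinction matters because a minimum check like `has_device_capability(100)` would also match any architecture newer than SM 10.0, while the exact check only matches Blackwell SM 10.0.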
Test Plan
Test locally on a B200 that FlashInfer is enabled by default.
Test Result
On B200 without flashinfer installed:
On B200 with flashinfer installed: