[Bugfix] Switch bailout logic for kv-cache-dtype with SM100 Flashinfer #20934
Conversation
…r Backend

Signed-off-by: Pavani Majety <pmajety@nvidia.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes

Hello @pavanimajety, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug in the `kv-cache-dtype` auto-detection logic, specifically for SM100 GPUs. It refines how the system determines support for FP8 KV cache, ensuring that users on SM100 devices can leverage `kv-cache-dtype=auto` without needing to manually specify the FlashInfer backend, thereby improving usability and correctness.

Highlights

- Bugfix for KV-cache dtype support on SM100: I've reordered the logic for determining `kv-cache-dtype=fp8` support to correctly identify SM100 GPUs (see the sketch below). This ensures that `kv-cache-dtype=auto` functions as expected without requiring the `VLLM_ATTENTION_BACKEND=FLASHINFER` environment variable.
- Improved `kv-cache-dtype` auto-detection: The change ensures that the system properly recognizes and enables FP8 KV cache on SM100 devices, streamlining the user experience and preventing unnecessary configuration requirements that previously led to bailouts.
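For context, here is a minimal sketch of what such a reordering could look like; the function name and capability encoding are illustrative assumptions, not the actual vLLM source:

```python
import os

# Hypothetical sketch of the reordered check; names are illustrative,
# not the exact vLLM implementation.
def fp8_kv_cache_supported(major: int, minor: int) -> bool:
    """Return True if an FP8 KV cache can be enabled on this device."""
    if major == 10:
        # SM100 (Blackwell): FlashInfer handles the FP8 KV cache, so we
        # no longer bail out just because VLLM_ATTENTION_BACKEND is unset.
        return True
    if os.environ.get("VLLM_ATTENTION_BACKEND") == "FLASHINFER":
        # FlashInfer explicitly requested on other architectures.
        return True
    # Other backend-specific checks would follow here (elided).
    return False
```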
Code Review
This pull request modifies the logic for detecting FP8 support on SM100 GPUs, removing the requirement for a specific environment variable. The change improves the accuracy of device capability detection.
I just realized this isn't right either, as we need to know that we will be using the FlashInfer backend. Maybe we could check whether FlashInfer is installed and `VLLM_ATTENTION_BACKEND` is set to FlashInfer or unset?
This path should work because we no longer bail out; we proceed with the general platform-based check and set the backend to FlashInfer. Why do we need to check whether FlashInfer is installed at this stage?
My concern is that if we are on Blackwell and don't have FlashInfer installed on V1, then we are stuck using FA2, which doesn't support FP8 KV cache. I guess if the user has already set …
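For illustration, a hedged sketch of the install-plus-backend check suggested above; the helper names are hypothetical, and `importlib.util.find_spec` is used only as a generic way to probe for the `flashinfer` package:

```python
import importlib.util
import os

# Hypothetical helpers mirroring the suggestion above: only rely on the
# FlashInfer FP8 KV-cache path when the package is importable and the
# backend is either unset or explicitly set to FLASHINFER.
def flashinfer_installed() -> bool:
    return importlib.util.find_spec("flashinfer") is not None

def backend_allows_flashinfer() -> bool:
    backend = os.environ.get("VLLM_ATTENTION_BACKEND")
    return backend is None or backend == "FLASHINFER"

def sm100_fp8_kv_cache_ok() -> bool:
    return flashinfer_installed() and backend_allows_flashinfer()
```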
This fix first checks whether the current device is SM100 to validate whether `kv-cache-dtype=fp8` is supported.
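As a concrete illustration (not the PR's actual code), such a device check could be expressed with `torch.cuda.get_device_capability`, which returns the `(major, minor)` compute capability:

```python
import torch

# SM100 corresponds to compute capability (10, 0) on Blackwell GPUs.
major, minor = torch.cuda.get_device_capability()
is_sm100 = (major, minor) == (10, 0)
```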
Purpose
Removes the necessity of specifying `VLLM_ATTENTION_BACKEND=FLASHINFER` when `kv-cache-dtype=auto` is used on SM100 GPUs.

Test Plan
The accuracy itself is not affected by this PR; it only fixes the bailout condition for the SM100 device capability. Without this PR, when the `elif` branch determines `supported = False`, the other `if` branches are not evaluated.
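A toy sketch (hypothetical names, not the vLLM source) of why the branch order matters: once an `elif` arm runs and leaves `supported = False`, the later arms in the same chain are skipped, so the SM100 check must come first:

```python
def old_check(is_sm100: bool, backend_env: str | None) -> bool:
    # Pre-fix ordering: the bailout arm fires before SM100 is considered.
    supported = False
    if backend_env == "FLASHINFER":
        supported = True
    elif backend_env is None:
        supported = False  # bails out here; the SM100 arm below never runs
    elif is_sm100:
        supported = True
    return supported

def new_check(is_sm100: bool, backend_env: str | None) -> bool:
    # Post-fix ordering: device capability is consulted first.
    if is_sm100:
        return True
    return backend_env == "FLASHINFER"

assert old_check(True, None) is False  # bailout despite SM100
assert new_check(True, None) is True   # SM100 now recognized
```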
Test Result

Before: verified with `kv_cache_dtype="auto"` and `kv_cache_dtype="fp8"` (log screenshots omitted).

After: verified with `kv_cache_dtype="auto"` and `kv_cache_dtype="fp8"` (log screenshots omitted).
(Optional) Documentation Update