[Kernel] Add tune-able block sizes to unified 2d triton kernel #20690
Conversation
Summary of Changes
Hello @bringlein, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces dynamic, tune-able block sizes to the unified 2D Triton attention kernel, allowing it to adapt its computational strategy based on input sequence lengths. This enhancement leads to substantial performance improvements, particularly for prefill operations, and establishes a foundation for future fine-grained kernel optimizations.
Highlights
- Tune-able Block Sizes: The unified 2D Triton attention kernel (`kernel_unified_attention_2d`) now supports tune-able `BLOCK_M` (query block size) and `BLOCK_N` (key/value block size) parameters, allowing for more flexible work distribution.
- Dynamic Heuristics for Block Sizes: The `unified_attention` function now dynamically sets `BLOCK_M` based on `max_seqlen_q` and `BLOCK_N` based on `max_seqlen_k` using simple heuristics, enabling the kernel to adapt its block sizes to different sequence lengths for improved performance (see the sketch after this list).
- Performance Enhancements: This optimization significantly improves prefill attention performance, showing up to a 78% latency reduction on H100 for specific benchmarks. It also provides modest gains (around 4%) for serving throughput and time-to-first-token (TTFT).
- Kernel Logic Refactoring: The internal logic of `kernel_unified_attention_2d` has been refactored to accommodate the new block sizes. This includes deriving `BLOCK_Q` internally, replacing the `find_seq_idx` helper with an inline binary search, and adjusting tensor dimensions and loop iterations for attention scores and accumulators to align with `BLOCK_Q` and `BLOCK_N`.
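As a rough illustration of how such a heuristic might look (a minimal sketch; the thresholds and the helper name `select_block_sizes` are assumptions, not code from this PR), the block sizes could be chosen from the maximum query and key lengths of the batch:

```python
# Hypothetical sketch of a block-size heuristic; thresholds and the
# function name are illustrative, not the values used in this PR.
def select_block_sizes(max_seqlen_q: int, max_seqlen_k: int) -> tuple[int, int]:
    # Prefill-heavy batches (long queries) benefit from larger query blocks,
    # while decode batches (single-token queries) favor small ones.
    block_m = 64 if max_seqlen_q > 1 else 16
    # Longer key/value sequences amortize larger key blocks.
    block_n = 64 if max_seqlen_k >= 4096 else 32
    return block_m, block_n


# Example: a prefill batch with 2048 query tokens and 2048 cached tokens.
BLOCK_M, BLOCK_N = select_block_sizes(max_seqlen_q=2048, max_seqlen_k=2048)
```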
Code Review
This pull request introduces tunable block sizes to the unified attention kernel, which is a great optimization that yields impressive performance improvements. The implementation is solid. My feedback focuses on enhancing code clarity and maintainability by simplifying some conditions and correcting comments. Overall, this is a valuable contribution.
Co-authored-by: Tom Parnell <tpa@zurich.ibm.com>
Co-authored-by: Jan Van Lunteren <jvl@zurich.ibm.com>
Signed-off-by: Burkhard Ringlein <ngl@zurich.ibm.com>
Force-pushed from 4923046 to e99a139.
Force-pushed from 70146a4 to 3556c2a.
While preparing the PR, I merged with an old version of the kernel (see above). I fixed it now, sorry for that. Neither performance nor correctness were affected:
- Updated latency (including #18100)
- Updated performance (including #18100)
- Updated correctness (including #18100)
After feedback from @SageMoore, I tried to rebalance the preliminary heuristics used for selecting the block sizes.
Hi @bringlein, we discussed this a bit offline, but it looks like there's a slight regression in performance at higher QPS rates. Here are some results that were collected on main and on your PR with the following script.
- Results from main
- Results from this PR

It's not clear to me if the performance tradeoffs are worth it in this case. Obviously the long-prefill improvements are great, so this may just be a tradeoff that we want to make.
CC: @gshtras @robertgshaw2-redhat
This PR adds the tunable 2D attention kernel (similar to vllm-project/vllm#20690) with tuning based on the micro-benchmarks already carried out for H100 and MI300.
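A natural way to encode such per-platform tuning results is a small lookup keyed on the detected GPU. This is purely a hypothetical sketch; the table name, values, and fallback defaults below are not taken from this PR or its micro-benchmarks:

```python
# Hypothetical per-platform block-size table; the numbers are placeholders,
# not the tuned values from the H100/MI300 micro-benchmarks.
TUNED_BLOCK_SIZES = {
    # platform name: (BLOCK_M, BLOCK_N) for prefill-heavy workloads
    "NVIDIA H100": (64, 64),
    "AMD MI300": (32, 64),
}


def lookup_block_sizes(device_name: str) -> tuple[int, int]:
    # Fall back to conservative defaults on unknown hardware.
    return TUNED_BLOCK_SIZES.get(device_name, (16, 32))
```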
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!
This pull request has been automatically closed due to inactivity. Please feel free to reopen if you intend to continue working on it. Thank you!
This PR introduces tune-able block sizes to the unified attention kernel, which enhances prefill attention performance. For now, we use simple heuristics to determine the right block sizes, but we intend to tune them further for targeted platforms in the very near future.
Performance
`benchmark_latency.py`

On H100, with this PR:

On H100, with current upstream using the `triton_attn` backend:

So, this PR decreases the latency of prefill by 78%.
`benchmark_serving.py`

On H100, with this PR:

Before, on H100, with current upstream using the `triton_attn` backend:

So, also here, this PR improves throughput, TTFT, and ITL by about 4%.
Correctness
With this PR on an H100:
More Context
The optimization allows the unified attention kernel to adapt the distribution of the work across compute units / streaming multiprocessors depending on the lengths of the requests in a batch, as sketched below.
While, depending on the use case, the performance increases are only modest (4%), this PR enables more fine-grained tuning of the unified attention kernel in the future. For example, even with the very basic heuristic, the performance increase is 78% for prefill-heavy latency use cases.
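To illustrate why the block size affects work distribution, here is a minimal sketch: the relation `BLOCK_Q = BLOCK_M // num_queries_per_kv` and the helper name follow the description above but are assumptions, not the exact kernel code.

```python
import math

# Minimal sketch: how the number of launched kernel programs depends on the
# query block size. One program instance handles one query block, so the
# block size trades per-program work against parallelism across SMs.
def num_query_blocks(seq_lens_q: list[int], block_m: int,
                     num_queries_per_kv: int) -> int:
    # Assumed derivation of the internal query block size from BLOCK_M.
    block_q = block_m // num_queries_per_kv
    return sum(math.ceil(q_len / block_q) for q_len in seq_lens_q)


# Example: a single 8192-token prefill with 4 query heads per KV head.
print(num_query_blocks([8192], block_m=64, num_queries_per_kv=4))   # 512 programs
print(num_query_blocks([8192], block_m=128, num_queries_per_kv=4))  # 256 programs
```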
CC @jvlunteren @tdoublep @SageMoore