[fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning. #1290
Conversation
Summary of Changes
Hello @happierpig, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on optimizing the VariableBlockSparseAttentionWrapper by offloading its planning phase to the GPU, which addresses a significant performance bottleneck. Additionally, it enhances robustness by increasing internal buffer capacities to support larger models like video DiT and introduces explicit buffer overflow warnings to guide users in case of insufficient memory allocation.
Highlights
- Performance Optimization (GPU-side Planning): The `plan` function within `VariableBlockSparseAttentionWrapper` has been moved to execute on the GPU. This change aims to significantly reduce host-side overhead, which was previously identified as an expensive operation taking hundreds of milliseconds.
- Buffer Size Increase for Video DiT Models: The default size of the internal `_vector_sparse_indices_buffer` in both `BlockSparseAttentionWrapper` and `VariableBlockSparseAttentionWrapper` has been substantially increased from 4MB to 128MB to accommodate the larger memory requirements of video DiT (Diffusion Transformer) models (a quick sizing sketch follows this list).
- Buffer Overflow Warnings: New `ValueError` checks have been added to the `run` and `plan` methods. These checks proactively warn users when the internal `_vector_sparse_indices_buffer` or `_vector_sparse_indptr_buffer` is not large enough for the current workload, preventing silent failures or unexpected behavior.
- Test Suite Enhancements: The block sparse attention test suite now calls `set_seed` for reproducibility and creates the input tensors for variable block sparse attention tests directly on the GPU (`cuda:0`), aligning with the new GPU-side planning logic.
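For scale, a quick back-of-the-envelope on the new default, assuming the buffer stores 4-byte (int32) indices (the element type is an assumption here, not confirmed by this PR):

```python
buffer_bytes = 128 * 1024 * 1024    # new 128MB default
old_bytes = 4 * 1024 * 1024         # previous 4MB default
index_bytes = 4                     # assuming int32 sparse indices
print(buffer_bytes // index_bytes)  # 33554432 -> roughly 33.5M indices now fit
print(old_bytes // index_bytes)     # 1048576  -> roughly 1M indices before
```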
Code Review
This pull request moves the planning logic to the GPU for `VariableBlockSparseAttentionWrapper`, which should improve performance by reducing host-side overhead. The increased buffer size and the new buffer overflow warnings are also valuable additions for robustness. The review focuses on refining the logic of the new buffer overflow checks to prevent potential false positives, and on improving their error messages for better debuggability.
```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel() * self.C
):
```
The condition `self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel() * self.C` will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should fire only when the buffer is strictly smaller than the required size. It would also help debugging if the error message included the required and available buffer sizes.
Suggested change:
```diff
-if (
-    self._vector_sparse_indices_buffer.numel()
-    <= self._paged_kv_indices_buf.numel() * self.C
-):
+if (
+    self._vector_sparse_indices_buffer.numel()
+    < self._paged_kv_indices_buf.numel() * self.C
+):
```
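A sketch combining the reviewer's two points: a strict `<` comparison plus an error message that reports required and available sizes. The helper name and message wording are illustrative, not the code merged in this PR:

```python
def check_indices_capacity(buffer, paged_kv_indices, C: int) -> None:
    # Raise only when the buffer is strictly smaller than the workload
    # requires; an exact fit is valid.
    required = paged_kv_indices.numel() * C
    available = buffer.numel()
    if available < required:
        raise ValueError(
            f"_vector_sparse_indices_buffer is too small: need {required} "
            f"elements but only {available} are allocated; please allocate "
            f"a larger buffer."
        )
```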
```python
if self._backend == "fa3":
    self._vector_sparse_indptr_buffer[: len(kv_indptr_host)].copy_(
        kv_indptr_host, non_blocking=non_blocking
    )
...
if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
```
The condition `self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel()` will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should fire only when the buffer is strictly smaller than the required size. It would also help debugging if the error message included the required and available buffer sizes.
Suggested change:
```diff
-if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
+if self._vector_sparse_indptr_buffer.numel() < kv_indptr.numel():
```
```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel()
):
```
The condition `self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel()` will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should fire only when the buffer is strictly smaller than the required size. It would also help debugging if the error message included the required and available buffer sizes.
Suggested change:
```diff
-if (
-    self._vector_sparse_indices_buffer.numel()
-    <= self._paged_kv_indices_buf.numel()
-):
+if (
+    self._vector_sparse_indices_buffer.numel()
+    < self._paged_kv_indices_buf.numel()
+):
```
@yzh119 Would you mind reviewing this? This should be ready for review.
Thanks @happierpig for this great feature. We can generate near-identical videos when applying sparse attention on video diffusion models using this API. Here's the generated video: 102-0.04flashinfer-300M.mp4
Overall LGTM. We should refactor the attention wrapper and plan interface in later PRs; more specifically (a hypothetical sketch of the second point follows this list):
- run the ahead-of-time tile scheduler on the GPU instead of the CPU.
- avoid the data movement in plan functions; let the user, instead of the wrapper, manage the CUDA-graph-safe buffers.
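A hypothetical sketch of the second point (all names are illustrative; this is not the current FlashInfer API): the caller allocates fixed, CUDA-graph-safe buffers once, and the plan step only writes into them, so the wrapper performs no internal copies:

```python
import torch

# Caller-owned buffers: their addresses stay fixed across CUDA graph replays.
vector_sparse_indices = torch.empty(32 * 1024 * 1024, dtype=torch.int32, device="cuda")
vector_sparse_indptr = torch.empty(64 * 1024, dtype=torch.int32, device="cuda")

def plan_into(buf_indices, buf_indptr, kv_indices, kv_indptr):
    # Hypothetical plan step: fill caller-owned buffers in place, on the GPU.
    buf_indices[: kv_indices.numel()].copy_(kv_indices, non_blocking=True)
    buf_indptr[: kv_indptr.numel()].copy_(kv_indptr, non_blocking=True)
```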
```python
self._paged_kv_last_page_len = last_block_len.to(
    self.device, non_blocking=non_blocking
)
torch.cuda.synchronize()  # for non-blocking copy
```
Is the assumption here that the input tensors are device tensors?
I suppose if the input tensor is a host tensor, then we can avoid it entirely.
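For context, a minimal sketch of the rule in PyTorch (the helper name is hypothetical): device-to-device copies are stream-ordered, and `non_blocking=True` only takes effect for pinned host memory, so the explicit synchronize is mainly needed when the source is pinned host memory whose buffer the host will reuse:

```python
import torch

def copy_to_device(t: torch.Tensor, device: torch.device) -> torch.Tensor:
    # Hypothetical helper illustrating the cases discussed above.
    if t.is_cuda:
        # Device-to-device copies are stream-ordered: later kernels on the
        # same stream see the copied data, no host synchronize required.
        return t.to(device, non_blocking=True)
    out = t.to(device, non_blocking=True)
    if t.is_pinned():
        # Pinned host -> device is truly asynchronous; synchronize before
        # the host reuses or frees the source buffer.
        torch.cuda.synchronize()
    # Pageable host -> device: non_blocking has no effect, the copy is
    # already synchronous with respect to the host.
    return out
```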
Many thanks to @happierpig and @yzh119 for supporting this. We firmly believe this will further push the boundaries of efficient attention in domains like video generation.
Hi @haochengxi, could you share the code you used to generate this video? I'd also like to give it a try. Thanks.
Hi @happierpig, in my test I ran up against the buffer allocated in flashinfer/flashinfer/sparse.py, lines 1187 to 1189 (at 17978a3). Could you provide a robust way to deal with a larger number of blocks when setting `_vector_sparse_indices_buffer`? Thanks!
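Following the capacity check introduced in this PR (required elements = `paged_kv_indices.numel() * C`), a sketch of estimating whether a workload fits in the default buffer; the block count and column block size below are hypothetical:

```python
num_kv_blocks = 5_000_000                  # hypothetical paged_kv_indices.numel()
C = 16                                     # hypothetical column block size
required_elems = num_kv_blocks * C         # mirrors the check in this PR
required_mib = required_elems * 4 / 2**20  # assuming 4-byte int32 indices
print(f"{required_mib:.0f} MiB required vs. the 128 MiB default")  # 305 MiB
```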
📌 Description
- Move the `plan` function in `VariableBlockSparseAttentionWrapper` to the GPU side, to avoid expensive (hundreds of ms) host operations.
- Fix an integer overflow in the `customized_mask` mode of the FA2 prefill template. E.g., with `kv_len=128K`, the last element of the attention map sits at offset `128*128*1e6 ≈ 1.6e10`, which is larger than `INT32_MAX` (a small numeric check follows below).
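To make the overflow concrete, a small Python check of the arithmetic above (Python integers do not overflow, so this only demonstrates the magnitude; the actual fix applies to the kernel's 32-bit index computation):

```python
INT32_MAX = 2**31 - 1
qo_len = kv_len = 128_000       # "128K" as in the description
last_offset = qo_len * kv_len   # flattened index of the last mask element
print(last_offset)              # 16384000000, i.e. 128*128*1e6 ~ 1.6e10
print(last_offset > INT32_MAX)  # True -> a 32-bit offset would wrap around
# Hence the fix: compute customized_mask offsets with 64-bit integers.
```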
🔍 Related Issues
This PR should solve #1271.
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used my preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- Tests have been added or updated as needed.
- All tests are passing (`unittest`, etc.).

Reviewer Notes