@happierpig (Collaborator) commented on Jul 19, 2025

📌 Description

  1. Per discussion with @haochengxi and @Radioheading, this PR moves the plan function in VariableBlockSparseAttentionWrapper to the GPU side, avoiding expensive host operations that take hundreds of milliseconds.
  2. This PR also enlarges the default internal buffer size to accommodate video DiT use cases.
  3. This PR fixes an integer overflow in the offset calculation for the attention mask, which causes errors in the customized_mask mode of the FA2 prefill template. For example, with kv_len = 128K the offset of the last mask element is roughly 128K × 128K ≈ 1.6e10, which exceeds INT32_MAX (≈2.1e9); see the sketch below.
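
A minimal arithmetic sketch of the overflow (variable names here are purely illustrative, not the kernel's identifiers):

```python
import numpy as np

# For a flattened [qo_len, kv_len] custom mask, the offset of the last element
# is (qo_len - 1) * kv_len + (kv_len - 1).
qo_len = 128 * 1024   # 131072
kv_len = 128 * 1024   # 131072

last_offset = (qo_len - 1) * kv_len + (kv_len - 1)   # ~1.7e10

print(last_offset > np.iinfo(np.int32).max)   # True  -> overflows a 32-bit offset
print(last_offset <= np.iinfo(np.int64).max)  # True  -> fits in a 64-bit offset
```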

🔍 Related Issues

This PR should solve #1271

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

@happierpig requested a review from @yzh119 on Jul 19, 2025, 00:18

@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @happierpig, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on optimizing the VariableBlockSparseAttentionWrapper by offloading its planning phase to the GPU, which addresses a significant performance bottleneck. Additionally, it enhances robustness by increasing internal buffer capacities to support larger models like video DiT and introduces explicit buffer overflow warnings to guide users in case of insufficient memory allocation.

Highlights

  • Performance Optimization (GPU-side Planning): The plan function within VariableBlockSparseAttentionWrapper has been moved to execute on the GPU. This change removes host-side overhead that was previously identified as an expensive operation taking hundreds of milliseconds, improving overall performance.
  • Buffer Size Increase for Video DiT Models: The default internal buffer size for _vector_sparse_indices_buffer in both BlockSparseAttentionWrapper and VariableBlockSparseAttentionWrapper has been increased from 4MB to 128MB, to accommodate the larger memory requirements of video DiT (Diffusion Transformer) models.
  • Buffer Overflow Warnings: New ValueError checks have been added to the run and plan methods. These checks proactively warn users if the internal _vector_sparse_indices_buffer or _vector_sparse_indptr_buffer is not large enough for the current workload, preventing silent failures or unexpected behavior.
  • Test Suite Enhancements: The test suite for block sparse attention has been updated to include set_seed for reproducibility and to ensure that input tensors for variable block sparse attention tests are created directly on the GPU (cuda:0), aligning with the new GPU-side planning logic (see the sketch after this list).
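
As a rough illustration of the last point, plan inputs can be seeded and created directly on the device. This is a minimal sketch with made-up shapes; the tensor names are placeholders, not the actual test-suite variables:

```python
import torch

torch.manual_seed(0)  # reproducibility, analogous to the test suite's set_seed

# Hypothetical plan inputs built directly on the GPU, so GPU-side planning
# does not need to shuttle them over from the host.
num_blocks_row, num_blocks_col = 16, 16
block_row_sz = torch.full((num_blocks_row,), 128, dtype=torch.int32, device="cuda:0")
block_col_sz = torch.full((num_blocks_col,), 128, dtype=torch.int32, device="cuda:0")
block_mask_map = torch.rand(num_blocks_row, num_blocks_col, device="cuda:0") < 0.25
```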


@gemini-code-assist (bot) left a comment


Code Review

This pull request moves the planning logic to the GPU for VariableBlockSparseAttentionWrapper which should help with performance by reducing host-side overhead. The increased buffer size and the new buffer overflow warnings are also valuable additions for robustness. The review focuses on refining the logic of the new buffer overflow checks to prevent potential false positives and improving their error messages for better debuggability.

Comment on lines +593 to +596:

```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel() * self.C
):
```

Severity: high

The condition self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel() * self.C will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change:

```diff
-if (
-    self._vector_sparse_indices_buffer.numel()
-    <= self._paged_kv_indices_buf.numel() * self.C
-):
+if (
+    self._vector_sparse_indices_buffer.numel()
+    < self._paged_kv_indices_buf.numel() * self.C
+):
```
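
For illustration, one way to fold in both points above (strict inequality plus the required and available sizes in the message). The helper below is hypothetical, not code from this PR:

```python
def check_vector_sparse_indices_capacity(
    buffer_numel: int, indices_numel: int, block_cols: int
) -> None:
    """Raise if the buffer cannot hold indices_numel * block_cols entries."""
    required = indices_numel * block_cols
    if buffer_numel < required:
        raise ValueError(
            "_vector_sparse_indices_buffer is not large enough: "
            f"required {required} elements, available {buffer_numel}."
        )
```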

```python
if self._backend == "fa3":
    self._vector_sparse_indptr_buffer[: len(kv_indptr_host)].copy_(
        kv_indptr_host, non_blocking=non_blocking
    )
if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
```

Severity: high

The condition self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel() will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change:

```diff
-if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
+if self._vector_sparse_indptr_buffer.numel() < kv_indptr.numel():
```

Comment on lines +1150 to +1153:

```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel()
):
```

Severity: high

The condition self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel() will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change
if (
self._vector_sparse_indices_buffer.numel()
<= self._paged_kv_indices_buf.numel()
):
if (
self._vector_sparse_indices_buffer.numel()
< self._paged_kv_indices_buf.numel()
):

@happierpig changed the title from "[fix] move VariableBlockSparsePlan to GPU-side & add buffer overflow warning." to "[fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning." on Jul 19, 2025
@happierpig (Collaborator, Author) commented:

@yzh119 Would you mind reviewing this? This should be ready for review.

@haochengxi commented:

Thanks @happierpig for this great feature. We can generate near-identical videos when applying sparse attention on video diffusion models using this API. Here's the generated video:

102-0.04flashinfer-300M.mp4


@yzh119 (Collaborator) left a comment


Overall LGTM. We should refactor the attention wrapper and plan interface in later PRs, more specifically:

  1. Run the ahead-of-time tile scheduler on the GPU instead of the CPU.
  2. Avoid the data movement in plan functions; let the user, rather than the wrapper, manage the cudagraph-safe buffers.

```python
self._paged_kv_last_page_len = last_block_len.to(
    self.device, non_blocking=non_blocking
)
torch.cuda.synchronize()  # for non-blocking copy
```

Collaborator: Here the assumption is that the input tensors are device tensors?

Collaborator: I suppose if the input tensor is a host tensor, then we can avoid it entirely.
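
For context, a minimal sketch of why a synchronize is needed whenever a non-blocking copy's result is consumed on the host right afterwards (tensor names and values below are purely illustrative):

```python
import torch

kv_indptr_dev = torch.tensor([0, 4, 9], dtype=torch.int32, device="cuda")

# Asynchronous device-to-host copy into pinned memory; the CPU must not read
# the result until the copy has actually completed on the stream.
kv_indptr_host = torch.empty(kv_indptr_dev.shape, dtype=torch.int32, pin_memory=True)
kv_indptr_host.copy_(kv_indptr_dev, non_blocking=True)

torch.cuda.synchronize()        # without this, the CPU may observe stale values
print(kv_indptr_host.tolist())  # safe to consume on the host now
```

If the input tensors already live on the host, that device-to-host round trip (and the synchronize) can indeed be skipped.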

@Radioheading commented:

Many thanks to @happierpig and @yzh119 for supporting this. We firmly believe this will further push the boundaries of efficient attention in domains like video generation.

@yzh119 merged commit ba3f324 into flashinfer-ai:main on Jul 24, 2025
2 checks passed
@Edenzzzz (Contributor) commented:

Hi @haochengxi, could you share the code you used to generate this video? I'd also like to give it a try. Thanks.

@KevinZeng08 commented:

Hi @happierpig, in my test, when I set seqlen=49152, kv_head=32, block_size=128, num_block=seqlen // block_size=384, backend='fa3', which is a uniform block-sparse case, it easily triggers the buffer-size error:

```python
raise ValueError(
    "_vector_sparse_indices_buffer is not large enough. Please increase the buffer size."
)
```

Is there a more robust way to handle a larger number of blocks when sizing _vector_sparse_indices_buffer? Thanks!
