@happierpig (Collaborator) commented on Jul 19, 2025

📌 Description

  1. Per discussion with @haochengxi and @Radioheading, this PR moves the plan function in VariableBlockSparseAttentionWrapper to the GPU side, avoiding expensive host operations that take hundreds of milliseconds.
  2. This PR also enlarges the default internal buffer size to accommodate video DiT use cases.
  3. This PR fixes an integer overflow in the offset calculation for the attention mask, which causes errors in the customized_mask mode of the FA2 prefill template. For example, with kv_len = 128K the offset of the last mask element is roughly 128K × 128K ≈ 1.6e10, which exceeds INT32_MAX (≈2.1e9); see the sketch below.
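
A minimal arithmetic sketch of the overflow (variable names here are purely illustrative, not the kernel's identifiers):

```python
import numpy as np

# For a flattened [qo_len, kv_len] custom mask, the offset of the last element
# is (qo_len - 1) * kv_len + (kv_len - 1).
qo_len = 128 * 1024   # 131072
kv_len = 128 * 1024   # 131072

last_offset = (qo_len - 1) * kv_len + (kv_len - 1)   # ~1.7e10

print(last_offset > np.iinfo(np.int32).max)   # True  -> overflows a 32-bit offset
print(last_offset <= np.iinfo(np.int64).max)  # True  -> fits in a 64-bit offset
```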

🔍 Related Issues

This PR should solve #1271

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

@happierpig requested a review from @yzh119 on Jul 19, 2025, 00:18

@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @happierpig, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on optimizing the VariableBlockSparseAttentionWrapper by offloading its planning phase to the GPU, which addresses a significant performance bottleneck. Additionally, it enhances robustness by increasing internal buffer capacities to support larger models like video DiT and introduces explicit buffer overflow warnings to guide users in case of insufficient memory allocation.

Highlights

  • Performance Optimization (GPU-side Planning): The plan function within VariableBlockSparseAttentionWrapper has been moved to execute on the GPU. This change removes host-side overhead that was previously identified as an expensive operation taking hundreds of milliseconds, improving overall performance.
  • Buffer Size Increase for Video DiT Models: The default internal buffer size for _vector_sparse_indices_buffer in both BlockSparseAttentionWrapper and VariableBlockSparseAttentionWrapper has been increased from 4MB to 128MB, to accommodate the larger memory requirements of video DiT (Diffusion Transformer) models.
  • Buffer Overflow Warnings: New ValueError checks have been added to the run and plan methods. These checks proactively warn users if the internal _vector_sparse_indices_buffer or _vector_sparse_indptr_buffer is not large enough for the current workload, preventing silent failures or unexpected behavior.
  • Test Suite Enhancements: The test suite for block sparse attention has been updated to include set_seed for reproducibility and to ensure that input tensors for variable block sparse attention tests are created directly on the GPU (cuda:0), aligning with the new GPU-side planning logic (see the sketch after this list).
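
As a rough illustration of the last point, plan inputs can be seeded and created directly on the device. This is a minimal sketch with made-up shapes; the tensor names are placeholders, not the actual test-suite variables:

```python
import torch

torch.manual_seed(0)  # reproducibility, analogous to the test suite's set_seed

# Hypothetical plan inputs built directly on the GPU, so GPU-side planning
# does not need to shuttle them over from the host.
num_blocks_row, num_blocks_col = 16, 16
block_row_sz = torch.full((num_blocks_row,), 128, dtype=torch.int32, device="cuda:0")
block_col_sz = torch.full((num_blocks_col,), 128, dtype=torch.int32, device="cuda:0")
block_mask_map = torch.rand(num_blocks_row, num_blocks_col, device="cuda:0") < 0.25
```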


@gemini-code-assist (bot) left a comment


Code Review

This pull request moves the planning logic to the GPU for VariableBlockSparseAttentionWrapper which should help with performance by reducing host-side overhead. The increased buffer size and the new buffer overflow warnings are also valuable additions for robustness. The review focuses on refining the logic of the new buffer overflow checks to prevent potential false positives and improving their error messages for better debuggability.

Comment on lines +593 to +596:

```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel() * self.C
):
```

Severity: high

The condition self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel() * self.C will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change:

```diff
-if (
-    self._vector_sparse_indices_buffer.numel()
-    <= self._paged_kv_indices_buf.numel() * self.C
-):
+if (
+    self._vector_sparse_indices_buffer.numel()
+    < self._paged_kv_indices_buf.numel() * self.C
+):
```
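
For illustration, one way to fold in both points above (strict inequality plus the required and available sizes in the message). The helper below is hypothetical, not code from this PR:

```python
def check_vector_sparse_indices_capacity(
    buffer_numel: int, indices_numel: int, block_cols: int
) -> None:
    """Raise if the buffer cannot hold indices_numel * block_cols entries."""
    required = indices_numel * block_cols
    if buffer_numel < required:
        raise ValueError(
            "_vector_sparse_indices_buffer is not large enough: "
            f"required {required} elements, available {buffer_numel}."
        )
```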

```python
if self._backend == "fa3":
    self._vector_sparse_indptr_buffer[: len(kv_indptr_host)].copy_(
        kv_indptr_host, non_blocking=non_blocking
    )
if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
```

Severity: high

The condition self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel() will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change:

```diff
-if self._vector_sparse_indptr_buffer.numel() <= kv_indptr.numel():
+if self._vector_sparse_indptr_buffer.numel() < kv_indptr.numel():
```

Comment on lines +1150 to +1153:

```python
if (
    self._vector_sparse_indices_buffer.numel()
    <= self._paged_kv_indices_buf.numel()
):
```

Severity: high

The condition self._vector_sparse_indices_buffer.numel() <= self._paged_kv_indices_buf.numel() will raise an error even when the buffer size is exactly equal to the required size, which should be a valid case. The check should be for when the buffer is strictly smaller than the required size. It would be helpful for debugging if the error message included the required and available buffer sizes.

Suggested change
if (
self._vector_sparse_indices_buffer.numel()
<= self._paged_kv_indices_buf.numel()
):
if (
self._vector_sparse_indices_buffer.numel()
< self._paged_kv_indices_buf.numel()
):

@happierpig changed the title from "[fix] move VariableBlockSparsePlan to GPU-side & add buffer overflow warning." to "[fix] fix integer overflow in FA2 customized_mask & add buffer overflow warning." on Jul 19, 2025
@happierpig (Collaborator, Author) commented:

@yzh119 Would you mind reviewing this? This should be ready for review.

@haochengxi commented:

Thanks @happierpig for this great feature. We can generate near-identical videos when applying sparse attention on video diffusion models using this API. Here's the generated video:

102-0.04flashinfer-300M.mp4


@yzh119 (Collaborator) left a comment


Overall LGTM. We should refactor the attention wrapper and plan interface in later PRs, more specifically:

  1. Run the ahead-of-time tile scheduler on the GPU instead of the CPU.
  2. Avoid the data movement in plan functions; let the user, rather than the wrapper, manage the cudagraph-safe buffers.

```python
self._paged_kv_last_page_len = last_block_len.to(
    self.device, non_blocking=non_blocking
)
torch.cuda.synchronize()  # for non-blocking copy
```

Collaborator: Here the assumption is that the input tensors are device tensors?

Collaborator: I suppose if the input tensor is a host tensor, then we can avoid it entirely.
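
For context, a minimal sketch of why a synchronize is needed whenever a non-blocking copy's result is consumed on the host right afterwards (tensor names and values below are purely illustrative):

```python
import torch

kv_indptr_dev = torch.tensor([0, 4, 9], dtype=torch.int32, device="cuda")

# Asynchronous device-to-host copy into pinned memory; the CPU must not read
# the result until the copy has actually completed on the stream.
kv_indptr_host = torch.empty(kv_indptr_dev.shape, dtype=torch.int32, pin_memory=True)
kv_indptr_host.copy_(kv_indptr_dev, non_blocking=True)

torch.cuda.synchronize()        # without this, the CPU may observe stale values
print(kv_indptr_host.tolist())  # safe to consume on the host now
```

If the input tensors already live on the host, that device-to-host round trip (and the synchronize) can indeed be skipped.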

@Radioheading commented:

Many thanks to @happierpig and @yzh119 for supporting this. We firmly believe this will further push the boundaries of efficient attention in domains like video generation.

@yzh119 merged commit ba3f324 into flashinfer-ai:main on Jul 24, 2025
2 checks passed
@Edenzzzz (Contributor) commented:

Hi @haochengxi, could you share the code you used to generate this video? I'd also like to give it a try. Thanks.

@KevinZeng08 commented:

Hi @happierpig, in my test, when I set seqlen=49152, kv_head=32, block_size=128, num_block=seqlen // block_size=384, backend='fa3', which is a uniform block-sparse case, it easily triggers the buffer-size error:

```python
raise ValueError(
    "_vector_sparse_indices_buffer is not large enough. Please increase the buffer size."
)
```

Is there a more robust way to handle a larger number of blocks when sizing _vector_sparse_indices_buffer? Thanks!
