
Conversation

@chengyupku (Contributor) commented Nov 2, 2025

This pull request refactors and improves the example_gqa_fwd_varlen.py implementation for variable-length Flash Attention with grouped query attention (GQA). The changes enhance correctness, efficiency, and clarity, especially around sequence masking, causal logic, and kernel launch parameters.

Attention masking and causal logic improvements

  • Refactored the reference attention computation in attention_ref to use explicit window-based masking of visible tokens, improving correctness for variable-length and causal attention. Key and query padding masks are now combined with the visibility mask for more accurate results (a sketch follows this list).
  • Updated the causal logic in the main kernel to use a more robust mask and loop-range calculation, ensuring proper handling of both causal and non-causal cases.
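A minimal PyTorch sketch of the reference-side masking idea described above; the function and tensor names here (masked_ref_scores, q_lens, k_lens) are illustrative stand-ins, not the exact variables used in attention_ref:

import torch

def masked_ref_scores(scores, q_lens, k_lens, causal=True):
    # scores: (batch, heads, T_q, T_k) raw attention logits.
    b = scores.shape[0]
    tq, tk = scores.shape[-2], scores.shape[-1]
    q_idx = torch.arange(tq).view(1, 1, tq, 1)
    k_idx = torch.arange(tk).view(1, 1, 1, tk)
    # Per-sequence padding masks built from the variable lengths.
    q_pad = q_idx < q_lens.view(b, 1, 1, 1)
    k_pad = k_idx < k_lens.view(b, 1, 1, 1)
    visible = q_pad & k_pad
    if causal:
        # Align the causal diagonal to the sequence ends, as varlen
        # references commonly do when the kv and q lengths differ.
        visible = visible & (k_idx <= q_idx + (k_lens - q_lens).view(b, 1, 1, 1))
    # A large negative fill (rather than -inf) keeps fully masked rows
    # from producing NaNs after the softmax.
    return scores.masked_fill(~visible, -1e9)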

Kernel parameter and launch optimizations

  • Increased the block sizes and parallelism (block_M, block_N, num_stages, threads) for the tile-lang kernel launch, improving performance.
  • Adjusted the benchmarking to use multiple warmup and repeat runs for more reliable latency measurements (see the sketch below).
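A hedged sketch of the warmup/repeat benchmarking pattern the second bullet refers to, written against Triton's do_bench interface; the kernel handle and the specific counts are assumptions, not the PR's actual values:

import triton.testing

def bench_latency(kernel_fn, *args):
    # Warmup runs let clocks and caches settle; repeated timed runs
    # average out timer noise. The counts here are illustrative.
    return triton.testing.do_bench(lambda: kernel_fn(*args), warmup=25, rep=100)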

Code simplification and correctness fixes

  • Removed the unnecessary requires_grad=True from input tensor creation, since gradients are not needed for this example.
  • Unified the key and value start/end indices and sequence lengths for grouped attention, reducing code duplication and potential errors (a sketch follows this list).
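A minimal sketch of the K/V index consolidation described above, assuming the cu_seqlens-style cumulative-length offsets that varlen flash-attention examples typically use; everything except the kv_start_idx name is an assumption:

import torch

def kv_window(cu_seqlens_k: torch.Tensor, batch_idx: int):
    # K and V are packed with the same cumulative-length table in the
    # varlen layout, so one start index and one length serve both,
    # replacing the former separate k_start_idx / v_start_idx.
    kv_start_idx = int(cu_seqlens_k[batch_idx])
    kv_current_seqlen = int(cu_seqlens_k[batch_idx + 1]) - kv_start_idx
    return kv_start_idx, kv_current_seqlen

# Example: three sequences of lengths 5, 3, and 7 packed back to back.
cu = torch.tensor([0, 5, 8, 15])
print(kv_window(cu, 1))  # (5, 3)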

Summary by CodeRabbit

  • Bug Fixes

    • Improved numerical stability in attention computation
    • Enhanced masking for windowed attention and causal patterns
  • Performance Improvements

    • Optimized kernel tiling configuration for better throughput
    • Refined benchmarking measurements for more accurate performance assessment

@coderabbitai bot (Contributor) commented Nov 2, 2025

Walkthrough

The PR updates the GQA variable-length flash attention example with improved windowed-attention handling through visibility masking, enhanced numerical stability via explicit score masking, consolidated kernel indexing parameters, and an enlarged tiling configuration for performance, spanning both the reference implementation and the main kernel.

Changes

Cohort / File(s): GQA Flash Attention Example Updates — examples/flash_attention/example_gqa_fwd_varlen.py

Summary:
  • attention_ref: replaced implicit dimension calculations with explicit shape unpacking (b, T, Hq, D); added windowed attention with a visibility mask across time/source positions, applied to the scores via neg_inf; integrated key padding masking with the visibility mask; updated the tiling g-factors; applied softmax on the masked scores; updated the output masking logic.
  • main kernel: increased block_M/block_N from 64 → 128, num_stages from 1 → 2, and threads from 128 → 256; removed requires_grad guards; added an is_causal-driven FLOPs branch; consolidated k_start_idx/v_start_idx into kv_start_idx; replaced strict -infinity checks with -1e9; added explicit max/sum updates; expanded the do_bench parameters; added layout annotations for O_shared/Q_shared.
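For the layout-annotation item above, a hedged sketch of how swizzled layouts for O_shared/Q_shared are typically declared with the annotate_layout and make_swizzled_layout helpers listed in the code-graph section below; the kernel shell, shapes, and dtype are assumptions, and exact decorator spellings may vary across TileLang versions:

import tilelang.language as T
from tilelang.layout import make_swizzled_layout

def copy_demo(M=128, D=128, dtype="float16"):

    @T.prim_func
    def main(Q: T.Tensor((M, D), dtype), O: T.Tensor((M, D), dtype)):
        with T.Kernel(1, threads=128):
            Q_shared = T.alloc_shared((M, D), dtype)
            O_shared = T.alloc_shared((M, D), dtype)
            # Swizzled shared-memory layouts avoid bank conflicts on the
            # staging copies, mirroring the PR's O_shared/Q_shared hints.
            T.annotate_layout({
                Q_shared: make_swizzled_layout(Q_shared),
                O_shared: make_swizzled_layout(O_shared),
            })
            T.copy(Q, Q_shared)
            T.copy(Q_shared, O_shared)
            T.copy(O_shared, O)

    return main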

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • attention_ref masking logic: Verify windowed attention visibility mask implementation, intersection with key padding mask, and softmax application order for correctness
  • Numerical stability changes: Validate the -1e9 replacement for strict -infinity checks across the causal/non-causal paths and confirm the score-normalization adjustments (see the demonstration after this list)
  • Kernel indexing consolidation: Ensure kv_start_idx correctly replaces separate k_start_idx/v_start_idx throughout kernel wiring
  • Tiling configuration impact: Confirm new block sizes (128/128) and stage counts (2) are compatible with memory layout annotations and V_unpad indexing
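On the -1e9 vs. -infinity point in the second item above, a small self-contained demonstration of the failure mode being guarded against (this illustrates the general motivation, not the PR's exact code): a score row that is entirely -inf softmaxes to NaN, while -1e9 degrades gracefully to a uniform distribution.

import torch

scores = torch.full((1, 4), float("-inf"))
print(torch.softmax(scores, dim=-1))  # tensor([[nan, nan, nan, nan]])

scores = torch.full((1, 4), -1e9)
print(torch.softmax(scores, dim=-1))  # tensor([[0.2500, 0.2500, 0.2500, 0.2500]])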

Possibly related PRs

Suggested reviewers

  • LeiWang1999

Poem

🐰 With masked attention, windowed and bright,
Visibility layers now set things right,
Tiling expanded, performance takes flight,
Kernels refined with consolidated might,
GQA varlen shines in the light! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title "[Example] Update GQA varlen fwd" clearly and specifically identifies the main change, updating the grouped query attention (GQA) variable-length forward example. It is concise, free of vague terminology, and specific enough that a reviewer scanning the PR history would understand it concerns example_gqa_fwd_varlen.py. It does not enumerate every implementation detail (masking refactoring, kernel optimizations, etc.), but that level of detail is not expected in a title.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions bot commented Nov 2, 2025

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c85bb3a and 1cf3bd7.

📒 Files selected for processing (1)
  • examples/flash_attention/example_gqa_fwd_varlen.py (7 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/flash_attention/example_gqa_fwd_varlen.py (7)
tilelang/language/annotations.py (1)
  • annotate_layout (25-36)
tilelang/layout/swizzle.py (1)
  • make_swizzled_layout (10-18)
tilelang/language/copy.py (1)
  • copy (11-87)
tilelang/language/fill.py (2)
  • fill (9-21)
  • clear (24-48)
tilelang/language/pipeline.py (1)
  • Pipelined (9-46)
tilelang/language/tir/op.py (1)
  • if_then_else (2907-2937)
tilelang/language/reduce.py (1)
  • reduce_max (50-68)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Quick Lint

Comment on lines +130 to 135
loop_range = (
    T.min(
        T.ceildiv(q_current_seqlen +
                  (bx + 1) * block_M, block_N), T.ceildiv(kv_current_seqlen, block_N))
    if is_causal else T.ceildiv(kv_current_seqlen, block_N))


⚠️ Potential issue | 🟠 Major

Fix causal tile loop_range regression

For is_causal=True the new loop_range ends up as ceildiv(kv_current_seqlen, block_N) for every tile (e.g., q=kv=2048, block_M=block_N=128 ⇒ the very first tile now iterates all 16 K tiles instead of just the first one). We lose the triangular work reduction, so causal runs pay the full non-causal cost—latency blows up roughly ×16 on long sequences. Replace the sum with a clamp to the end of the current query tile.

-            loop_range = (
-                T.min(
-                    T.ceildiv(q_current_seqlen +
-                              (bx + 1) * block_M, block_N), T.ceildiv(kv_current_seqlen, block_N))
-                if is_causal else T.ceildiv(kv_current_seqlen, block_N))
+            loop_range = (
+                T.min(
+                    T.ceildiv(T.min(q_current_seqlen, (bx + 1) * block_M), block_N),
+                    T.ceildiv(kv_current_seqlen, block_N))
+                if is_causal else T.ceildiv(kv_current_seqlen, block_N))
🤖 Prompt for AI Agents
In examples/flash_attention/example_gqa_fwd_varlen.py around lines 130-135, the
causal branch builds loop_range by adding (bx+1)*block_M to q_current_seqlen,
which causes every causal tile to iterate up to the full KV length. Replace the
sum with a clamp to the end of the current query tile, so the loop range uses
the minimum of q_current_seqlen and (bx+1)*block_M before taking ceildiv, and
then takes the min of that with the KV ceildiv. Implement this by computing
tile_end = min(q_current_seqlen, (bx+1)*block_M), using ceildiv(tile_end,
block_N), and then min(...) with ceildiv(kv_current_seqlen, block_N) for the
causal case (the non-causal case is unchanged).
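A quick numeric check of the regression and the suggested fix, using the reviewer's own example (q = kv = 2048, block_M = block_N = 128); the ceildiv helper below is a plain-Python stand-in for T.ceildiv:

from math import ceil

def ceildiv(a, b):  # stand-in for T.ceildiv
    return ceil(a / b)

q_len = kv_len = 2048
block_M = block_N = 128
bx = 0  # first query tile

# Current code: the sum lets even the first tile scan the full KV range.
old = min(ceildiv(q_len + (bx + 1) * block_M, block_N), ceildiv(kv_len, block_N))
# Suggested fix: clamp to the end of the current query tile first.
new = min(ceildiv(min(q_len, (bx + 1) * block_M), block_N), ceildiv(kv_len, block_N))

print(old, new)  # 16 1 -> the first tile visits 1 K tile instead of 16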

