[Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples #913
Conversation
…ample
- Updated requirements.txt to specify a fixed commit for the flash-linear-attention repository.
- Refactored import paths in benchmark_nsa_fwd.py for better organization.
- Added a new function to generate configurations for autotuning.
- Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
- Changed the accumulator allocation to shared memory to optimize performance.
Walkthrough
Adds a new ignore entry for `.claude` in `.gitignore`, introduces autotuning support and a new `get_configs()` in the NSA forward benchmark with an updated kernel signature and import path, and pins flash-linear-attention to a specific commit in requirements.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
actor User
participant Benchmark as benchmark_nsa_fwd.py
participant TileLang as TileLang Autotune
participant Kernel as SparseAttention Kernel
participant FLA as flash-linear-attention
User->>Benchmark: Run benchmark_nsa(...)
Benchmark->>Benchmark: get_configs()
Benchmark->>TileLang: decorate tilelang_sparse_attention(configs)
TileLang->>Kernel: compile variants (block_T, num_stages, threads)
Note over TileLang,Kernel: Autotune selects best-performing variant
Benchmark->>Kernel: invoke compiled kernel
Kernel->>FLA: use utils (prepare_token_indices)
Kernel-->>Benchmark: results
Benchmark-->>User: benchmark output
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the project's format check before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Actionable comments posted: 0
🧹 Nitpick comments (3)
examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (3)
442-452: Consider the autotune search space size.
The configuration generates 75 combinations (3 × 5 × 5). This could result in lengthy autotune times, especially with the default warmup (25) and rep (100) settings from the autotune decorator.
Consider one of the following:
- Reducing the search space for faster iteration during development
- Documenting expected autotune duration
- Adding a smaller "quick" configuration set for testing
Example of a reduced search space:
```diff
 def get_configs():
     import itertools
     iter_params = dict(
-        block_T=[128, 256, 512],
-        num_stages=[0, 1, 2, 4, 5],
-        threads=[32, 64, 128, 256, 512],
+        block_T=[128, 256],
+        num_stages=[1, 2, 4],
+        threads=[128, 256],
     )
     return [{
         k: v for k, v in zip(iter_params, values)
     } for values in itertools.product(*iter_params.values())]
```
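As a rough sanity check on those numbers, the size of either search space is just the product of the option counts; the following standalone sketch (not part of the benchmark) compares the two:

```python
import itertools

def count_configs(iter_params: dict) -> int:
    # Number of autotune candidates = product of the option counts.
    return len(list(itertools.product(*iter_params.values())))

full = dict(block_T=[128, 256, 512], num_stages=[0, 1, 2, 4, 5], threads=[32, 64, 128, 256, 512])
reduced = dict(block_T=[128, 256], num_stages=[1, 2, 4], threads=[128, 256])
print(count_configs(full), count_configs(reduced))  # 75 vs. 12 candidate kernels
```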
454-454: Remove trailing comma.
Minor style nitpick: the trailing comma after `get_configs()` is unnecessary.

```diff
-@tilelang.autotune(configs=get_configs(),)
+@tilelang.autotune(configs=get_configs())
```
515-515: Consider whether Q/K/V shared buffers need layout annotations.
Only `O_shared` has a swizzled layout annotation. Depending on the access patterns, `Q_shared`, `K_shared`, and `V_shared` might also benefit from layout annotations for better memory performance.
If the Q/K/V buffers also have strided access patterns that could benefit from swizzling, consider adding layout annotations:

```diff
-T.annotate_layout({O_shared: tilelang.layout.make_swizzled_layout(O_shared)})
+T.annotate_layout({
+    Q_shared: tilelang.layout.make_swizzled_layout(Q_shared),
+    K_shared: tilelang.layout.make_swizzled_layout(K_shared),
+    V_shared: tilelang.layout.make_swizzled_layout(V_shared),
+    O_shared: tilelang.layout.make_swizzled_layout(O_shared)
+})
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .gitignore (1 hunk)
- examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (6 hunks)
- examples/deepseek_nsa/requirements.txt (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (4)
- tilelang/autotuner/tuner.py (1)
  - autotune (701-794)
- tilelang/jit/__init__.py (1)
  - jit (237-310)
- tilelang/math/__init__.py (1)
  - next_power_of_2 (1-2)
- tilelang/language/__init__.py (1)
  - annotate_layout (106-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: format-check
🔇 Additional comments (7)
.gitignore (1)
96-97: LGTM!
The `.claude` ignore pattern is properly formatted and consistent with other directory-level ignores in the file.

examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (5)
464-467: LGTM! Autotune parameters properly integrated.
The new tuning parameters (`block_T`, `num_stages`, `threads`) are well integrated with sensible defaults, allowing the function to work both with and without autotuning.
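As a generic illustration of that pattern (not the benchmark's actual code; the names and default values below are placeholders), exposing the tuning knobs as keyword arguments with working defaults lets the same builder be called directly or swept by an autotuner:

```python
# Hypothetical sketch of the "tunable kwargs with defaults" pattern; the real
# tilelang_sparse_attention signature and defaults may differ.
def build_sparse_attention(seq_len, dim, block_T=128, num_stages=2, threads=128):
    # The real benchmark would compile a TileLang kernel specialized for these
    # parameters; here we simply echo the chosen configuration.
    return {"block_T": block_T, "num_stages": num_stages, "threads": threads}

# Direct call: works without any autotuner thanks to the defaults.
print(build_sparse_attention(seq_len=4096, dim=128))

# Under autotuning, the decorator would instead invoke the builder once per
# entry of get_configs() and keep the fastest compiled variant.
```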
481-481: LGTM! Proper validation of block_T.
Clamping `block_T` to the next power of 2 of `dim` prevents invalid configurations and ensures memory alignment.
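For illustration only, a pure-Python stand-in for `tilelang.math.next_power_of_2` shows one plausible form of that clamp (the exact expression in the benchmark may differ):

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n (stand-in for tilelang.math.next_power_of_2).
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

dim = 96       # hypothetical feature dimension
block_T = 512  # candidate tile size from an autotune config

# Cap the tile at the next power of two covering dim so every autotune
# configuration stays valid and power-of-two aligned.
block_T = min(block_T, next_power_of_2(dim))
print(block_T)  # 128
```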
507-507: Verify the memory allocation change is intentional.
The allocation of `acc_s_cast` has changed from fragment (register) memory to shared memory. This could impact performance, as shared memory has higher latency than registers.
Please confirm:
- Is this change necessary for correctness with the new autotune configurations?
- Have you measured the performance impact of this change?
If this was done to support larger block sizes or fix a compilation issue, please document the reasoning.
611-621: LGTM! Clearer variable naming.
Renaming from `program` to `kernel` better reflects that this is a compiled, executable kernel rather than a program representation.
13-13: Cannot locate prepare_token_indices; please verify import.
The search didn’t find any definition or re-export of `prepare_token_indices` under `fla/ops/utils.py`. Manually confirm that `fla.ops.utils.prepare_token_indices` is present in the pinned flash-linear-attention (commit c3bd565).
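One quick way to perform that manual check, assuming the pinned flash-linear-attention commit is installed in the current environment, is to attempt the import directly:

```python
# Raises ImportError if the symbol is missing or was moved in the pinned revision.
from fla.ops.utils import prepare_token_indices  # noqa: F401
print("fla.ops.utils.prepare_token_indices is importable")
```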
examples/deepseek_nsa/requirements.txt (1)
1-1: Pinned commit verified. The specified commit exists in fla-org/flash-linear-attention; no further action required.
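For reference, pinning a Git dependency in requirements.txt typically uses pip's direct-reference syntax; the line in this PR may be formatted differently, and the SHA below is a placeholder rather than the actual pinned commit:

```text
flash-linear-attention @ git+https://github.com/fla-org/flash-linear-attention.git@<full-commit-sha>
```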