
Conversation

@LeiWang1999 (Member) commented Sep 30, 2025

  • Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
  • Refactored import paths in benchmark_nsa_fwd.py for better organization.
  • Added a new function to generate configurations for autotuning.
  • Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
  • Changed allocation of shared memory for accumulators to optimize performance.

Summary by CodeRabbit

  • New Features

    • Introduced autotuning options in the sparse attention benchmark, including configurable block size, pipeline stages, and thread count.
    • Added automatic configuration generation to explore tuning combinations for better performance.
    • Streamlined benchmark execution by using compiled kernels directly.
  • Chores

    • Pinned a dependency to a specific commit for reproducible installations.
    • Updated ignore rules to exclude AI assistant artifacts.
  • Refactor

    • Simplified imports to use consolidated utility paths.
    • Minor internal layout and parameter handling adjustments with no external behavior changes.

@coderabbitai (coderabbitai bot, Contributor) commented Sep 30, 2025

Walkthrough

Adds a new ignore entry for .claude in .gitignore; introduces autotuning support and a new get_configs() helper in the NSA forward benchmark, along with an updated kernel signature and import path; and pins flash-linear-attention to a specific commit in the requirements.

Changes

  • Repo ignore updates (/.gitignore): Add pattern **/.claude to ignore Claude tool directories; retain existing entries.
  • NSA benchmark autotune integration (examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py): Change the import of prepare_token_indices to fla.ops.utils; add get_configs() to generate autotune configurations; wrap the kernel with TileLang autotune using the generated configs; update the tilelang_sparse_attention signature to include block_T, num_stages, and threads (with defaults); adjust the benchmark to use the compiled kernel directly; internal layout and parameter-handling tweaks. (A minimal sketch of the resulting entry point follows this list.)
  • Dependency pinning (examples/deepseek_nsa/requirements.txt): Pin flash-linear-attention to commit c3bd56589033610264532b11f0972c69e4645f6e via a git+ URL.
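
For orientation, here is a minimal sketch of how the autotuned entry point could look after this change. Only the parameter names block_T, num_stages, and threads and the config values inside get_configs() come from this page; the shape arguments, default values, and decorator details are illustrative assumptions rather than the exact code in the PR.

import itertools

import tilelang


def get_configs():
    # Cartesian product over the tuned parameters (values taken from the review below).
    iter_params = dict(
        block_T=[128, 256, 512],
        num_stages=[0, 1, 2, 4, 5],
        threads=[32, 64, 128, 256, 512],
    )
    return [dict(zip(iter_params, values)) for values in itertools.product(*iter_params.values())]


@tilelang.autotune(configs=get_configs())
@tilelang.jit  # hypothetical: the exact jit arguments used in the PR are not shown on this page
def tilelang_sparse_attention(batch, heads, seq_len, dim,  # hypothetical shape arguments
                              block_T=128, num_stages=2, threads=128):  # illustrative defaults
    # Kernel body elided; the autotuner sweeps block_T / num_stages / threads over the
    # configs produced by get_configs() and keeps the fastest compiled variant.
    ...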

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor User
  participant Benchmark as benchmark_nsa_fwd.py
  participant TileLang as TileLang Autotune
  participant Kernel as SparseAttention Kernel
  participant FLA as flash-linear-attention

  User->>Benchmark: Run benchmark_nsa(...)
  Benchmark->>Benchmark: get_configs()
  Benchmark->>TileLang: decorate tilelang_sparse_attention(configs)
  TileLang->>Kernel: compile variants (block_T, num_stages, threads)
  Note over TileLang,Kernel: Autotune selects best-performing variant
  Benchmark->>Kernel: invoke compiled kernel
  Kernel->>FLA: use utils (prepare_token_indices)
  Kernel-->>Benchmark: results
  Benchmark-->>User: benchmark output

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I tuned my tiles by moonlit byte,
Threads hop in rows—a rhythmic sight.
Block_Ts stack like carrot stacks,
Stages line the bunny tracks.
Pinned a dep, ignored a clue,
Faster hops—benchmark, woo! 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 40.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately highlights the two central changes (pinning the flash-linear-attention dependency to a specific commit and optimizing the NSA examples) and concisely summarizes the pull request's primary objective.


@github-actions bot commented

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@coderabbitai (coderabbitai bot, Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (3)
examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (3)

442-452: Consider the autotune search space size.

The configuration generates 75 combinations (3 × 5 × 5). This could result in lengthy autotune times, especially with the default warmup (25) and rep (100) settings from the autotune decorator.

Consider either:

  1. Reducing the search space for faster iteration during development
  2. Documenting expected autotune duration
  3. Adding a smaller "quick" configuration set for testing

Example of a reduced search space:

 def get_configs():
     import itertools
     iter_params = dict(
-        block_T=[128, 256, 512],
-        num_stages=[0, 1, 2, 4, 5],
-        threads=[32, 64, 128, 256, 512],
+        block_T=[128, 256],
+        num_stages=[1, 2, 4],
+        threads=[128, 256],
     )
     return [{
         k: v for k, v in zip(iter_params, values)
     } for values in itertools.product(*iter_params.values())]

454-454: Remove trailing comma.

Minor style nitpick: the trailing comma after get_configs() is unnecessary.

-@tilelang.autotune(configs=get_configs(),)
+@tilelang.autotune(configs=get_configs())

515-515: Consider whether Q/K/V shared buffers need layout annotations.

Only O_shared has a swizzled layout annotation. Depending on the access patterns, Q_shared, K_shared, and V_shared might also benefit from layout annotations for better memory performance.

If the Q/K/V buffers also have strided access patterns that could benefit from swizzling, consider adding layout annotations:

-T.annotate_layout({O_shared: tilelang.layout.make_swizzled_layout(O_shared)})
+T.annotate_layout({
+    Q_shared: tilelang.layout.make_swizzled_layout(Q_shared),
+    K_shared: tilelang.layout.make_swizzled_layout(K_shared),
+    V_shared: tilelang.layout.make_swizzled_layout(V_shared),
+    O_shared: tilelang.layout.make_swizzled_layout(O_shared)
+})
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f92de93 and 709947c.

📒 Files selected for processing (3)
  • .gitignore (1 hunk)
  • examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (6 hunks)
  • examples/deepseek_nsa/requirements.txt (1 hunk)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (4)
tilelang/autotuner/tuner.py (1)
  • autotune (701-794)
tilelang/jit/__init__.py (1)
  • jit (237-310)
tilelang/math/__init__.py (1)
  • next_power_of_2 (1-2)
tilelang/language/__init__.py (1)
  • annotate_layout (106-144)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: format-check
🔇 Additional comments (7)
.gitignore (1)

96-97: LGTM!

The .claude ignore pattern is properly formatted and consistent with other directory-level ignores in the file.
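
For completeness, the entry described above takes this form (its placement within .gitignore is not shown on this page):

# Ignore Claude assistant tool directories at any depth
**/.claude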

examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py (5)

464-467: LGTM! Autotune parameters properly integrated.

The new tuning parameters (block_T, num_stages, threads) are well-integrated with sensible defaults, allowing the function to work both with and without autotuning.


481-481: LGTM! Proper validation of block_T.

Clamping block_T to the next power of 2 of dim prevents invalid configurations and ensures memory alignment.
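
As a rough illustration of what such a clamp can look like, using the tilelang.math helper listed in the code graph above (the function below is a hypothetical reconstruction, not the benchmark's actual expression):

from tilelang.math import next_power_of_2


def clamp_block_T(block_T: int, dim: int) -> int:
    # Cap the tuned tile size at the power-of-two-rounded head dimension so that
    # autotune configurations stay valid and memory accesses stay aligned.
    return min(block_T, next_power_of_2(dim))


# e.g. clamp_block_T(512, 96) -> 128, assuming the usual next-power-of-two semantics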


507-507: Verify the memory allocation change is intentional.

The allocation of acc_s_cast has changed from fragment (register) memory to shared memory. This could impact performance, as shared memory has higher latency than registers.

Please confirm:

  1. Is this change necessary for correctness with the new autotune configurations?
  2. Have you measured the performance impact of this change?

If this was done to support larger block sizes or fix a compilation issue, please document the reasoning.
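
For readers unfamiliar with the distinction, here is a minimal TileLang-style sketch of the two allocation forms under discussion; the shapes, dtypes, and surrounding kernel structure are illustrative, not the benchmark's actual code:

import tilelang.language as T


@T.prim_func
def sketch(Q: T.Tensor((64, 128), "float16")):
    with T.Kernel(1, threads=128) as bx:
        # Register-fragment allocation: per-thread registers (the previous form).
        acc_s = T.alloc_fragment((64, 64), "float")
        # Shared-memory allocation: visible to the whole thread block, higher latency (the new form).
        acc_s_cast = T.alloc_shared((64, 64), "float16")
        T.clear(acc_s)
        # Down-cast copy from the register fragment into the shared buffer.
        T.copy(acc_s, acc_s_cast)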


611-621: LGTM! Clearer variable naming.

Renaming from program to kernel better reflects that this is a compiled, executable kernel rather than a program representation.


13-13: Cannot locate prepare_token_indices; please verify import
The search didn’t find any definition or re-export of prepare_token_indices under fla/ops/utils.py. Manually confirm that fla.ops.utils.prepare_token_indices is present in the pinned flash-linear-attention (commit c3bd565).
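
A quick way to confirm this after installing the pinned requirements; the import path below is exactly the one the benchmark now uses:

# Run in the environment where examples/deepseek_nsa/requirements.txt was installed;
# an ImportError here means the pinned commit does not expose the symbol.
from fla.ops.utils import prepare_token_indices

print(prepare_token_indices)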

examples/deepseek_nsa/requirements.txt (1)

1-1: Pinned commit verified. The specified commit exists in fla-org/flash-linear-attention; no further action required.
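
The pin described above takes roughly this form in examples/deepseek_nsa/requirements.txt; the repository and commit hash come from this page, while the exact requirement syntax (PEP 508 '@' form versus a bare git+ URL) is an assumption:

flash-linear-attention @ git+https://github.com/fla-org/flash-linear-attention.git@c3bd56589033610264532b11f0972c69e4645f6e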

@LeiWang1999 merged commit 3ad6202 into tile-ai:main on Sep 30, 2025; 6 of 7 checks passed.
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request on Nov 24, 2025: …itory and optimize nsa examples (tile-ai#913)