[Feature]: Add benchmark scripts for examples #1240
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Walkthrough
Adds a lightweight benchmarking framework (tilelang/tools/bench.py), ~30 per-example bench entry scripts under examples/* that register bench_* wrappers calling example main() via process_func, CI + maintenance updates to parse results and upload an image, and a runtime log artifact.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant BenchMain as tilelang.tools.bench.main()
    participant Discover as discover bench_ funcs
    participant BenchFn as bench_*()
    participant Proc as process_func()
    participant Warm as warmup runs
    participant Runs as timed runs
    participant Records as record store
    participant Analyze as analyze_records()
    participant Output as bench.md / bench_result.png
    User->>BenchMain: run examples/.../bench_*.py or bench_all()
    BenchMain->>Discover: introspect caller file(s) for bench_* functions
    Discover-->>BenchMain: list of bench functions
    loop for each bench_* function
        BenchMain->>BenchFn: invoke bench_*()
        BenchFn->>Proc: process_func(example.main, ...)
        Proc->>Warm: perform warmup runs (ignored)
        Proc->>Runs: perform timed runs, collect latencies/failures
        Runs-->>Proc: latencies / failure info
        Proc-->>Records: append result entry
    end
    BenchMain->>Analyze: analyze_records(Records)
    Analyze-->>Output: emit bench.md and bench_result.png
    Output-->>User: display summary and files
```
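To make the diagram concrete, a per-example bench entry script under examples/* looks roughly like the sketch below. The example_gemm module name is just an illustration; process_func and main are the helpers described in the walkthrough, and their exact signatures may differ.

```python
# Hypothetical sketch of a per-example bench entry script (e.g. examples/gemm/bench_example_gemm.py).
# Assumes process_func(func, **kwargs) performs the warmup + timed runs and records the result,
# and that tilelang.tools.bench.main() discovers every bench_* function in the calling file.
import tilelang.tools.bench
import example_gemm  # the example module whose main() is being benchmarked


def bench_example_gemm():
    # Register a timed run of the example's main() entry point.
    tilelang.tools.bench.process_func(example_gemm.main)


if __name__ == "__main__":
    # Discover and run all bench_* functions in this file, then emit bench.md / bench_result.png.
    tilelang.tools.bench.main()
```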
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z — Applied to files:
🔇 Additional comments (5)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 17
🧹 Nitpick comments (2)
examples/gemm/bench_example_gemm.py (1)
1-29: Note potential duplication with bench_gemm.py.
This file appears to duplicate functionality from examples/gemm/bench_gemm.py in the same directory. Both files define identical benchmark wrapper functions for the same example modules. Consider consolidating these into a single benchmark file unless there's a specific reason for maintaining both.

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)

1-3: Consider reordering imports for clarity.
Import tilelang before its sub-module tilelang.tools.bench for better readability.
Apply this diff:

```diff
-import tilelang.tools.bench
 import tilelang
+import tilelang.tools.bench
 import tilelang_example_sparse_tensorcore
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm/bench_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (29)
examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

tilelang/tools/bench.py (16)
- examples/attention_sink/bench_example_attention_sink.py (1): main (66-67)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1): main (14-15)
- examples/analyze/bench_example_analyze.py (1): main (14-15)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1): main (34-35)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1): main (9-10)
- examples/cast/bench_example_cast.py (1): main (14-15)
- examples/convolution/bench_example_convolution.py (1): main (16-17)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1): main (11-12)
- examples/deepseek_mla/bench_example_mla_decode.py (1): main (11-12)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1): main (35-36)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1): main (45-46)
- examples/dynamic_shape/bench_example_dynamic.py (1): main (9-10)
- examples/elementwise/bench_example_elementwise.py (1): main (9-10)
- examples/flash_attention/bench_example_flash_attention.py (1): main (87-88)
- examples/flash_decoding/bench_example_flash_decoding.py (1): main (16-17)
- examples/fusedmoe/bench_example_fusedmoe.py (1): main (9-10)

examples/gemm/bench_example_gemm.py (2)
- examples/gemm/bench_gemm.py (4): bench_example_gemm_autotune (8-9), bench_example_gemm_intrinsics (12-13), bench_example_gemm_schedule (16-17), bench_example_gemm (20-21)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/seer_attention/bench_block_sparse_attn_tilelang.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (7)
- examples/deepseek_v32/sparse_mla_bwd.py (2): sparse_mla_bwd (283-320), test_sparse_mla_bwd (334-384)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- examples/deepseek_v32/topk_selector.py (1): test_topk_selector (188-245)
- examples/deepseek_v32/fp8_lighting_indexer.py (1): test_fp8_lighting_indexer (260-302)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)
- examples/deepseek_v32/sparse_mla_fwd.py (1): test_sparse_mla_fwd (253-299)
- examples/deepseek_v32/sparse_mla_fwd_pipelined.py (1): test_sparse_mla_fwd_pipelined (400-452)

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version (38-102)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/dynamic_shape/bench_example_dynamic.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/flash_decoding/bench_example_flash_decoding.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_mla/bench_example_mla_decode.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/dequantize_gemm/bench_example_dequantize_gemm.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm/bench_gemm.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/warp_specialize/bench_example_warp_specialize.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_eq (113-114)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm_fp8/bench_example_gemm_fp8.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/topk/bench_topk_tilelang.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/elementwise/bench_example_elementwise.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/analyze/bench_example_analyze.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm_splitk/bench_example_gemm_splitk.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/flash_attention/bench_example_flash_attention.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/convolution/bench_example_convolution.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/fusedmoe/bench_example_fusedmoe.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/linear_attention/bench_linear_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemv/bench_example_gemv.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_eq (113-114)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/minference/bench_vs_sparse_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/cast/bench_example_cast.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
33-34: try-except-pass detected, consider logging the exception
(S110)
33-33: Do not catch blind exception: Exception
(BLE001)
45-45: Do not catch blind exception: Exception
(BLE001)
66-66: Use of exec detected
(S102)
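If the maintainers prefer to address the S110/BLE001 findings rather than suppress them, one conventional pattern is to log warmup failures instead of silently passing. A minimal sketch (run_once is a stand-in, not the project's actual API):

```python
# Sketch only: logging warmup failures instead of silently swallowing them (Ruff S110/BLE001).
import logging

logger = logging.getLogger(__name__)


def warmup(run_once, times=3):
    # run_once is a zero-argument callable standing in for one benchmark invocation.
    for _ in range(times):
        try:
            run_once()
        except Exception as exc:  # noqa: BLE001 - warmup failures are non-fatal by design
            logger.debug("warmup run failed: %r", exc)
```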
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (32)
examples/gemm_splitk/bench_example_gemm_splitk.py (1)
1-19: LGTM! The benchmark wrapper correctly follows the established pattern and properly integrates the GEMM split-K examples into the benchmarking framework.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-24: LGTM! The benchmark wrapper correctly integrates three FP8 GEMM variants into the benchmarking framework with a clean, consistent structure.
examples/gemv/bench_example_gemv.py (1)
1-14: LGTM! The benchmark wrapper correctly integrates the GEMV example into the benchmarking framework with the expected structure.
examples/dynamic_shape/bench_example_dynamic.py (1)
1-14: LGTM! The benchmark wrapper correctly integrates the dynamic shape example into the benchmarking framework with a clean, consistent structure.
examples/fusedmoe/bench_example_fusedmoe.py (1)
1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern, delegating to the shared tilelang.tools.bench utility.

examples/analyze/bench_example_analyze.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both GEMM and convolution analyze examples into the benchmarking framework.
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern.

examples/elementwise/bench_example_elementwise.py (1)

1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern.

examples/cast/bench_example_cast.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both FP8 casting examples into the benchmarking framework.

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both NSA forward and decode examples into the benchmarking framework.
tilelang/tools/bench.py (2)
11-24: LGTM! The context manager properly handles resource cleanup in the finally block, ensuring devnull is closed even if exceptions occur during setup.
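For illustration only, a devnull-redirecting context manager with cleanup in a finally block (the pattern praised here) can be sketched as follows; the real helper in tilelang/tools/bench.py may differ in name and details:

```python
# Hypothetical sketch of an output-suppressing context manager that closes devnull in `finally`.
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_output():
    devnull = open(os.devnull, "w")
    old_stdout, old_stderr = sys.stdout, sys.stderr
    try:
        sys.stdout = devnull
        sys.stderr = devnull
        yield
    finally:
        # Restore the streams and close devnull even if the body raised.
        sys.stdout, sys.stderr = old_stdout, old_stderr
        devnull.close()
```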
27-58: LGTM! The benchmarking logic correctly handles warmup, timing, and failure reporting. Silently ignoring warmup exceptions is appropriate for benchmarking scenarios.
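To give a feel for the warmup/timing/failure handling described above, here is a simplified, hypothetical stand-in for process_func; the repeat=10 and warmup=3 defaults are taken from a later review comment, and the record format is invented for illustration:

```python
# Simplified sketch of a process_func-style warmup + timed-run loop (not the actual implementation).
import time

records = []  # each entry: (label, list of latencies in ms) or (label, error string)


def process_func(func, repeat=10, warmup=3, **kwargs):
    # Warmup runs: exceptions are deliberately ignored so a flaky warmup does not abort the suite.
    for _ in range(warmup):
        try:
            func(**kwargs)
        except Exception:
            pass

    latencies = []
    try:
        for _ in range(repeat):
            start = time.perf_counter()
            func(**kwargs)
            latencies.append((time.perf_counter() - start) * 1000.0)
        records.append((getattr(func, "__module__", str(func)), latencies))
    except Exception as exc:  # record the failure instead of crashing the whole benchmark run
        records.append((getattr(func, "__module__", str(func)), f"FAILED: {exc!r}"))
```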
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
5-8: Verify correct decorator usage.
Ensure the CUDA compute version requirement (9.0) matches the hardware constraints of the example_deepgemm_fp8_2xAcc module.

11-16: LGTM! The main function and entry point follow the standard benchmarking pattern correctly.

examples/gemm/bench_gemm.py (1)

1-29: LGTM! All benchmark wrapper functions correctly delegate to tilelang.tools.bench.process_func and the main entry point follows the established pattern.

examples/convolution/bench_example_convolution.py (2)

6-13: Verify the decorator mismatch with AI summary.
The AI summary states that bench_example_convolution uses requires_cuda_compute_version_eq(9, 0), but Line 7 shows requires_cuda_compute_version_le(8, 9). Additionally, the summary indicates bench_example_convolution_autotune is decorated with requires_cuda, but Line 12 shows no decorators.

16-21: LGTM! The main function and entry point are correctly implemented.
examples/flash_decoding/bench_example_flash_decoding.py (2)
6-13: LGTM! Both benchmark functions correctly delegate to the benchmarking utility. The decorator usage appropriately restricts hardware requirements for the GQA decode benchmark.

16-21: LGTM! The main function and entry point follow the standard pattern.
examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (2)
9-32: LGTM! All benchmark functions correctly delegate to the appropriate test functions with proper CUDA constraints where needed.

35-40: LGTM! The main function and entry point are correctly implemented.
examples/warp_specialize/bench_example_warp_specialize.py (2)
8-29: LGTM! All benchmark wrapper functions correctly use CUDA decorators and delegate to the appropriate example modules.

32-37: LGTM! The main function and entry point follow the established pattern correctly.
examples/dequantize_gemm/bench_example_dequantize_gemm.py (2)
10-43: LGTM: consistent benchmark wrappers.
The benchmark functions correctly delegate to process_func and apply appropriate CUDA guards. Compute version 9.0 requirements for Hopper-specific variants are appropriate.

45-50: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/blocksparse_attention/bench_example_blocksparse_attention.py (2)
1-7: LGTM: imports are complete.
All necessary imports are present.

34-39: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/attention_sink/bench_example_attention_sink.py (3)
1-7: LGTM: imports are complete.
All necessary imports, including tilelang.testing, are present.

10-64: LGTM: comprehensive attention sink benchmarks.
The benchmark suite covers MHA and GQA variants (forward and backward) with proper CUDA guards and compute version requirements. Sliding window variants correctly pass window_size=128.

66-71: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/flash_attention/bench_example_flash_attention.py (2)
17-85: LGTM: comprehensive flash attention benchmarks.
The benchmark suite covers GQA and MHA variants (forward and backward passes, various memory layouts) with proper CUDA guards. Compute version 9.0 requirements for WGMMA-pipelined variants are appropriate.

87-92: LGTM: standard entry point.
The main function and script guard follow the established pattern.
```python
def bench_block_sparse_attn_triton():
    tilelang.tools.bench.process_func(block_sparse_attn_triton.main)


def bench_example_tilelang_block_sparse_attn():
    tilelang.tools.bench.process_func(example_tilelang_block_sparse_attn.main)


def bench_example_tilelang_sparse_gqa_decode_varlen_indice():
    tilelang.tools.bench.process_func(example_tilelang_sparse_gqa_decode_varlen_indice.main)


def bench_example_tilelang_sparse_gqa_decode_varlen_mask():
    tilelang.tools.bench.process_func(example_tilelang_sparse_gqa_decode_varlen_mask.main)


def bench_example_triton_sparse_gqa_decode_varlen_indice():
    tilelang.tools.bench.process_func(example_triton_sparse_gqa_decode_varlen_indice.main)


def bench_example_triton_sparse_gqa_decode_varlen_mask():
    tilelang.tools.bench.process_func(example_triton_sparse_gqa_decode_varlen_mask.main)
```
🧩 Analysis chain
Verify CUDA requirements for block-sparse attention benchmarks.
Unlike other benchmark files in this PR (e.g., bench_example_dequantize_gemm.py, bench_example_attention_sink.py), these functions lack @tilelang.testing.requires_cuda decorators. Since block-sparse attention typically requires GPU execution, confirm whether these benchmarks can run on CPU or if decorators should be added.
Run the following script to check if the imported modules have any CUDA dependencies:
🏁 Script executed:
```shell
#!/bin/bash
# Check for CUDA usage in the block sparse attention example modules
echo "=== Checking for CUDA/GPU usage in example modules ==="
for module in block_sparse_attn_triton example_tilelang_block_sparse_attn example_tilelang_sparse_gqa_decode_varlen_indice example_tilelang_sparse_gqa_decode_varlen_mask example_triton_sparse_gqa_decode_varlen_indice example_triton_sparse_gqa_decode_varlen_mask; do
  echo "--- Module: $module ---"
  fd "${module}.py" examples/blocksparse_attention --exec cat {} \; | head -50
done
```

Length of output: 9127
---
🏁 Script executed:
```shell
#!/bin/bash
# Check if requires_cuda decorator exists and how it's used in other benchmarks
echo "=== Checking for requires_cuda usage in benchmark files ==="
fd "bench_.*\.py$" examples --exec grep -l "requires_cuda" {} \;
echo -e "\n=== Checking bench_example_dequantize_gemm.py and bench_example_attention_sink.py ==="
fd "bench_example_dequantize_gemm.py|bench_example_attention_sink.py" examples --exec head -40 {} \;
echo -e "\n=== Checking what tilelang.tools.bench.process_func does ==="
rg -n "def process_func" --type py -A 10
Length of output: 4482
---
🏁 Script executed:
```shell
#!/bin/bash
# Check imports in the blocksparse benchmark file
head -20 examples/blocksparse_attention/bench_example_blocksparse_attention.py
```

Length of output: 713
Add @tilelang.testing.requires_cuda decorators to all benchmark functions.
The benchmark file is missing CUDA requirement decorators that are consistently used across other benchmark files in the repository. All six benchmark functions in this file call GPU-exclusive kernels (Triton and TileLang JIT-compiled functions), so they require CUDA to run.
Fixes needed:
- Add import tilelang.testing at the top (line 1)
- Add the @tilelang.testing.requires_cuda decorator to all six benchmark functions:
  - bench_block_sparse_attn_triton() (line 10)
  - bench_example_tilelang_block_sparse_attn() (line 14)
  - bench_example_tilelang_sparse_gqa_decode_varlen_indice() (line 18)
  - bench_example_tilelang_sparse_gqa_decode_varlen_mask() (line 22)
  - bench_example_triton_sparse_gqa_decode_varlen_indice() (line 26)
  - bench_example_triton_sparse_gqa_decode_varlen_mask() (line 30)
🤖 Prompt for AI Agents
In examples/blocksparse_attention/bench_example_blocksparse_attention.py around
lines 1 and 10-32, the file lacks the CUDA test decorator and import: add
"import tilelang.testing" at the top of the file, then prepend
"@tilelang.testing.requires_cuda" to each of the six benchmark functions at the
specified lines — bench_block_sparse_attn_triton() (line 10),
bench_example_tilelang_block_sparse_attn() (line 14),
bench_example_tilelang_sparse_gqa_decode_varlen_indice() (line 18),
bench_example_tilelang_sparse_gqa_decode_varlen_mask() (line 22),
bench_example_triton_sparse_gqa_decode_varlen_indice() (line 26), and
bench_example_triton_sparse_gqa_decode_varlen_mask() (line 30) — so each
GPU-only benchmark is skipped when CUDA is unavailable.
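Applied to the first wrapper above, the suggested fix would look roughly like this (a sketch; as a later comment notes, the decorator only takes effect under pytest, not when the script is run via bench.main()):

```python
# Sketch of the suggested fix for one of the six GPU-only wrappers.
import tilelang.testing
import tilelang.tools.bench
import block_sparse_attn_triton


@tilelang.testing.requires_cuda
def bench_block_sparse_attn_triton():
    # Skipped by pytest when no CUDA device is available.
    tilelang.tools.bench.process_func(block_sparse_attn_triton.main)
```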
```python
import tilelang.tools.bench
import example_deepgemm_fp8_2xAcc
```
Missing import for decorator module.
Lines 5-6 use @tilelang.testing.requires_cuda and @tilelang.testing.requires_cuda_compute_version_eq decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_deepgemm_fp8_2xAcc
```
🤖 Prompt for AI Agents
In examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py around lines 1
to 6, the decorators @tilelang.testing.requires_cuda and
@tilelang.testing.requires_cuda_compute_version_eq are used but tilelang.testing
is not imported; add an import for the testing module (e.g., import
tilelang.testing or from tilelang import testing) near the top of the file
before the decorator usage so the decorators resolve correctly.
```python
import tilelang.tools.bench
import example_mla_decode
```
Critical: Missing import for decorators.
The decorators on lines 5-6 reference tilelang.testing but the module is not imported, which will cause a NameError when the module loads.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_mla_decode
```
🤖 Prompt for AI Agents
In examples/deepseek_mla/bench_example_mla_decode.py around lines 1 to 2, the
decorators later reference tilelang.testing but the module isn't imported,
causing a NameError; add an import for the testing module (e.g., import
tilelang.testing) at the top of the file so the decorators can resolve
correctly.
```python
import tilelang.tools.bench
import fp8_lighting_indexer
import sparse_mla_bwd
import sparse_mla_fwd
import sparse_mla_fwd_pipelined
import topk_selector
```
Missing import for decorator module.
Lines 17-18, 23-24, and 29-30 use @tilelang.testing decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import fp8_lighting_indexer
 import sparse_mla_bwd
 import sparse_mla_fwd
 import sparse_mla_fwd_pipelined
 import topk_selector
```
🤖 Prompt for AI Agents
In examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py around lines 1-6
(decorators used at lines ~17-30), the module providing the @tilelang.testing
decorator isn't imported; add an import for the testing module (e.g., import
tilelang.testing) near the other top-level imports so the @tilelang.testing
decorators resolve.
```python
import tilelang.tools.bench
import block_sparse_attn_tilelang
```
Missing import for decorator.
The @tilelang.testing.requires_cuda decorator on line 5 requires importing tilelang.testing, which is currently missing.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import block_sparse_attn_tilelang
```
🤖 Prompt for AI Agents
In examples/seer_attention/bench_block_sparse_attn_tilelang.py around lines 1 to
3, the file uses the @tilelang.testing.requires_cuda decorator but does not
import tilelang.testing; add an import for tilelang.testing (e.g., import
tilelang.testing) near the top alongside the existing imports so the decorator
is resolved.
```python
@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version(9, 0)
def bench_example_sparse_tensorcore():
    tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```
🧩 Analysis chain
Verify runtime CUDA requirement enforcement.
The pytest decorators (@requires_cuda, @requires_cuda_compute_version) only add metadata for pytest runs. When executing this script standalone via bench.main(), these decorators won't skip execution if requirements aren't met—the benchmark will attempt to run and likely fail with an unclear error.
Run this script to check if bench.process_func or the target module has built-in CUDA validation:
🏁 Script executed:
```shell
#!/bin/bash
# Check if bench.process_func or example modules validate CUDA requirements at runtime

# Check bench.process_func implementation for CUDA checks
ast-grep --pattern $'def process_func($$$) {
  $$$
}' | head -100

# Check if the example module has runtime CUDA guards
rg -n --type=py -C3 'cuda.*available|torch.*cuda.*is_available|requires.*cuda' tilelang_example_sparse_tensorcore.py
```

Length of output: 77
🏁 Script executed:
```shell
#!/bin/bash
# Find relevant files: bench module and example modules
echo "=== Finding bench-related files ==="
fd -e py 'bench' --type f | head -20
echo ""
echo "=== Finding example_sparse_tensorcore file ==="
fd 'sparse' --type f | head -20
echo ""
echo "=== Searching for process_func definition ==="
rg -n 'def process_func' --type py -A 10 | head -40
```

Length of output: 3080
🏁 Script executed:
```shell
#!/bin/bash
# Find sparse_tensorcore specifically and check full process_func
echo "=== Finding sparse_tensorcore files ==="
fd 'sparse_tensorcore' --type f
echo ""
echo "=== Checking examples directory structure ==="
ls -la examples/ | grep -i sparse || echo "No sparse directories in examples"
echo ""
echo "=== Getting full process_func implementation ==="
rg -n 'def process_func' --type py -A 30 tilelang/tools/bench.py
echo ""
echo "=== Checking for CUDA validation in bench.py ==="
rg -n 'cuda|requires_cuda' --type py tilelang/tools/bench.py
```

Length of output: 1653
🏁 Script executed:
```shell
#!/bin/bash
# Check the actual example module and benchmark function for runtime CUDA validation
echo "=== Contents of bench_example_sparse_tensorcore.py ==="
cat examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
echo ""
echo "=== Checking tilelang_example_sparse_tensorcore.py for CUDA guards ==="
cat examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py | head -50
echo ""
echo "=== Search for CUDA runtime checks in example module ==="
rg -n 'cuda|device|requires' --type py examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py | head -20
```

Length of output: 2519
🏁 Script executed:
```shell
#!/bin/bash
# Check bench.main() for decorator/CUDA handling
echo "=== Search for bench.main() implementation ==="
rg -n 'def main' --type py tilelang/tools/bench.py -A 20
echo ""
echo "=== Check if bench module handles decorators ==="
rg -n 'requires_cuda|decorator|mark' --type py tilelang/tools/bench.py
```

Length of output: 462
Add runtime CUDA validation to benchmark execution.
Pytest decorators (@requires_cuda, @requires_cuda_compute_version) are metadata-only and don't enforce requirements when bench.main() executes the file via exec() (tilelang/tools/bench.py:66). When bench_example_sparse_tensorcore() runs, it bypasses decorator logic and calls tilelang_example_sparse_tensorcore.main(), which unconditionally creates CUDA tensors (lines 109-111). On incompatible hardware, this fails with a confusing traceback instead of a clear skip message.
Add runtime CUDA validation to bench_example_sparse_tensorcore() before calling process_func():
```python
@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version(9, 0)
def bench_example_sparse_tensorcore():
    if not torch.cuda.is_available():
        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA")
        return
    if torch.cuda.get_device_capability()[0] < 9:
        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires compute capability >= 9.0")
        return
    tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```

🤖 Prompt for AI Agents
In examples/sparse_tensorcore/bench_example_sparse_tensorcore.py around lines 6
to 9, the benchmark relies on pytest decorators only, but the test runner calls
the example via exec so decorators don't prevent execution on non-CUDA hardware;
add explicit runtime CUDA validation before calling
tilelang.tools.bench.process_func: check torch.cuda.is_available() and if false
print a clear "SKIPPED: {__name__}.bench_example_sparse_tensorcore requires
CUDA" message and return; then check torch.cuda.get_device_capability()[0]
(major) and if < 9 print "SKIPPED: {__name__}.bench_example_sparse_tensorcore
requires compute capability >= 9.0" and return; keep the existing decorators and
only call process_func after both checks pass.
```python
import tilelang.tools.bench
import example_topk
```
Critical: Missing import for decorators.
The decorator on line 5 references tilelang.testing but the module is not imported, which will cause a NameError when the module loads.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_topk
```
🤖 Prompt for AI Agents
In examples/topk/bench_topk_tilelang.py around lines 1 to 2, the file imports
tilelang.tools.bench and example_topk but does not import tilelang.testing which
the decorator on line 5 references; add a top-level import for tilelang.testing
(e.g., import tilelang.testing) alongside the existing imports so the decorator
can resolve without a NameError.
```python
import tilelang.tools.bench
import example_warp_specialize_gemm_barrierpipe_stage2
import example_warp_specialize_gemm_copy_0_gemm_1
import example_warp_specialize_gemm_copy_1_gemm_0
import example_warp_specialize_gemm_softpipe_stage2
```
Missing import for decorator module.
Lines 8-9, 14-15, 20-21, and 26-27 use @tilelang.testing decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_warp_specialize_gemm_barrierpipe_stage2
 import example_warp_specialize_gemm_copy_0_gemm_1
 import example_warp_specialize_gemm_copy_1_gemm_0
 import example_warp_specialize_gemm_softpipe_stage2
```
🤖 Prompt for AI Agents
In examples/warp_specialize/bench_example_warp_specialize.py around lines 1 to
5, the module uses @tilelang.testing decorators later but never imports
tilelang.testing; add an import for the testing decorator (e.g., import
tilelang.testing or from tilelang import testing) near the top with the other
imports so the decorator references resolve.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- examples/topk/bench_topk_tilelang.py
- examples/linear_attention/bench_linear_attn.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py
- examples/convolution/bench_example_convolution.py
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
- examples/flash_decoding/bench_example_flash_decoding.py
🧰 Additional context used
🧬 Code graph analysis (6)
examples/minference/bench_vs_sparse_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_mla/bench_example_mla_decode.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/warp_specialize/bench_example_warp_specialize.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (6)
- examples/deepseek_v32/sparse_mla_bwd.py (2): sparse_mla_bwd (283-320), test_sparse_mla_bwd (334-384)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- examples/deepseek_v32/topk_selector.py (1): test_topk_selector (188-245)
- examples/deepseek_v32/fp8_lighting_indexer.py (1): test_fp8_lighting_indexer (260-302)
- examples/deepseek_v32/sparse_mla_fwd.py (1): test_sparse_mla_fwd (253-299)
- examples/deepseek_v32/sparse_mla_fwd_pipelined.py (1): test_sparse_mla_fwd_pipelined (400-452)

examples/flash_attention/bench_example_flash_attention.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🔇 Additional comments (15)
examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (4)
1-6: Past review comment is no longer applicable.
The previous review flagged a missing tilelang.testing import for decorators, but the current code contains no decorators. This concern has been resolved.

9-26: LGTM! The benchmark wrapper functions follow a consistent pattern, delegating to process_func with the appropriate test functions. All target test functions have default parameters, so calling them without arguments is safe.

29-30: LGTM! The main function correctly delegates to the benchmarking framework, which will auto-discover and execute all bench_* functions defined in this module.

33-34: LGTM! Standard Python entry point pattern correctly implemented.
examples/warp_specialize/bench_example_warp_specialize.py (3)
1-5: LGTM! Past review issue resolved.
The imports are clean and correct. The previous review flagged missing tilelang.testing for decorators, but the current code has removed those decorators entirely, resolving the issue.

8-21: Correct usage of the benchmarking framework.
All four benchmark functions correctly follow the framework pattern: each wraps an example module's main() with process_func and uses the bench_* naming convention for auto-discovery.

24-29: LGTM! The main() function correctly delegates to the benchmarking framework, and the entry point follows standard Python conventions.

examples/minference/bench_vs_sparse_attn.py (4)

5-6: LGTM! The bench function correctly wraps the example module's main() using process_func, following the established pattern across the PR. The function reference is passed correctly as a callable without invoking it.

9-10: LGTM! The main() function correctly delegates to tilelang.tools.bench.main(), which will auto-discover and execute all bench_* functions in this module.

13-14: LGTM! Standard Python entry point idiom is correctly implemented.

1-2: Imports are correct and verified.
The example_vertical_slash_sparse_attn module exists in the same directory and its main() function accepts no required arguments (argv=None), confirming it can be called without parameters as intended by the benchmarking wrapper.

examples/deepseek_mla/bench_example_mla_decode.py (4)
1-2: LGTM! Imports are correct.
The imports are appropriate for the benchmarking wrapper pattern. Note that the previous review comment about missing tilelang.testing is no longer applicable since there are no decorators in the current code.

9-10: LGTM! Standard entry point pattern.
The main() function correctly delegates to bench.main(), which will auto-discover and execute all bench_* functions in this module.

13-14: LGTM! Standard Python entry point.
The if __name__ == "__main__" guard is correct and allows the script to be executed directly or imported as a module.

5-6: Verification confirms correct implementation.
example_mla_decode.main exists at line 283 in examples/deepseek_mla/example_mla_decode.py and is properly referenced. The function correctly passes the function reference to process_func, and the bench_* naming convention enables auto-discovery. No issues found.
Actionable comments posted: 2
♻️ Duplicate comments (2)
examples/convolution/bench_example_convolution.py (1)
1-3: Past review comment appears outdated.
The existing review claims that lines 6-7 use @tilelang.testing.requires_cuda decorators and require importing tilelang.testing. However, the current code shows no decorators on lines 6-7, just plain function definitions. The current imports are sufficient for the code as written.

examples/attention_sink/bench_example_attention_sink.py (1)

55-56: Same __name__ guard issue; fix in bench.py is preferred.
The past review correctly identifies that this guard will raise NameError when tilelang.tools.bench.main() re-executes the file via exec() with an undefined __name__. However, the suggested globals().get("__name__") workaround would need to be duplicated across all ~30 bench scripts.
The root cause is in tilelang/tools/bench.py lines 64-66, where the exec namespace is created empty. Fixing it there (as commented in my review) will resolve the issue for all bench scripts at once.
🧹 Nitpick comments (1)
tilelang/tools/bench.py (1)
63-66: Add defensive checks for frame introspection and file operations.
Several robustness issues:
- inspect.getsourcefile() can return None if the frame lacks source information, which would cause line 65 to fail.
- The file is opened without explicit encoding (the platform default will be used).
- sys._getframe(1) is CPython-specific (the underscore prefix indicates a private API).

While the exec usage is flagged by static analysis (S102), it's likely acceptable for a benchmarking tool in a controlled environment.
Consider adding defensive checks:

```diff
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
+    if test_file is None:
+        raise RuntimeError("Cannot determine source file for caller")
     module = {}
-    with open(test_file) as f:
+    with open(test_file, encoding="utf-8") as f:
         exec(f.read(), module)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/tools/bench.py (1)
- examples/attention_sink/bench_example_attention_sink.py (1): main (51-52)

examples/convolution/bench_example_convolution.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
33-34: try-except-pass detected, consider logging the exception
(S110)
33-33: Do not catch blind exception: Exception
(BLE001)
45-45: Do not catch blind exception: Exception
(BLE001)
66-66: Use of exec detected
(S102)
🔇 Additional comments (1)
examples/attention_sink/bench_example_attention_sink.py (1)
9-48: LGTM! Clean benchmark organization.
The benchmark functions follow a consistent pattern:
- Base configurations call process_func(module.main) directly
- Sliding window variants pass window_size=128 as a kwarg
- Clear naming convention distinguishes MHA/GQA, forward/backward, and different implementations

The pattern assumes the underlying main() functions accept an optional window_size parameter, which should be verified during testing.
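For illustration, the two wrapper styles described above would look roughly like the following sketch; the example_mha_sink_fwd_bhsd module name is assumed, while process_func and window_size=128 come from the review:

```python
# Hypothetical sketch of a base wrapper and its sliding-window variant.
import tilelang.tools.bench
import example_mha_sink_fwd_bhsd  # assumed example module name


def bench_example_mha_sink_fwd_bhsd():
    # Base configuration: run the example's main() with its defaults.
    tilelang.tools.bench.process_func(example_mha_sink_fwd_bhsd.main)


def bench_example_mha_sink_fwd_bhsd_sliding_window():
    # Sliding-window variant: forward window_size=128 to the example's main().
    tilelang.tools.bench.process_func(example_mha_sink_fwd_bhsd.main, window_size=128)
```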
```python
module = {}
with open(test_file) as f:
    exec(f.read(), module)
```
Critical: exec namespace breaks __name__ guard in bench scripts.
The empty module = {} dict used as the exec namespace will cause NameError when bench scripts contain if __name__ == "__main__": guards. Python's exec will auto-populate __builtins__ but not __name__, __file__, or other module-level variables.
This explains the past review comment on examples/attention_sink/bench_example_attention_sink.py lines 55-56. All bench scripts using this pattern will fail at runtime when tilelang.tools.bench.main() re-executes them.
Fix by pre-populating the exec namespace with standard module globals:
```diff
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
-    module = {}
+    module = {"__name__": "__benchmarked__", "__file__": test_file}
     with open(test_file) as f:
         exec(f.read(), module)
```

Note: Using "__benchmarked__" (or any value other than "__main__") ensures the guard skips execution during discovery, preventing infinite recursion.
🧰 Tools
🪛 Ruff (0.14.4)
66-66: Use of exec detected
(S102)
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 64–66, the current exec(namespace) uses
module = {} which doesn't set module-level globals like __name__ or __file__,
breaking scripts that use if __name__ == "__main__": guards; to fix,
pre-populate the exec namespace with standard module globals before exec, e.g.
set '__name__' to a non-main sentinel such as '__benchmarked__' (so discovery
doesn't trigger main), set '__file__' to the test file path (and optionally
'__package__' if needed), then exec the file into that namespace; replace module
= {} with this pre-populated dict so bench scripts run without NameError and
avoid re-running their main during discovery.
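To put the suggested namespace fix in context, the discovery-and-run flow in bench.main() could be sketched as follows; the bench_ prefix, exec-based loading, and the pre-populated globals come from this review, while the rest is assumed:

```python
# Hypothetical sketch of bench.main() with the pre-populated exec namespace applied.
import inspect
import sys


def main():
    # Locate the bench script that called us.
    test_file = inspect.getsourcefile(sys._getframe(1))
    # Pre-populate module globals so `if __name__ == "__main__":` guards are defined
    # (and evaluate to False, avoiding recursive re-entry into main()).
    module = {"__name__": "__benchmarked__", "__file__": test_file}
    with open(test_file) as f:
        exec(f.read(), module)
    # Run every bench_* function discovered in the executed file.
    for name, obj in module.items():
        if name.startswith("bench_") and callable(obj):
            obj()
```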
Actionable comments posted: 2
♻️ Duplicate comments (8)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
6-7: Add runtime CUDA validation before calling process_func.
As flagged in a previous review, this benchmark lacks runtime validation of CUDA requirements. When executed via bench.main(), the script bypasses pytest decorator checks and will attempt to run on incompatible hardware, resulting in unclear error messages rather than a clean skip notification.
Add explicit checks before calling process_func:

```diff
+import torch
+
 def bench_example_sparse_tensorcore():
+    if not torch.cuda.is_available():
+        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA")
+        return
+    if torch.cuda.get_device_capability()[0] < 9:
+        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires compute capability >= 9.0")
+        return
     tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```

examples/gemm/bench_gemm.py (1)
24-25: Consider using standard __name__ check.
Same as the comment on examples/dynamic_shape/bench_example_dynamic.py: the standard if __name__ == "__main__": is more idiomatic.

examples/gemm/bench_example_gemm.py (2)

24-25: Consider using standard __name__ check.
Same as previous files: prefer if __name__ == "__main__": over globals().get("__name__") == "__main__".

1-25: Critical: Duplicate benchmark file detected.
This file is a duplicate of examples/gemm/bench_gemm.py in the same directory. Refer to the comment on that file for details.

examples/cast/bench_example_cast.py (1)

14-15: Consider using standard __name__ check.
Same suggestion as other files: use if __name__ == "__main__": for better readability.

examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (2)

9-10: Consider using standard __name__ check.
Same suggestion as other files.

1-6: Critical: Filename mismatch remains unresolved.
The past review correctly identified that the filename uses "splitk" but the imported module and function name use "streamk". This naming inconsistency should be resolved by renaming the file to bench_example_tilelang_gemm_streamk.py.

examples/flash_decoding/bench_example_flash_decoding.py (1)

14-15: Consider using standard __name__ check.
Same suggestion as other files: prefer the standard if __name__ == "__main__": idiom.
🧹 Nitpick comments (8)
examples/deepseek_mla/bench_example_mla_decode.py (1)
9-10: Use the idiomatic main guard.
The globals().get("__name__") pattern is unnecessarily complex. Use the standard Python idiom for better readability.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/warp_specialize/bench_example_warp_specialize.py (1)

24-25: Use the standard Python idiom for the main guard.
The globals().get("__name__") pattern is unnecessarily defensive and non-idiomatic. The __name__ variable is always defined in Python modules.
Apply this diff to use the standard idiom:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)

34-35: Use the standard Python idiom for the main guard.
The globals().get("__name__") pattern is unconventional and less clear than the standard Python idiom. The __name__ variable is always available at module level, so the .get() call adds unnecessary complexity.
Apply this diff to use the standard idiom:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (2)

1-3: Consider removing unused tilelang import.
The tilelang module is imported but not directly used in this script. Unless it's required for side effects or future decorator additions, consider removing it.

```diff
-import tilelang.tools.bench
-import tilelang
-import tilelang_example_sparse_tensorcore
+import tilelang.tools.bench
+import tilelang_example_sparse_tensorcore
```

10-11: Simplify main guard to use standard Python idiom.
The globals().get("__name__") pattern works but is unnecessarily indirect. The standard __name__ check is clearer and more idiomatic.

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)

9-10: Use standard Python idiom for main guard.
The pattern globals().get("__name__") is unnecessarily complex. The standard Python idiom is clearer and more widely recognized.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)

14-15: Simplify to standard Python idiom.
The use of globals().get("__name__") is unnecessarily verbose. Use the standard __name__ check instead.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/dynamic_shape/bench_example_dynamic.py (1)

9-10: Consider using standard __name__ check.
The globals().get("__name__") pattern is unnecessarily verbose. The standard Python idiom if __name__ == "__main__": is more readable and widely recognized.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm/bench_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
- examples/gemv/bench_example_gemv.py
- examples/elementwise/bench_example_elementwise.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/topk/bench_topk_tilelang.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/linear_attention/bench_linear_attn.py
- examples/gemm_fp8/bench_example_gemm_fp8.py
- examples/fusedmoe/bench_example_fusedmoe.py
- examples/minference/bench_vs_sparse_attn.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/convolution/bench_example_convolution.py
🧰 Additional context used
🧬 Code graph analysis (15)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/flash_decoding/bench_example_flash_decoding.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/analyze/bench_example_analyze.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/gemm/bench_example_gemm.py (2)
examples/gemm/bench_gemm.py (4)
bench_example_gemm_autotune(8-9)bench_example_gemm_intrinsics(12-13)bench_example_gemm_schedule(16-17)bench_example_gemm(20-21)tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/gemm/bench_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
🔇 Additional comments (17)
examples/deepseek_mla/bench_example_mla_decode.py (2)
1-2: Imports look correct.The necessary modules are imported. Note that the past review comment about missing
tilelang.testingimport appears to be outdated, as there are no decorators in the current code that would require it.
5-6: LGTM!The benchmark function correctly delegates to
process_funcand follows the established pattern used across other benchmark wrappers in this PR.examples/warp_specialize/bench_example_warp_specialize.py (2)
1-5: Imports look good.The required imports are present and correctly structured. The past review concern about missing
tilelang.testingappears to have been resolved by removing the decorators.
8-21: Benchmark functions follow the expected pattern.Each function correctly delegates to the imported module's
mainviaprocess_func. Using the defaultrepeat=10andwarmup=3parameters is reasonable for general benchmarking.examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
1-2: Imports look correct.Both imports are properly used in the benchmarking wrapper.
5-6: Benchmark function correctly delegates to the benchmarking utility.The implementation properly passes the function object to
process_func, which will handle the repeated execution and timing.examples/analyze/bench_example_analyze.py (3)
1-3: LGTM!The imports are clean and correctly reference the benchmark utility and the example modules to be benchmarked.
6-11: LGTM!Both benchmark functions correctly wrap their respective example modules'
main()functions using the benchmarking utility pattern.
14-15: LGTM!The
globals().get("__name__")guard correctly handles both direct script execution and re-execution viaexec()in the benchmark harness.examples/flash_attention/bench_example_flash_attention.py (3)
1-14: LGTM!All imports are correctly referenced by the benchmark functions below. The import list comprehensively covers GQA/MHA variants for forward and backward passes.
17-66: LGTM!All 12 benchmark functions consistently follow the correct pattern, wrapping their respective example modules'
main()functions with the benchmarking utility.
69-70: LGTM!The script entry guard correctly uses the
globals().get("__name__")pattern to handle both direct execution and benchmark harness re-execution.examples/deepseek_nsa/bench_example_tilelang_nsa.py (2)
10-11: LGTM!The benchmark wrapper follows the same pattern as the first function and is consistent with the framework design.
6-7: No actionable issues found; code correctly implements the assumed contract.The wrapped function
example_tilelang_nsa_fwd.mainis properly imported (line 2) and correctly referenced in the benchmark wrapper. The main function is defined with no required parameters (line 135 of example_tilelang_nsa_fwd.py), and the call at line 7 of the bench file correctly passes it toprocess_funcwithout arguments.examples/attention_sink/bench_example_attention_sink.py (3)
51-52: LGTM! The critical__name__issue has been resolved.The entry point correctly uses
globals().get("__name__")instead of directly accessing__name__, which prevents theNameErrorwhen the bench harness execs this file in an empty namespace. This allows the script to run standalone while skipping the guard cleanly during harness execution.
9-48: LGTM! All targetmain()functions correctly accept thewindow_sizeparameter.All five target example modules'
main()functions have thewindow_size: Optional[int] = Noneparameter and properly handle it. The bench wrapper functions correctly delegate toprocess_func, and the_sliding_windowvariants appropriately passwindow_size=128to their target functions.
1-6: I need to examine the actual repository structure to verify whether this is part of the TileLang repository or a separate project.This file is part of the TileLang repository, a domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels. The test failure shows that
tilelangis not installed in the test environment, which is the expected reason the script failed to run—this is an environment setup issue, not a code problem.The imports in lines 1-6 are correct. tilelang.tools.bench is part of the standard tilelang package structure, and when tilelang is properly installed (as intended in the TileLang repository context), these imports will resolve correctly. The example modules imported in lines 2-6 are local to the examples directory and are designed to be imported when the benchmark script runs in the proper environment.
The review comment incorrectly frames this as an environment-dependent sys.path resolution issue requiring special invocation, when in fact the imports follow standard Python package conventions and will work correctly once tilelang is installed.
Likely an incorrect or invalid review comment.
import example_tilelang_nsa_fwd
import example_tilelang_nsa_decode
🧩 Analysis chain
Verify that the example modules are importable.
The relative imports assume the example modules are in the same directory or on PYTHONPATH. Ensure that these modules exist and can be imported when the script is run.
Run the following script to verify the modules exist:
🏁 Script executed:
#!/bin/bash
# Description: Verify the example modules exist in the expected location
# Check if the example modules exist
fd -t f "example_tilelang_nsa_fwd.py" examples/deepseek_nsa/
fd -t f "example_tilelang_nsa_decode.py" examples/deepseek_nsa/
Length of output: 102
---
🏁 Script executed:
```shell
#!/bin/bash
# Check package structure and verify imports
fd -t f "__init__.py" examples/deepseek_nsa/
fd -t f "__init__.py" examples/
# Check the actual bench_example file structure
head -20 examples/deepseek_nsa/bench_example_tilelang_nsa.py
Length of output: 400
---
🏁 Script executed:
```shell
#!/bin/bash
# Check for setup files or documentation about running examples
fd -t f "setup.py|pyproject.toml|README|run" examples/deepseek_nsa/ -o examples/
# Check if there's any sys.path manipulation in the script or parent modules
rg -n "sys.path" examples/deepseek_nsa/
Length of output: 137
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for setup/documentation files
find examples/deepseek_nsa -type f -name "*.py" | head -20
find examples -maxdepth 2 -type f \( -name "setup.py" -o -name "pyproject.toml" -o -name "README*" \)
# Check for sys.path manipulation
rg -n "sys.path" examples/
Length of output: 2643
---
🏁 Script executed:
```shell
#!/bin/bash
# Check how test_example_tilelang_nsa.py handles imports
head -10 examples/deepseek_nsa/test_example_tilelang_nsa.py
# Also check if any other benchmark files exist and how they handle imports
find examples/deepseek_nsa -name "bench_*.py" -type f
head -10 examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py 2>/dev/null || echo "File not found or error"
Length of output: 464
Fix bare imports to use relative syntax or add sys.path setup.
Lines 2-3 use bare absolute imports that will fail at runtime unless examples/deepseek_nsa/ is on sys.path. Since this directory is not a package (no __init__.py), use relative imports instead:
from . import example_tilelang_nsa_fwd
from . import example_tilelang_nsa_decode
Alternatively, add to the top of the file:
import sys
import os
sys.path.insert(0, os.path.dirname(__file__))
The test file (test_example_tilelang_nsa.py) in the same directory has the same issue.
🤖 Prompt for AI Agents
In examples/deepseek_nsa/bench_example_tilelang_nsa.py around lines 2-3 the bare
absolute imports will fail because the directory is not on sys.path and isn’t a
package; change to relative imports (from . import example_tilelang_nsa_fwd and
from . import example_tilelang_nsa_decode) or, if you prefer module-style
imports, add at the top: import sys, os and sys.path.insert(0,
os.path.dirname(__file__)) so the local modules can be found (apply the same fix
to test_example_tilelang_nsa.py).
@yyttt6 would you mind giving us a doc about this part?
Actionable comments posted: 2
♻️ Duplicate comments (1)
tilelang/tools/bench.py (1)
104-106: Critical: exec namespace breaks the __name__ guard in bench scripts.
The empty module = {} dict causes a NameError when bench scripts contain standard if __name__ == "__main__": guards. Python's exec populates __builtins__ but not __name__, __file__, or other module-level variables. This is why all benchmark scripts use the workaround globals().get("__name__") instead of the standard __name__ check.
Fix by pre-populating the exec namespace:
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
-    module = {}
+    module = {"__name__": "__benchmarked__", "__file__": test_file, "__builtins__": __builtins__}
     with open(test_file) as f:
         exec(f.read(), module)
Note: Using "__benchmarked__" (not "__main__") prevents the guard from triggering during discovery, avoiding infinite recursion.
🧹 Nitpick comments (1)
tilelang/tools/bench.py (1)
48-50: Blind exception catching may mask bugs in benchmark functions.
While printing the traceback, catching all exceptions without distinguishing between transient errors (e.g., CUDA OOM) and persistent bugs (e.g., TypeError from incorrect arguments) makes debugging harder.
Consider limiting the exception scope or adding more detailed error classification in a future iteration.
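One possible shape for that classification, sketched here rather than taken from the PR; it assumes torch is importable in the harness and a recent PyTorch that exposes torch.cuda.OutOfMemoryError:

import traceback

import torch

def call_benchmark(func, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        # Environmental/transient: report and move on to the next benchmark.
        print(f"SKIPPED (CUDA OOM): {getattr(func, '__module__', func)}")
    except TypeError:
        # Usually a harness bug (wrong arguments); surface it instead of recording a latency.
        traceback.print_exc()
        raise
    except Exception:
        # Anything else: keep the run going but keep the evidence.
        traceback.print_exc()
    return None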
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (14)
- examples/analyze/bench_example_analyze.py
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/minference/bench_vs_sparse_attn.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/gemm/bench_example_gemm.py
- examples/elementwise/bench_example_elementwise.py
- examples/linear_attention/bench_linear_attn.py
- examples/topk/bench_topk_tilelang.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
🧰 Additional context used
🧬 Code graph analysis (13)
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/flash_decoding/bench_example_flash_decoding.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/gemv/bench_example_gemv.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/convolution/bench_example_convolution.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/fusedmoe/bench_example_fusedmoe.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
36-37: try-except-pass detected, consider logging the exception
(S110)
36-36: Do not catch blind exception: Exception
(BLE001)
48-48: Do not catch blind exception: Exception
(BLE001)
106-106: Use of exec detected
(S102)
🔇 Additional comments (13)
examples/cast/bench_example_cast.py (1)
1-15: LGTM! Benchmark wrapper follows the standard pattern.The use of
globals().get("__name__")in the guard is the correct workaround for the exec namespace issue intilelang/tools/bench.py. Once that core issue is fixed, this can be simplified to the standardif __name__ == "__main__":pattern.examples/dynamic_shape/bench_example_dynamic.py (1)
1-10: LGTM! Standard benchmark wrapper implementation.Follows the established pattern correctly. The
globals().get("__name__")workaround is appropriate given the current state oftilelang/tools/bench.py.examples/gemv/bench_example_gemv.py (1)
1-10: LGTM! Consistent with other benchmark wrappers.Implementation is correct and follows the standard pattern.
examples/fusedmoe/bench_example_fusedmoe.py (1)
1-10: LGTM! Benchmark wrapper implemented correctly.Follows the established benchmarking pattern consistently.
examples/convolution/bench_example_convolution.py (1)
1-15: LGTM! Benchmark wrapper with multiple entry points.Both benchmark functions are correctly implemented. The
globals().get("__name__")guard is appropriate given the current bench.py implementation.examples/deepseek_mla/bench_example_mla_decode.py (1)
1-10: LGTM! Clean benchmark wrapper implementation.Correctly implements the benchmarking pattern with appropriate imports and guard.
examples/flash_decoding/bench_example_flash_decoding.py (1)
1-15: LGTM! Multiple benchmarks properly registered.Both
bench_example_gqa_decodeandbench_example_mha_inferenceare correctly implemented following the standard pattern.examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM! Consistent benchmark wrapper implementation.The file correctly integrates three FP8 GEMM examples with the benchmarking framework. Each wrapper function properly delegates to the corresponding module's
mainentry point, and the script guard uses the correctglobals().get("__name__")pattern for compatibility with the exec-based discovery inbench.main().examples/warp_specialize/bench_example_warp_specialize.py (1)
1-25: LGTM! Clean benchmark wrapper for warp specialization examples.The file correctly wraps four warp-specialized GEMM examples with the benchmarking harness. The implementation is consistent with the framework pattern and the entry point guard is correct.
examples/attention_sink/bench_example_attention_sink.py (1)
1-52: LGTM! Comprehensive attention sink benchmarking suite.The file correctly wraps multiple MHA/GQA attention sink variants, including sliding-window configurations. The use of
window_size=128as a keyword argument toprocess_funcis valid and provides useful benchmark variations. The implementation is consistent with the framework design.examples/flash_attention/bench_example_flash_attention.py (1)
1-70: LGTM! Extensive flash attention benchmark coverage.The file provides comprehensive benchmark wrappers for 13 flash attention variants covering GQA/MHA, forward/backward passes, different tensor layouts, WGMMA pipelining, and variable-length sequences. The implementation is consistent and correct.
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
1-10: LGTM! Simple and correct benchmark wrapper.The file correctly wraps the block-sparse GEMM example with the benchmarking framework. The minimal implementation is appropriate for a single benchmark target.
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
1-35: LGTM! Block-sparse attention benchmarks correctly integrated.The file correctly wraps six block-sparse attention variants (Triton and TileLang implementations) with the benchmarking framework. The implementation follows the established pattern and is functionally correct. The
bench.process_funcexception handling will gracefully handle any GPU availability issues.
try:
    with suppress_output():
        for _ in range(warmup):
            func(*args, **kwargs)
except Exception:
    pass
Silently ignoring warmup failures may hide configuration issues.
The try-except-pass during warmup suppresses all exceptions, making it impossible to diagnose setup problems (missing dependencies, incorrect arguments, environment issues). If warmup fails, the actual timing runs will likely fail too, but the root cause will be unclear.
Consider logging warmup failures or at least emitting a warning:
try:
with suppress_output():
for _ in range(warmup):
func(*args, **kwargs)
- except Exception:
- pass
+ except Exception as e:
+ warnings.warn(
+ f"Warmup failed for {func.__module__}: {type(e).__name__}",
+ RuntimeWarning,
+ stacklevel=2,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| with suppress_output(): | |
| for _ in range(warmup): | |
| func(*args, **kwargs) | |
| except Exception: | |
| pass | |
| try: | |
| with suppress_output(): | |
| for _ in range(warmup): | |
| func(*args, **kwargs) | |
| except Exception as e: | |
| warnings.warn( | |
| f"Warmup failed for {func.__module__}: {type(e).__name__}", | |
| RuntimeWarning, | |
| stacklevel=2, | |
| ) |
🧰 Tools
🪛 Ruff (0.14.4)
36-37: try-except-pass detected, consider logging the exception
(S110)
36-36: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 32 to 37, the warmup loop currently
swallows all exceptions with a bare except/pass which hides configuration/setup
errors; change it to catch Exception as e and log a warning or error that
includes the exception details (use the module logger or Python's logging with
exc_info=True or logger.exception) so failures during warmup are visible, but
keep the suppressed output context if desired; do not remove the warmup loop —
just replace the silent pass with a logged message that includes the exception
and contextual info (function name, args) so the root cause can be diagnosed.
tilelang/tools/bench.py
Outdated
def analyze_records(records):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    name_col_width = max(len(r[0]) for r in records)
    safe_width = name_col_width + 20
    print("=" * safe_width)
    print(f"{'Function':<{name_col_width}} | Avg Latency (ms)")
    print("-" * safe_width)
    for name, lat in records:
        print(f"{name:<{name_col_width}} | {lat:>10.4f}")
    print("=" * safe_width)

    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")

    test_file = inspect.getsourcefile(sys._getframe(2))
    out_dir = os.path.dirname(test_file)
    out_path = os.path.join(out_dir, "bench_result.png")

    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()

    print(f"Saved Bar chart to {out_path}")
analyze_records will crash if no benchmarks are recorded.
Line 74 computes max(len(r[0]) for r in records) without checking if records is empty, causing a ValueError when all benchmarks fail or when the function is called with an empty list.
Apply this diff to handle empty records gracefully:
def analyze_records(records):
# Analyze the data and draw a chart
+ if not records:
+ warnings.warn("No benchmark records to analyze", RuntimeWarning, stacklevel=2)
+ return
records.sort(key=lambda x: x[1])
name_col_width = max(len(r[0]) for r in records)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def analyze_records(records):
    # Analyze the data and draw a chart
    if not records:
        warnings.warn("No benchmark records to analyze", RuntimeWarning, stacklevel=2)
        return
    records.sort(key=lambda x: x[1])
    name_col_width = max(len(r[0]) for r in records)
    safe_width = name_col_width + 20
    print("=" * safe_width)
    print(f"{'Function':<{name_col_width}} | Avg Latency (ms)")
    print("-" * safe_width)
    for name, lat in records:
        print(f"{name:<{name_col_width}} | {lat:>10.4f}")
    print("=" * safe_width)
    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    test_file = inspect.getsourcefile(sys._getframe(2))
    out_dir = os.path.dirname(test_file)
    out_path = os.path.join(out_dir, "bench_result.png")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()
    print(f"Saved Bar chart to {out_path}")
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 71 to 98, analyze_records assumes
records is non-empty (uses max(...) and indexing) which raises ValueError for
empty lists; add an early guard at the top like "if not records: print a clear
message (e.g., 'No benchmark records to analyze'), and return" so the function
exits before computing name_col_width, creating plots, or trying to save a file;
keep the rest of the function unchanged so normal behavior proceeds when records
are present.
/perf
Actionable comments posted: 3
♻️ Duplicate comments (4)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
1-11: Add runtime CUDA checks before running sparse_tensorcore benchmarkThis benchmark unconditionally calls
tilelang_example_sparse_tensorcore.main()viaprocess_func, so on machines without CUDA or with insufficient compute capability it will fail with a noisy traceback instead of a clear skip.Consider adding explicit runtime validation before invoking
process_func, e.g.:-import tilelang.tools.bench -import tilelang -import tilelang_example_sparse_tensorcore +import tilelang.tools.bench +import tilelang +import tilelang_example_sparse_tensorcore +import torch @@ -def bench_example_sparse_tensorcore(): - tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main) +def bench_example_sparse_tensorcore(): + if not torch.cuda.is_available(): + print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA") + return + major, _ = torch.cuda.get_device_capability() + if major < 9: + print( + f"SKIPPED: {__name__}.bench_example_sparse_tensorcore " + "requires compute capability >= 9.0" + ) + return + tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)This keeps the benchmark harness simple while giving clear feedback when hardware requirements aren’t met.
tilelang/tools/bench.py (3)
33-40: Silent warmup failures can hide configuration issues.The warmup phase suppresses all exceptions without logging. If a benchmark has setup problems (missing dependencies, incorrect arguments, environment issues), users won't know why subsequent runs fail.
This issue was previously flagged in past reviews.
74-76: Missing guard for empty records.Line 76 sorts
recordswithout checking if it's empty. If all benchmarks fail or no benchmarks are registered, subsequent operations (especiallymax(len(r[0]) for r in records)if it were added, or chart generation) could fail.This issue was previously flagged in past reviews.
96-108: Fragile exec namespace relies on bench script workarounds.Line 100 initializes
module = {}for the exec namespace without populating standard module globals like__name__or__file__. This forces all bench scripts to useglobals().get("__name__")instead of the natural__name__idiom. While the workaround functions, fixing the namespace here would make bench scripts more maintainable.This issue was previously flagged in past reviews.
🧹 Nitpick comments (1)
examples/flash_attention/bench_example_flash_attention.py (1)
17-99: Consider adding docstrings for better maintainability.Each benchmark function could benefit from a brief docstring explaining:
- What example it benchmarks
- Why specific parameter values were chosen (e.g., for fair comparison or specific hardware constraints)
- Any notable differences from related benchmarks
Example:
def bench_example_gqa_fwd_bshd():
    """Benchmark GQA forward pass with batch-sequence-head-dimension layout.

    Uses batch=1, seq_len=1024, heads=16 for comparison with pipelined variant.
    """
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
- .github/workflows/pr-perfbench-bot.yml (1 hunks)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- log.txt (1 hunks)
- maint/scripts/ci_performance.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- log.txt
🚧 Files skipped from review as they are similar to previous changes (16)
- examples/flash_decoding/bench_example_flash_decoding.py
- examples/analyze/bench_example_analyze.py
- examples/blocksparse_attention/bench_example_blocksparse_attention.py
- examples/convolution/bench_example_convolution.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/gemv/bench_example_gemv.py
- examples/fusedmoe/bench_example_fusedmoe.py
- examples/warp_specialize/bench_example_warp_specialize.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/gemm/bench_example_gemm.py
- examples/elementwise/bench_example_elementwise.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/topk/bench_topk_tilelang.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
🧰 Additional context used
🧬 Code graph analysis (12)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/minference/bench_vs_sparse_attn.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
maint/scripts/ci_performance.py (2)
tilelang/env.py (1)
disable_cache(275-276)maint/scripts/performance.py (1)
run(22-69)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/dequantize_gemm/bench_example_dequantize_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/linear_attention/bench_linear_attn.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
51-51: Do not catch blind exception: Exception
(BLE001)
102-102: Use of exec detected
(S102)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (14)
examples/deepseek_mla/bench_example_mla_decode.py (1)
1-10: Wrapper is consistent with bench harnessThis follows the shared pattern (bench_* wrapper + bench.main entrypoint) and looks correct. Just ensure
example_mla_decode.maindoesn’t require additional arguments (e.g.,argv) beyond whatprocess_funcsupplies.examples/minference/bench_vs_sparse_attn.py (1)
1-10: Bench wrapper looks correct and consistentThe wrapper correctly forwards to
tilelang.tools.bench.process_funcwithargv=[]and wires upbench.main()as a script entrypoint. This is consistent with the pattern used by other example benchmarks.maint/scripts/ci_performance.py (1)
4-7: Handle key mismatches and empty results more robustlyA couple of edge cases can break this script and the CI job:
Assumes identical key sets between v1 and v2.
- Lines 34–36 iterate
data_v1.keys()and directly accessdata_v2[key], which will raiseKeyErrorif a benchmark exists in only one output (e.g., new bench added in PR, or one removed from main). Since the two subprocess calls run different Python environments, their outputs can diverge.Division by zero when v2 latency is zero.
- Line 35 computes
data_v1[key] / data_v2[key]with no guard; ifdata_v2[key] == 0, this raisesZeroDivisionError.Figure width can be zero when there are no valid rows.
- If parsing yields no entries (e.g., bench_all fails or format changes),
tableis empty ⇒dfempty ⇒fig_width = max(0, len(df) * 0.35)gives 0 (Line 48), which is invalid forplt.figure(figsize=(fig_width, 8)).Suggested changes
- Use the intersection of keys and guard against missing or zero entries:
table = [] -for key in data_v1.keys(): +common_keys = sorted(set(data_v1.keys()) & set(data_v2.keys())) +for key in common_keys: + if data_v2[key] == 0: + continue # avoid division by zero speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])
- Short-circuit when
tableis empty:with open("bench.md", "w") as f: f.write( tabulate(table, headers=headers, tablefmt="github", stralign="left", numalign="decimal")) f.write("\n") +if not table: + raise RuntimeError("No benchmark data parsed from bench_all output") + df = pd.DataFrame(table, columns=headers) df = df.sort_values("Speedup", ascending=False).reset_index(drop=True) -fig_width = max(0, len(df) * 0.35) +fig_width = max(4, len(df) * 0.35)Likely an incorrect or invalid review comment.
examples/dynamic_shape/bench_example_dynamic.py (1)
1-10: LGTM! Standard benchmark wrapper pattern.The file correctly follows the established benchmarking pattern: imports the bench framework, defines a
bench_*function delegating toprocess_func, and uses theglobals().get("__name__")guard for the main entry point.examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
1-10: LGTM! Clean benchmark wrapper.The implementation follows the standard pattern correctly. Note that the past review comment about missing
tilelang.testingimport appears to no longer apply, as the current code does not use any decorators.examples/cast/bench_example_cast.py (1)
1-21: LGTM! Benchmark wrappers are well-structured.Both benchmark functions properly delegate to their respective example modules with appropriate arguments. The implementation is clean and follows the established pattern.
examples/linear_attention/bench_linear_attn.py (1)
1-15: LGTM! Forward and backward benchmark wrappers.The file correctly implements benchmark wrappers for both forward and backward linear attention examples. The past review concern about missing imports does not apply to the current code.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM! Comprehensive FP8 GEMM benchmark coverage.The file provides benchmark wrappers for three FP8 GEMM variants (2xAcc, intrinsic, and standard), properly integrating them with the benchmarking framework.
examples/dequantize_gemm/bench_example_dequantize_gemm.py (1)
1-35: LGTM! Extensive dequantize GEMM benchmark suite.The file properly implements benchmark wrappers for six different dequantize GEMM variants, covering FP4, MxFP4, W4A8, and TMA variants. The implementation is consistent and correct.
tilelang/tools/bench.py (1)
111-144: LGTM! The bench_all() discovery mechanism is well-designed.
The function properly walks the examples directory, dynamically loads modules, discovers bench_* functions, and executes them. The sys.path management ensures imports resolve correctly.
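As a point of reference, the discovery pattern described above fits in a few lines. This is an illustrative sketch (the helper name and argument are invented), not the actual bench_all() implementation:

import importlib.util
import os
import sys

def discover_and_run(examples_root):
    for dirpath, _, filenames in os.walk(examples_root):
        for filename in sorted(filenames):
            if not (filename.startswith("bench_") and filename.endswith(".py")):
                continue
            path = os.path.join(dirpath, filename)
            sys.path.insert(0, dirpath)  # let the bench file import its sibling example modules
            try:
                spec = importlib.util.spec_from_file_location(filename[:-3], path)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
                for name in dir(module):
                    fn = getattr(module, name)
                    if name.startswith("bench_") and callable(fn):
                        fn()
            finally:
                sys.path.pop(0)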
1-52: LGTM! Comprehensive attention sink benchmark suite.The file provides excellent coverage of attention sink benchmarks, including MHA/GQA variants, forward/backward passes, different implementations (wgmma_pipelined), and sliding window configurations. The implementation correctly uses the benchmarking framework and the
globals().get("__name__")pattern.examples/flash_attention/bench_example_flash_attention.py (3)
1-14: LGTM! Imports are clean and complete.All required modules are imported and match the benchmark functions defined below.
56-77: Note: Identical parameters used for pipelined vs non-pipelined GQA benchmarks.Both
bench_example_gqa_fwd_bshd_wgmma_pipelinedandbench_example_gqa_fwd_bshduse identical parameters (batch=1, heads=16, seq_len=1024, dim=128, is_causal=False, groups=16, tune=False). This is likely intentional for fair comparison between implementations, but please confirm this is the desired behavior.
102-103: LGTM! Main guard correctly handles exec context.The
globals().get("__name__")pattern properly handles both direct script execution and the benchmark harness's exec-based replay mechanism.
- name: Read markdown table
  id: read_md
  run: |
    echo "content<<EOF" >> $GITHUB_OUTPUT
    cat bench.md >> $GITHUB_OUTPUT
    echo "EOF" >> $GITHUB_OUTPUT
- name: Upload PNG to GitHub and get URL
🧩 Analysis chain
Image upload step will not produce a valid embeddable URL
There are a couple of correctness issues in the new PNG upload/comment flow that will break the embedded image:
-
Filename mismatch between tree entry and URL (definite bug).
- Tree entry path:
perf_plot_${context.runId}.png(Line 101). - Raw URL path:
bench_${context.runId}.png(Line 108).
The URL points to a file that does not exist in the created tree, so the image link in the PR comment will 404.
- Tree entry path:
-
Insufficient token permissions for git APIs.
- Top-level workflow permissions set
contents: readonly, butgit.createBlob/git.createTreeneed write-level repo contents permissions. As written, the upload step will likely fail with a 403.
- Top-level workflow permissions set
-
ci_performance.py is executed from
main, not from the PR merge commit.- You
checkoutrefs/pull/.../merge, then install the merged version, but latergit checkout mainand never switch back before runningpython maint/scripts/ci_performance.py. That means the script version used at runtime is the one frommain, not the PR. The new bench.md/bench.png generation introduced in this PR won’t be exercised for open-PR/perfruns.
- You
Suggested fixes
- Make the tree path and URL consistent (choose one naming scheme). For example:
- const tree = await github.rest.git.createTree({
+ const tree = await github.rest.git.createTree({
owner: context.repo.owner,
repo: context.repo.repo,
tree: [{
- path: `perf_plot_${context.runId}.png`,
+ path: `bench_${context.runId}.png`,
mode: '100644',
type: 'blob',
sha: blob.data.sha
}]
});
- const url = `https://raw.githubusercontent.com/${context.repo.owner}/${context.repo.repo}/${tree.data.sha}/bench_${context.runId}.png`
+ const url = `https://raw.githubusercontent.com/${context.repo.owner}/${context.repo.repo}/${tree.data.sha}/bench_${context.runId}.png`- Relax or adjust workflow permissions so the job can create blobs/trees, e.g.:
permissions:
- contents: read
+ contents: write- Ensure
ci_performance.pyruns from the PR merge commit so the new bench.md/bench.png logic is used. For instance:- Move
git checkout mainearlier and then re-checkoutrefs/pull/.../mergebefore the “Run performance test” step, or - Run the script with
./tll/bin/pythonand import it from the installed package instead of the working tree, or - After “Install original version”, explicitly
git checkout refs/pull/${{ github.event.issue.number }}/mergebefore step “Run performance test”.
- Move
Also applies to: 81-109, 116-128
🏁 Script executed:
#!/bin/bash
# Locate and display the workflow file with line numbers
if [ -f ".github/workflows/pr-perfbench-bot.yml" ]; then
echo "=== Workflow file found ==="
cat -n ".github/workflows/pr-perfbench-bot.yml" | head -150
else
echo "Workflow file not found at expected path"
find . -name "pr-perfbench-bot.yml" -type f 2>/dev/null
fiLength of output: 5204
🏁 Script executed:
#!/bin/bash
# Search for git.createBlob and git.createTree calls to understand the upload step
rg -n "createBlob|createTree|perf_plot|bench_\$\{context" ".github/workflows/" -A 2 -B 2Length of output: 1189
🏁 Script executed:
#!/bin/bash
# Check for permissions block and git checkout sequences
rg -n "permissions:|git checkout|ci_performance.py" ".github/workflows/pr-perfbench-bot.yml" -B 1 -A 1Length of output: 273
Image upload step will not produce a valid embeddable URL and will fail with permissions error
All three correctness issues in the new PNG upload/comment flow are confirmed and must be fixed:
-
Filename mismatch between tree entry and URL (confirmed bug).
- Line 101: Tree entry path uses
perf_plot_${context.runId}.png - Line 108: Raw URL references
bench_${context.runId}.png - The URL points to a file that does not exist in the created tree, causing a 404 on the embedded image.
- Line 101: Tree entry path uses
-
Insufficient token permissions for git APIs (confirmed bug).
- Line 8-9: Workflow permissions set
contents: readonly - Lines 90 and 97: Calls to
github.rest.git.createBlobandgithub.rest.git.createTreerequire write-level permissions - The upload step will fail with a 403 error.
- Line 8-9: Workflow permissions set
-
ci_performance.py executes from
main, not the PR merge commit (confirmed bug).- Line 64:
git checkout mainswitches to main branch - Lines 73-74: Script runs from the main branch working directory via the
tlvenv - The new bench.md/bench.png generation logic introduced in this PR will not be exercised for
/perfruns.
- Line 64:
Required fixes:
-
Make the tree path and URL consistent (both should use
bench_prefix):- path: `perf_plot_${context.runId}.png`, + path: `bench_${context.runId}.png`,
-
Upgrade workflow permissions to allow blob/tree creation:
permissions: - contents: read + contents: write
-
Ensure
ci_performance.pyruns from the PR merge commit: after line 68, addgit checkout refs/pull/${{ github.event.issue.number }}/mergebefore the "Run performance test" step.
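A sketch of that third fix as a workflow step; the step name and exact placement are illustrative, and the fetch/checkout pair avoids assuming the merge ref is still available locally:

      - name: Re-checkout PR merge commit
        run: |
          git fetch origin refs/pull/${{ github.event.issue.number }}/merge
          git checkout FETCH_HEAD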
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
.github/workflows/pr-perfbench-bot.yml lines 75-81 and surrounding steps: the
image upload flow has three confirmed issues — the tree entry filename and the
constructed raw URL are inconsistent (use perf_plot_ vs bench_), the workflow
permissions only grant contents: read while createBlob/createTree need write,
and the perf script is run from main instead of the PR merge commit; to fix,
make the filename used when creating the blob/tree and the URL construction use
the same bench_${{ github.run_id }} (or context.runId) prefix, update workflow
permissions to grant contents: write for the job (or at least for the step that
calls git APIs), and before the "Run performance test" step add a checkout to
refs/pull/${{ github.event.issue.number }}/merge so ci_performance.py runs
against the PR merge commit.
def analyze_records(records, out_dir):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    headers = ["Functions", "Avg Latency (ms)"]
    print(
        tabulate(_RECORDS, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))

    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    out_path = os.path.join(out_dir, "bench_result.png")

    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()

    print(f"Saved Bar chart to {out_path}")
Inconsistent use of parameter vs global variable.
The function receives records as a parameter and sorts it (Line 76), but Line 79 prints _RECORDS (the global) instead of the records parameter. Lines 81-82 correctly use records. This inconsistency could cause the printed table to differ from the chart when records ≠ _RECORDS.
Apply this diff to use the parameter consistently:
records.sort(key=lambda x: x[1])
headers = ["Functions", "Avg Latency (ms)"]
print(
- tabulate(_RECORDS, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))
+ tabulate(records, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def analyze_records(records, out_dir):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    headers = ["Functions", "Avg Latency (ms)"]
    print(
        tabulate(records, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))
    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    out_path = os.path.join(out_dir, "bench_result.png")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()
    print(f"Saved Bar chart to {out_path}")
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 74 to 94, the tabulate call prints the
global _RECORDS instead of the function parameter records, causing a mismatch
between the printed table and the plotted chart; update the tabulate call to use
the local records variable (i.e., replace _RECORDS with records) so the printed
table and chart reflect the same input, keeping the existing sort on records and
other logic unchanged.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/pr-perfbench-bot.yml (1)
59-74: Run ci_performance.py from the PR branch venv, not the main branch venv.
This workflow will fail because it runs ci_performance.py from the main branch (line 64: git checkout main, line 73: source tl/bin/activate), but the main branch version doesn't generate bench.md or bench.png. The updated ci_performance.py in the PR branch calls tilelang.tools.bench.bench_all() and writes these files, whereas the main branch version runs performance.py and only prints metrics. When the workflow reaches line 75 ("Read markdown table"), the bench.md file won't exist, causing the workflow to fail.
Change line 73 from source tl/bin/activate to source tll/bin/activate so the performance script runs from the PR branch installation, which includes the updated benchmarking code.
🧹 Nitpick comments (1)
examples/flash_attention/bench_example_flash_attention.py (1)
1-108: Consider adding a module-level docstring.This file serves as a benchmark aggregator for flash attention examples. Adding a brief module-level docstring would improve maintainability and help developers understand its purpose, especially since this is part of a broader benchmarking framework across the project.
For example:
"""Benchmark aggregator for flash attention examples. This module registers benchmark functions for various flash attention implementations (GQA, MHA, forward, backward, etc.) that can be executed via tilelang.tools.bench.main(). """ import tilelang.tools.bench # ... rest of imports
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/pr-perfbench-bot.yml (2 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
Applied to files:
.github/workflows/pr-perfbench-bot.yml
🧬 Code graph analysis (1)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🔇 Additional comments (3)
.github/workflows/pr-perfbench-bot.yml (2)
8-9: ✓ Permissions upgraded correctlyLine 9 now grants
contents: write, which is necessary forgithub.rest.git.createBlobandgithub.rest.git.createTreecalls in the "Upload PNG" step (lines 90, 97). This resolves the earlier 403 error issue.
101-101: ✓ Filename consistency fixedBoth the tree entry path (line 101) and the raw URL construction (line 108) now use the same
bench_${context.runId}.pngprefix, resolving the earlier 404 mismatch issue.Also applies to: 108-108
examples/flash_attention/bench_example_flash_attention.py (1)
1-14: All imported example modules exist and are correctly referenced.Verification confirms that all 13 imported example modules exist in the
examples/flash_attentiondirectory and are correctly imported in the file. The import statements are valid and will resolve properly at runtime.
def bench_example_mha_bwd_bshd():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bshd.main,
        BATCH=1,
        H=16,
        N_CTX=512,
        D_HEAD=64,
        causal=False,
    )

def bench_example_mha_bwd_bhsd():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bhsd.main,
        BATCH=1,
        H=16,
        N_CTX=512,
        D_HEAD=64,
        causal=False,
    )

def bench_example_mha_bwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bshd_wgmma_pipelined.main,
        BATCH=1,
        H=32,
        N_CTX=256,
        D_HEAD=64,
        causal=False)

def bench_example_gqa_fwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd_wgmma_pipelined.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)

def bench_example_gqa_fwd_bshd():
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)

def bench_example_mha_fwd_bhsd_wgmma_pipelined():
    tilelang.tools.bench.process_func(example_mha_fwd_bhsd_wgmma_pipelined.main)

def bench_example_mha_fwd_bhsd():
    tilelang.tools.bench.process_func(example_mha_fwd_bhsd.main)

def bench_example_mha_fwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_mha_fwd_bshd_wgmma_pipelined.main, batch=1, heads=32, seq_len=256)

def bench_example_mha_fwd_bshd():
    tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)

def bench_example_mha_fwd_varlen():
    tilelang.tools.bench.process_func(
        example_mha_fwd_varlen.main, batch=4, heads=16, seq_len=512, dim=64)
🧩 Analysis chain
Verify parameter names match the example module signatures.
The benchmark functions use inconsistent parameter naming conventions:
- MHA backward benchmarks (lines 29-58): UPPERCASE parameters (
BATCH,H,N_CTX,D_HEAD,causal) - GQA/MHA forward benchmarks (lines 61-104): lowercase parameters (
batch,heads,seq_len,dim, etc.)
This inconsistency was flagged in a previous review. Ensure these parameter names match the actual function signatures in each example module to avoid TypeError exceptions at runtime.
Run this script to verify parameter compatibility:
🏁 Script executed:
```shell
#!/bin/bash
# Description: Extract main() function signatures from example modules to verify parameter names

echo "=== MHA Backward Example Signatures (expecting UPPERCASE) ==="
for file in example_mha_bwd_bshd.py example_mha_bwd_bhsd.py example_mha_bwd_bshd_wgmma_pipelined.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done

echo -e "\n=== GQA Forward Example Signatures (expecting lowercase) ==="
for file in example_gqa_fwd_bshd.py example_gqa_fwd_bshd_wgmma_pipelined.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done

echo -e "\n=== MHA Forward Example Signatures (expecting lowercase) ==="
for file in example_mha_fwd_bhsd.py example_mha_fwd_bhsd_wgmma_pipelined.py example_mha_fwd_bshd.py example_mha_fwd_bshd_wgmma_pipelined.py example_mha_fwd_varlen.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done
```

Length of output: 21933
I need to search for the `process_func()` implementation to understand how it processes the keyword arguments against these tensor-based function signatures. The web search didn't provide the specific implementation of `process_func`, so let me search the codebase directly for it:
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for process_func implementation in the tilelang tools module
rg -n "def process_func" --type=py -A 20
Length of output: 1300
Fix parameter mismatches between benchmark calls and example function signatures—multiple functions will fail at runtime.
The benchmark functions have critical parameter mismatches that will cause TypeError exceptions when called:
- GQA forward benchmarks (lines 61-82): Passing `batch`, `heads`, `seq_len`, `dim`, `is_causal`, `groups`, `tune`, but the actual signatures expect positional tensor arguments `Q`, `K`, `V`, `Output`
- MHA forward bhsd benchmarks (lines 93-99): Passing no arguments, but the actual signatures expect `Q`, `K`, `V`, `Output` tensor arguments
- MHA forward bshd benchmarks (lines 100-102): Passing `batch`, `seq_len`, but the actual signatures expect `Q`, `K`, `V`, `Output` tensor arguments
- MHA forward varlen benchmark (lines 104-105): Passing `batch`, `heads`, `seq_len`, `dim`, but the actual signature expects `Q_unpad`, `K_unpad`, `V_unpad`, `cu_seqlens_q`, `cu_seqlens_k`, `max_seqlen_q`, `Output_unpad`

The MHA backward benchmarks (lines 29-58) correctly match their function signatures with `BATCH`, `H`, `N_CTX`, `D_HEAD`, `causal` parameters.
🤖 Prompt for AI Agents
examples/flash_attention/bench_example_flash_attention.py lines 29-104: several
benchmark wrappers call example_* .main with keyword params that do not match
the example functions' signatures (GQA and MHA forward variants expect tensor
arguments like Q,K,V,Output or unpadded variants), causing runtime TypeError;
fix by replacing direct calls with small wrapper functions (or lambdas) that
construct and pass dummy tensors/sequence metadata with the correct shapes/types
expected by each example.main (e.g., for GQA/MHA forward build random
Q,K,V,Output tensors with matching batch/heads/seq_len/dim or for varlen build
Q_unpad/K_unpad/V_unpad plus cu_seqlens and max_seqlen), then pass those
wrappers into tilelang.tools.bench.process_func so the signatures align and
benchmarks run without TypeError.
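
For illustration, a minimal sketch of the wrapper approach described above. The tensor shapes, dtypes, BSHD layout, and the positional `(Q, K, V, Output)` call are assumptions taken from the analysis, not the example's confirmed API:

```python
import torch

import tilelang.tools.bench
import example_mha_fwd_bshd


def bench_example_mha_fwd_bshd():
    batch, heads, seq_len, dim = 1, 32, 256, 64
    # Dummy inputs for the assumed (Q, K, V, Output) signature, BSHD layout.
    q = torch.randn(batch, seq_len, heads, dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = torch.empty_like(q)

    def run():
        # Adapter: forwards the pre-built tensors so process_func can call it with no kwargs.
        example_mha_fwd_bshd.main(q, k, v, out)

    tilelang.tools.bench.process_func(run)
```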
|
/perf |
Actionable comments posted: 3
♻️ Duplicate comments (8)
.github/workflows/pr-perfbench-bot.yml (2)
59-74: Critical: Performance script executes from main branch, not PR merge commit. The workflow installs and runs
`ci_performance.py` from the main branch (Lines 64, 73-74), which means the new bench.md/bench.png generation logic introduced in this PR will not be exercised during `/perf` command runs. The workflow checks out the PR merge commit (Line 38), installs it into the `tl` venv (Lines 52-57), but then switches to main (Line 64) and runs the performance script from the `tl` venv. To fix, ensure
`ci_performance.py` runs from the PR merge commit. Add after Line 68:

```diff
       pip install .
+
+      - name: Switch back to PR merge commit
+        run: |
+          git checkout refs/pull/${{ github.event.issue.number }}/merge
```
81-109: Filename consistency issue resolved. The tree entry path (Line 101) and the raw URL construction (Line 108) now both use
`bench_${context.runId}.png`, fixing the previous mismatch that would have caused 404 errors on embedded images.

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
6-7: CUDA requirement validation concern acknowledged. The benchmark delegates to a CUDA-requiring module without runtime checks. When executed via
`bench.main()`, pytest decorators are bypassed. However, this is a design limitation of the broader benchmarking framework that affects all CUDA-dependent benchmarks uniformly, not specific to this file.

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
1-7: Add missing CUDA decorators and import. The past review comment remains unresolved. This file still lacks
`import tilelang.testing` and `@tilelang.testing.requires_cuda` decorators on all six benchmark functions (lines 10, 14, 18, 22, 28, 41), which call GPU-exclusive kernels.
tilelang/tools/bench.py (4)
100-102: Critical: exec namespace breaks__name__guard in bench scripts.This past review comment remains unresolved. The empty
module = {}dict causesNameErrorwhen bench scripts containif __name__ == "__main__":guards becauseexec()won't populate__name__,__file__, or other module-level variables.
35-40: Silently ignoring warmup failures hides configuration issues.This past review comment remains unresolved. The
try-except-passduring warmup suppresses all exceptions, making it impossible to diagnose setup problems (missing dependencies, incorrect arguments, environment issues).
79-79: Inconsistent use of parameter vs global variable.This past review comment remains unresolved. Line 79 prints
_RECORDS(the global) instead of therecordsparameter, causing the printed table to differ from the chart when the parameter differs from the global.
74-93: Handle empty records gracefully.While the explicit
max(len(r[0]) for r in records)mentioned in past reviews appears to have been removed,analyze_recordsshould still guard against empty records to prevent plotting errors and provide clearer feedback.Apply this diff:
def analyze_records(records, out_dir): # Analyze the data and draw a chart + if not records: + print("No benchmark records to analyze") + return records.sort(key=lambda x: x[1])
🧹 Nitpick comments (3)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
2-2: Remove unused import.The
tilelangmodule is imported but never used in this file.import tilelang.tools.bench -import tilelang import tilelang_example_sparse_tensorcoremaint/scripts/ci_performance.py (2)
46-49: Potential edge case: zero or negative figure width.Line 48 calculates
fig_width = max(0, len(df) * 0.35), which could be 0 if the DataFrame is empty or very small. A zero-width figure would fail or produce an unusable plot.Consider setting a minimum width:
-fig_width = max(0, len(df) * 0.35) +fig_width = max(6, len(df) * 0.35) plt.figure(figsize=(fig_width, 8))
31-37: Potential KeyError if benchmark keys differ between versions.Line 34 iterates over
data_v1.keys()and directly accessesdata_v2[key]without checking if the key exists. If a benchmark is added or removed between versions, this will raise aKeyError.Add a check to handle missing keys:
table = [] for key in data_v1.keys(): + if key not in data_v2: + continue speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])Or use a symmetric approach that handles keys in either version:
table = [] all_keys = set(data_v1.keys()) | set(data_v2.keys()) for key in all_keys: if key in data_v1 and key in data_v2: speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
.github/workflows/pr-perfbench-bot.yml(2 hunks)examples/analyze/bench_example_analyze.py(1 hunks)examples/attention_sink/bench_example_attention_sink.py(1 hunks)examples/blocksparse_attention/bench_example_blocksparse_attention.py(1 hunks)examples/blocksparse_gemm/bench_example_blocksparse_gemm.py(1 hunks)examples/cast/bench_example_cast.py(1 hunks)examples/convolution/bench_example_convolution.py(1 hunks)examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py(1 hunks)examples/deepseek_mla/bench_example_mla_decode.py(1 hunks)examples/deepseek_nsa/bench_example_tilelang_nsa.py(1 hunks)examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py(1 hunks)examples/dequantize_gemm/bench_example_dequantize_gemm.py(1 hunks)examples/dynamic_shape/bench_example_dynamic.py(1 hunks)examples/elementwise/bench_example_elementwise.py(1 hunks)examples/flash_attention/bench_example_flash_attention.py(1 hunks)examples/flash_decoding/bench_example_flash_decoding.py(1 hunks)examples/fusedmoe/bench_example_fusedmoe.py(1 hunks)examples/gemm/bench_example_gemm.py(1 hunks)examples/gemm_fp8/bench_example_gemm_fp8.py(1 hunks)examples/gemm_splitk/bench_example_gemm_splitk.py(1 hunks)examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py(1 hunks)examples/gemv/bench_example_gemv.py(1 hunks)examples/linear_attention/bench_linear_attn.py(1 hunks)examples/minference/bench_vs_sparse_attn.py(1 hunks)examples/seer_attention/bench_block_sparse_attn_tilelang.py(1 hunks)examples/sparse_tensorcore/bench_example_sparse_tensorcore.py(1 hunks)examples/topk/bench_topk_tilelang.py(1 hunks)examples/warp_specialize/bench_example_warp_specialize.py(1 hunks)log.txt(1 hunks)maint/scripts/ci_performance.py(1 hunks)tilelang/tools/bench.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (19)
- examples/flash_decoding/bench_example_flash_decoding.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/elementwise/bench_example_elementwise.py
- examples/linear_attention/bench_linear_attn.py
- examples/fusedmoe/bench_example_fusedmoe.py
- log.txt
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/deepseek_mla/bench_example_mla_decode.py
- examples/minference/bench_vs_sparse_attn.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/convolution/bench_example_convolution.py
- examples/dynamic_shape/bench_example_dynamic.py
- examples/analyze/bench_example_analyze.py
- examples/attention_sink/bench_example_attention_sink.py
- examples/cast/bench_example_cast.py
- examples/gemm/bench_example_gemm.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
Applied to files:
.github/workflows/pr-perfbench-bot.yml
🧬 Code graph analysis (10)
examples/topk/bench_topk_tilelang.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/gemv/bench_example_gemv.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
maint/scripts/ci_performance.py (2)
tilelang/env.py (1)
disable_cache(275-276)maint/scripts/performance.py (1)
run(22-69)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/gemm_splitk/bench_example_gemm_splitk.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🪛 Ruff (0.14.5)
tilelang/tools/bench.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
51-51: Do not catch blind exception: Exception
(BLE001)
102-102: Use of exec detected
(S102)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (12)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
1-10: LGTM!The benchmark wrapper follows the established pattern correctly, delegating to
tilelang.tools.bench.process_funcand using the standard main guard..github/workflows/pr-perfbench-bot.yml (1)
8-9: Permissions upgrade successfully addresses blob/tree creation requirements.The change from
contents: readtocontents: writecorrectly enables the GitHub API calls at Lines 90 and 97 (createBlob,createTree).Based on learnings.
examples/topk/bench_topk_tilelang.py (1)
1-10: LGTM!The benchmark wrapper correctly follows the established pattern with appropriate imports and delegation to the bench framework.
maint/scripts/ci_performance.py (2)
4-9: LGTM!The new imports and cache disabling integrate cleanly with the benchmarking framework.
22-29: Thebench_all()function is correctly implemented and exists intilelang.tools.bench.Verification confirms the function is defined at line 111 of
tilelang/tools/bench.pyand is properly exported for use by the CI script.examples/gemm_splitk/bench_example_gemm_splitk.py (1)
1-15: LGTM!The benchmark wrapper correctly registers two split-k GEMM variants following the established pattern.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM!The benchmark wrapper correctly registers three FP8 GEMM variants following the established pattern, with clear naming that distinguishes the 2xAcc, intrinsic, and base implementations.
examples/gemv/bench_example_gemv.py (1)
5-6: Clarify confusingdo_benchparameter logic.The
do_bench=Falseparameter is indeed unique among all benchmark scripts and is accepted byexample_gemv.main. However, the parameter logic appears inverted: the condition at line 362 (if not do_bench:) runs extended benchmarking whendo_bench=False, which contradicts the parameter name's intuitive meaning.Confirm whether this inverted logic is intentional or if the condition should be
if do_bench:instead.examples/warp_specialize/bench_example_warp_specialize.py (1)
28-29: LGTM!Correctly uses
globals().get("__name__")to avoidNameErrorwhentilelang.tools.bench.main()re-executes this file viaexec().examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
54-55: LGTM!Correctly uses
globals().get("__name__")pattern.tilelang/tools/bench.py (1)
111-144: LGTM!The
bench_all()function correctly:
- Uses
importlib.utilfor proper module loading (preserves__file__, imports)- Walks the examples directory to discover bench scripts
- Adds directories to
sys.pathbefore loading modules- Discovers and invokes all
bench_*functionsexamples/flash_attention/bench_example_flash_attention.py (1)
107-108: LGTM!Correctly uses
globals().get("__name__")pattern.
| import tilelang.tools.bench
| import example_gqa_bwd
| import example_gqa_bwd_tma_reduce_varlen
| import example_gqa_bwd_wgmma_pipelined
| import example_gqa_fwd_bshd
| import example_gqa_fwd_bshd_wgmma_pipelined
| import example_mha_bwd_bshd
| import example_mha_bwd_bhsd
| import example_mha_bwd_bshd_wgmma_pipelined
| import example_mha_fwd_bhsd
| import example_mha_fwd_bhsd_wgmma_pipelined
| import example_mha_fwd_bshd
| import example_mha_fwd_bshd_wgmma_pipelined
| import example_mha_fwd_varlen
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify CUDA requirements for flash attention benchmarks.
All 13 benchmark functions invoke flash attention kernels (GQA and MHA variants), which typically require GPU execution. Similar to bench_example_blocksparse_attention.py, this file may need import tilelang.testing and @tilelang.testing.requires_cuda decorators on each benchmark function. Confirm whether these benchmarks can run on CPU or if decorators should be added.
Also applies to: 17-105
🏁 Script executed:
```shell
#!/bin/bash
# Check if flash attention examples have CUDA dependencies
echo "=== Checking for CUDA/GPU usage in flash attention modules ==="
for module in example_gqa_bwd example_gqa_bwd_tma_reduce_varlen example_gqa_bwd_wgmma_pipelined example_gqa_fwd_bshd example_gqa_fwd_bshd_wgmma_pipelined example_mha_bwd_bshd example_mha_bwd_bhsd example_mha_bwd_bshd_wgmma_pipelined example_mha_fwd_bhsd example_mha_fwd_bhsd_wgmma_pipelined example_mha_fwd_bshd example_mha_fwd_bshd_wgmma_pipelined example_mha_fwd_varlen; do
  echo "--- Module: ${module}.py ---"
  fd "${module}.py" examples/flash_attention --exec rg -nC2 "@tilelang.jit|@triton.jit|torch.*cuda|\.cuda\(\)" {} \; 2>/dev/null | head -20
done
```

Length of output: 4188
Now I need to examine the benchmark file directly to understand the structure and compare with similar files:
Benchmarks require CUDA decorators to prevent test failures on non-GPU systems.
All 13 benchmark functions invoke GPU-only flash attention kernels decorated with @tilelang.jit. These kernels will fail on CPU-only systems. Add import tilelang.testing and decorate each benchmark function with @tilelang.testing.requires_cuda to skip tests when CUDA is unavailable, following the pattern used in bench_example_blocksparse_attention.py.
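
For illustration, a minimal sketch of the suggested decorator pattern; the wrapped body mirrors one of the existing bench_* wrappers, and nothing here is the file's current code:

```python
import tilelang.testing
import tilelang.tools.bench
import example_mha_fwd_bshd


@tilelang.testing.requires_cuda
def bench_example_mha_fwd_bshd():
    # Skipped automatically on hosts without CUDA; unchanged otherwise.
    tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)
```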
🤖 Prompt for AI Agents
In examples/flash_attention/bench_example_flash_attention.py around lines 1 to
14, the benchmark module imports GPU-only kernels but lacks CUDA availability
checks; add "import tilelang.testing" at top and annotate each benchmark
function with "@tilelang.testing.requires_cuda" so tests are skipped on systems
without CUDA, mirroring bench_example_blocksparse_attention.py; ensure the
decorator is applied to all 13 benchmark functions that call @tilelang.jit
kernels.
| def bench_example_gqa_bwd_tma_reduce_varlen():
|     tilelang.tools.bench.process_func(example_gqa_bwd_tma_reduce_varlen.main)
|
|
| def bench_example_gqa_bwd():
|     tilelang.tools.bench.process_func(example_gqa_bwd.main)
|
|
| def bench_example_gqa_bwd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(example_gqa_bwd_wgmma_pipelined.main)
|
|
| def bench_example_mha_bwd_bshd():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bshd.main,
|         BATCH=1,
|         H=16,
|         N_CTX=512,
|         D_HEAD=64,
|         causal=False,
|     )
|
|
| def bench_example_mha_bwd_bhsd():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bhsd.main,
|         BATCH=1,
|         H=16,
|         N_CTX=512,
|         D_HEAD=64,
|         causal=False,
|     )
|
|
| def bench_example_mha_bwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bshd_wgmma_pipelined.main,
|         BATCH=1,
|         H=32,
|         N_CTX=256,
|         D_HEAD=64,
|         causal=False)
|
|
| def bench_example_gqa_fwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_gqa_fwd_bshd_wgmma_pipelined.main,
|         batch=1,
|         heads=16,
|         seq_len=1024,
|         dim=128,
|         is_causal=False,
|         groups=16,
|         tune=False)
|
|
| def bench_example_gqa_fwd_bshd():
|     tilelang.tools.bench.process_func(
|         example_gqa_fwd_bshd.main,
|         batch=1,
|         heads=16,
|         seq_len=1024,
|         dim=128,
|         is_causal=False,
|         groups=16,
|         tune=False)
|
|
| def bench_example_mha_fwd_bhsd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(example_mha_fwd_bhsd_wgmma_pipelined.main)
|
|
| def bench_example_mha_fwd_bhsd():
|     tilelang.tools.bench.process_func(example_mha_fwd_bhsd.main)
|
|
| def bench_example_mha_fwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_mha_fwd_bshd_wgmma_pipelined.main, batch=1, heads=32, seq_len=256)
|
|
| def bench_example_mha_fwd_bshd():
|     tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)
|
|
| def bench_example_mha_fwd_varlen():
|     tilelang.tools.bench.process_func(
|         example_mha_fwd_varlen.main, batch=4, heads=16, seq_len=512, dim=64)
🧩 Analysis chain
Verify parameter compatibility with example function signatures.
Past reviews flagged critical parameter mismatches between benchmark calls and example function signatures (GQA/MHA forward variants expecting tensor arguments vs. scalar kwargs). While marked as addressed in prior commits, confirm that all benchmark function calls match their target signatures to avoid TypeError exceptions at runtime.
🏁 Script executed:
```shell
#!/bin/bash
# Extract main() signatures from flash attention examples
echo "=== Checking function signatures ==="
for module in example_gqa_bwd example_gqa_bwd_tma_reduce_varlen example_gqa_bwd_wgmma_pipelined example_gqa_fwd_bshd example_gqa_fwd_bshd_wgmma_pipelined example_mha_bwd_bshd example_mha_bwd_bhsd example_mha_bwd_bshd_wgmma_pipelined example_mha_fwd_bhsd example_mha_fwd_bhsd_wgmma_pipelined example_mha_fwd_bshd example_mha_fwd_bshd_wgmma_pipelined example_mha_fwd_varlen; do
  echo "--- ${module}.py ---"
  fd "${module}.py" examples/flash_attention --exec ast-grep --pattern 'def main($$$): $$$' {} \; 2>/dev/null | head -5
done
```

Length of output: 6391
Critical parameter mismatches in forward function benchmark calls.
Verification confirms the original review comment is valid. All forward attention function calls (lines 65–105) pass scalar keyword arguments (batch, heads, seq_len, dim, etc.) or no arguments, but their main() signatures require tensor arguments:
- GQA/MHA forward functions (lines 65–96, 103–105): expect `Q: T.Tensor(...)`, `K: T.Tensor(...)`, `V: T.Tensor(...)`, `Output: T.Tensor(...)` but receive scalar kwargs or no args
- MHA forward varlen (lines 103–105): expects `Q_unpad: T.Tensor(...)`, `K_unpad: T.Tensor(...)`, `V_unpad: T.Tensor(...)`, `cu_seqlens_q: T.Tensor([...], "int32")` but receives `batch=4, heads=16, seq_len=512, dim=64`
These mismatches will cause TypeError exceptions at runtime when process_func attempts to invoke the target functions. Align all forward function calls with their tensor-based signatures.
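
As one possible shape of the fix, a hedged sketch of an adapter for the varlen case; the unpadded tensor shapes, dtypes, and the equal-length `cu_seqlens` construction are assumptions layered on the signature quoted above:

```python
import torch

import tilelang.tools.bench
import example_mha_fwd_varlen


def bench_example_mha_fwd_varlen():
    batch, heads, seq_len, dim = 4, 16, 512, 64
    total = batch * seq_len
    q = torch.randn(total, heads, dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = torch.empty_like(q)
    # Equal-length sequences: cu_seqlens = [0, 512, 1024, 1536, 2048]
    cu_seqlens = torch.arange(0, total + 1, seq_len, device="cuda", dtype=torch.int32)

    def run():
        # Assumed order: Q_unpad, K_unpad, V_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, Output_unpad.
        example_mha_fwd_varlen.main(q, k, v, cu_seqlens, cu_seqlens, seq_len, out)

    tilelang.tools.bench.process_func(run)
```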
🤖 Prompt for AI Agents
In examples/flash_attention/bench_example_flash_attention.py around lines 17 to
105, the benchmark helper calls for the forward GQA/MHA examples pass scalar
keyword args (batch, heads, seq_len, dim, etc.) or no args, but the referenced
main() functions expect Tensor inputs (Q, K, V, Output or unpadded Q/K/V and
cu_seqlens tensors). Replace the scalar kwargs with code that constructs and
passes appropriately-shaped T.Tensor inputs (matching the signatures: dense
Q/K/V/Output or unpadded tensors plus cu_seqlens) before calling process_func,
or wrap the main functions with small adapter lambdas that create those tensors
from the scalar parameters; ensure tensor dtypes and shapes match the expected
signatures to avoid TypeError at runtime.
| def bench_example_warp_specialize_gemm_barrierpipe_stage2():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_barrierpipe_stage2.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_copy_0_gemm_1():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_copy_0_gemm_1.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_copy_1_gemm_0():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_copy_1_gemm_0.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_softpipe_stage2():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_softpipe_stage2.main, M=1024, N=1024, K=1024)
🧩 Analysis chain
Verify CUDA requirements for warp-specialized GEMM benchmarks.
All four benchmark functions invoke warp-specialized GEMM kernels. Warp specialization is a CUDA-specific optimization, and similar benchmark files in this PR (e.g., bench_example_blocksparse_attention.py) were flagged for missing @tilelang.testing.requires_cuda decorators. Confirm whether these benchmarks can run on CPU or if decorators should be added.
🏁 Script executed:
```shell
#!/bin/bash
# Check if the warp specialize example modules have CUDA dependencies
echo "=== Checking for CUDA/GPU usage in warp specialize modules ==="
for module in example_warp_specialize_gemm_barrierpipe_stage2 example_warp_specialize_gemm_copy_0_gemm_1 example_warp_specialize_gemm_copy_1_gemm_0 example_warp_specialize_gemm_softpipe_stage2; do
  echo "--- Module: ${module}.py ---"
  fd "${module}.py" examples/warp_specialize --exec rg -n "@tilelang.jit|@triton.jit|torch.*cuda|\.cuda\(\)" {} \; 2>/dev/null || echo "Not found"
done
```

Length of output: 1309
---
🏁 Script executed:
```shell
#!/bin/bash
# Check if benchmark functions have CUDA decorators
echo "=== Checking benchmark functions for decorators ==="
rg -n -B2 "def bench_example_warp_specialize" examples/warp_specialize/bench_example_warp_specialize.py
Length of output: 332
---
🏁 Script executed:
```shell
#!/bin/bash
# Check blocksparse_attention benchmark file for decorator pattern
echo "=== Checking blocksparse_attention benchmark decorators ==="
rg -n -B2 "def bench_" examples/blocksparse_attention/bench_example_blocksparse_attention.py | head -30
Length of output: 464
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for @tilelang.testing.requires_cuda usage in codebase
echo "=== Searching for CUDA decorators in test/benchmark files ==="
rg -n "@tilelang.testing.requires_cuda|@tilelang.testing.requires_gpu" examples/ --type py -B1 | head -40
Length of output: 2695
---
🏁 Script executed:
```shell
#!/bin/bash
# Check warp_specialize test file for compute version requirements
echo "=== Checking warp_specialize test compute version requirements ==="
rg -n "@tilelang.testing.requires_cuda" examples/warp_specialize/test_example_warp_specialize.py -A1 | head -20
Length of output: 846
Add CUDA decorators to benchmark functions.
All four benchmark functions invoke CUDA-only warp-specialized GEMM kernels that require Ada GPU (compute capability 9.0). The corresponding test functions in test_example_warp_specialize.py have @tilelang.testing.requires_cuda and @tilelang.testing.requires_cuda_compute_version_eq(9, 0) decorators; the benchmarks must have the same decorators:
- `bench_example_warp_specialize_gemm_barrierpipe_stage2()` (line 8)
- `bench_example_warp_specialize_gemm_copy_0_gemm_1()` (line 13)
- `bench_example_warp_specialize_gemm_copy_1_gemm_0()` (line 18)
- `bench_example_warp_specialize_gemm_softpipe_stage2()` (line 23)
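
For illustration, a sketch of that decorator stack applied to one of the four wrappers; the decorator order and the import layout are assumptions, while the call itself mirrors the benched code above:

```python
import tilelang.testing
import tilelang.tools.bench
import example_warp_specialize_gemm_softpipe_stage2


@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version_eq(9, 0)
def bench_example_warp_specialize_gemm_softpipe_stage2():
    # Skipped on hosts without CUDA or without the required compute capability.
    tilelang.tools.bench.process_func(
        example_warp_specialize_gemm_softpipe_stage2.main, M=1024, N=1024, K=1024)
```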
🤖 Prompt for AI Agents
In examples/warp_specialize/bench_example_warp_specialize.py around lines 8–25,
the four benchmark functions call CUDA-only Ada (compute capability 9.0)
warp-specialized GEMM kernels but lack the CUDA decorators; add
@tilelang.testing.requires_cuda and
@tilelang.testing.requires_cuda_compute_version_eq(9, 0) above each of these
functions (bench_example_warp_specialize_gemm_barrierpipe_stage2,
bench_example_warp_specialize_gemm_copy_0_gemm_1,
bench_example_warp_specialize_gemm_copy_1_gemm_0,
bench_example_warp_specialize_gemm_softpipe_stage2); if tilelang.testing is not
already imported in this file, add the appropriate import so the decorators
resolve.
|
/perf |
|
/perf |
|
/perf |
Summary by CodeRabbit
New Features
Chores