[Feature]: Add benchmark scripts for examples #1240
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run … We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Walkthrough
Adds a lightweight benchmarking framework (tilelang/tools/bench.py), ~30 per-example bench entry scripts under examples/* that register bench_* wrappers calling example main() via process_func, CI + maintenance updates to parse results and upload an image, and a runtime log artifact.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant BenchMain as tilelang.tools.bench.main()
    participant Discover as discover bench_ funcs
    participant BenchFn as bench_*()
    participant Proc as process_func()
    participant Warm as warmup runs
    participant Runs as timed runs
    participant Records as record store
    participant Analyze as analyze_records()
    participant Output as bench.md / bench_result.png
    User->>BenchMain: run examples/.../bench_*.py or bench_all()
    BenchMain->>Discover: introspect caller file(s) for bench_* functions
    Discover-->>BenchMain: list of bench functions
    loop for each bench_* function
        BenchMain->>BenchFn: invoke bench_*()
        BenchFn->>Proc: process_func(example.main, ...)
        Proc->>Warm: perform warmup runs (ignored)
        Proc->>Runs: perform timed runs, collect latencies/failures
        Runs-->>Proc: latencies / failure info
        Proc-->>Records: append result entry
    end
    BenchMain->>Analyze: analyze_records(Records)
    Analyze-->>Output: emit bench.md and bench_result.png
    Output-->>User: display summary and files
```
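To make the diagram concrete, a per-example bench entry script under examples/* looks roughly like the sketch below. The example_gemm module name is just an illustration; process_func and main are the helpers described in the walkthrough, and their exact signatures may differ.

```python
# Hypothetical sketch of a per-example bench entry script (e.g. examples/gemm/bench_example_gemm.py).
# Assumes process_func(func, **kwargs) performs the warmup + timed runs and records the result,
# and that tilelang.tools.bench.main() discovers every bench_* function in the calling file.
import tilelang.tools.bench
import example_gemm  # the example module whose main() is being benchmarked


def bench_example_gemm():
    # Register a timed run of the example's main() entry point.
    tilelang.tools.bench.process_func(example_gemm.main)


if __name__ == "__main__":
    # Discover and run all bench_* functions in this file, then emit bench.md / bench_result.png.
    tilelang.tools.bench.main()
```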
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z — Applied to files:
🔇 Additional comments (5)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 17
🧹 Nitpick comments (2)
examples/gemm/bench_example_gemm.py (1)
1-29: Note potential duplication with bench_gemm.py.
This file appears to duplicate functionality from examples/gemm/bench_gemm.py in the same directory. Both files define identical benchmark wrapper functions for the same example modules. Consider consolidating these into a single benchmark file unless there's a specific reason for maintaining both.

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)

1-3: Consider reordering imports for clarity.
Import tilelang before its sub-module tilelang.tools.bench for better readability.
Apply this diff:

```diff
-import tilelang.tools.bench
 import tilelang
+import tilelang.tools.bench
 import tilelang_example_sparse_tensorcore
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (29)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm/bench_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (29)
examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

tilelang/tools/bench.py (16)
- examples/attention_sink/bench_example_attention_sink.py (1): main (66-67)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1): main (14-15)
- examples/analyze/bench_example_analyze.py (1): main (14-15)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1): main (34-35)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1): main (9-10)
- examples/cast/bench_example_cast.py (1): main (14-15)
- examples/convolution/bench_example_convolution.py (1): main (16-17)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1): main (11-12)
- examples/deepseek_mla/bench_example_mla_decode.py (1): main (11-12)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1): main (35-36)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1): main (45-46)
- examples/dynamic_shape/bench_example_dynamic.py (1): main (9-10)
- examples/elementwise/bench_example_elementwise.py (1): main (9-10)
- examples/flash_attention/bench_example_flash_attention.py (1): main (87-88)
- examples/flash_decoding/bench_example_flash_decoding.py (1): main (16-17)
- examples/fusedmoe/bench_example_fusedmoe.py (1): main (9-10)

examples/gemm/bench_example_gemm.py (2)
- examples/gemm/bench_gemm.py (4): bench_example_gemm_autotune (8-9), bench_example_gemm_intrinsics (12-13), bench_example_gemm_schedule (16-17), bench_example_gemm (20-21)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/seer_attention/bench_block_sparse_attn_tilelang.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (7)
- examples/deepseek_v32/sparse_mla_bwd.py (2): sparse_mla_bwd (283-320), test_sparse_mla_bwd (334-384)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- examples/deepseek_v32/topk_selector.py (1): test_topk_selector (188-245)
- examples/deepseek_v32/fp8_lighting_indexer.py (1): test_fp8_lighting_indexer (260-302)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)
- examples/deepseek_v32/sparse_mla_fwd.py (1): test_sparse_mla_fwd (253-299)
- examples/deepseek_v32/sparse_mla_fwd_pipelined.py (1): test_sparse_mla_fwd_pipelined (400-452)

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version (38-102)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/dynamic_shape/bench_example_dynamic.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/flash_decoding/bench_example_flash_decoding.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_mla/bench_example_mla_decode.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/dequantize_gemm/bench_example_dequantize_gemm.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm/bench_gemm.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/warp_specialize/bench_example_warp_specialize.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_eq (113-114)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm_fp8/bench_example_gemm_fp8.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/topk/bench_topk_tilelang.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/elementwise/bench_example_elementwise.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/analyze/bench_example_analyze.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemm_splitk/bench_example_gemm_splitk.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/flash_attention/bench_example_flash_attention.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/convolution/bench_example_convolution.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_le (121-122)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/fusedmoe/bench_example_fusedmoe.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/linear_attention/bench_linear_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/gemv/bench_example_gemv.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_eq (113-114)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/minference/bench_vs_sparse_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (2)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- tilelang/testing/__init__.py (1): requires_cuda_compute_version_ge (105-106)

examples/cast/bench_example_cast.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
33-34: try-except-pass detected, consider logging the exception
(S110)
33-33: Do not catch blind exception: Exception
(BLE001)
45-45: Do not catch blind exception: Exception
(BLE001)
66-66: Use of exec detected
(S102)
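If the maintainers prefer to address the S110/BLE001 findings rather than suppress them, one conventional pattern is to log warmup failures instead of silently passing. A minimal sketch (run_once is a stand-in, not the project's actual API):

```python
# Sketch only: logging warmup failures instead of silently swallowing them (Ruff S110/BLE001).
import logging

logger = logging.getLogger(__name__)


def warmup(run_once, times=3):
    # run_once is a zero-argument callable standing in for one benchmark invocation.
    for _ in range(times):
        try:
            run_once()
        except Exception as exc:  # noqa: BLE001 - warmup failures are non-fatal by design
            logger.debug("warmup run failed: %r", exc)
```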
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (32)
examples/gemm_splitk/bench_example_gemm_splitk.py (1)
1-19: LGTM! The benchmark wrapper correctly follows the established pattern and properly integrates the GEMM split-K examples into the benchmarking framework.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-24: LGTM! The benchmark wrapper correctly integrates three FP8 GEMM variants into the benchmarking framework with a clean, consistent structure.
examples/gemv/bench_example_gemv.py (1)
1-14: LGTM! The benchmark wrapper correctly integrates the GEMV example into the benchmarking framework with the expected structure.
examples/dynamic_shape/bench_example_dynamic.py (1)
1-14: LGTM! The benchmark wrapper correctly integrates the dynamic shape example into the benchmarking framework with a clean, consistent structure.
examples/fusedmoe/bench_example_fusedmoe.py (1)
1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern, delegating to the shared tilelang.tools.bench utility.

examples/analyze/bench_example_analyze.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both GEMM and convolution analyze examples into the benchmarking framework.
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern.

examples/elementwise/bench_example_elementwise.py (1)

1-14: LGTM! Standard benchmark wrapper.
The implementation correctly follows the established benchmarking pattern.

examples/cast/bench_example_cast.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both FP8 casting examples into the benchmarking framework.

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)

1-19: LGTM! Correctly wraps multiple example entry points.
The implementation follows the established pattern and properly wires both NSA forward and decode examples into the benchmarking framework.
tilelang/tools/bench.py (2)
11-24: LGTM! The context manager properly handles resource cleanup in the finally block, ensuring devnull is closed even if exceptions occur during setup.
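For illustration only, a devnull-redirecting context manager with cleanup in a finally block (the pattern praised here) can be sketched as follows; the real helper in tilelang/tools/bench.py may differ in name and details:

```python
# Hypothetical sketch of an output-suppressing context manager that closes devnull in `finally`.
import contextlib
import os
import sys


@contextlib.contextmanager
def suppress_output():
    devnull = open(os.devnull, "w")
    old_stdout, old_stderr = sys.stdout, sys.stderr
    try:
        sys.stdout = devnull
        sys.stderr = devnull
        yield
    finally:
        # Restore the streams and close devnull even if the body raised.
        sys.stdout, sys.stderr = old_stdout, old_stderr
        devnull.close()
```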
27-58: LGTM! The benchmarking logic correctly handles warmup, timing, and failure reporting. Silently ignoring warmup exceptions is appropriate for benchmarking scenarios.
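To give a feel for the warmup/timing/failure handling described above, here is a simplified, hypothetical stand-in for process_func; the repeat=10 and warmup=3 defaults are taken from a later review comment, and the record format is invented for illustration:

```python
# Simplified sketch of a process_func-style warmup + timed-run loop (not the actual implementation).
import time

records = []  # each entry: (label, list of latencies in ms) or (label, error string)


def process_func(func, repeat=10, warmup=3, **kwargs):
    # Warmup runs: exceptions are deliberately ignored so a flaky warmup does not abort the suite.
    for _ in range(warmup):
        try:
            func(**kwargs)
        except Exception:
            pass

    latencies = []
    try:
        for _ in range(repeat):
            start = time.perf_counter()
            func(**kwargs)
            latencies.append((time.perf_counter() - start) * 1000.0)
        records.append((getattr(func, "__module__", str(func)), latencies))
    except Exception as exc:  # record the failure instead of crashing the whole benchmark run
        records.append((getattr(func, "__module__", str(func)), f"FAILED: {exc!r}"))
```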
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
5-8: Verify correct decorator usage.
Ensure the CUDA compute version requirement (9.0) matches the hardware constraints of the example_deepgemm_fp8_2xAcc module.

11-16: LGTM! The main function and entry point follow the standard benchmarking pattern correctly.

examples/gemm/bench_gemm.py (1)

1-29: LGTM! All benchmark wrapper functions correctly delegate to tilelang.tools.bench.process_func and the main entry point follows the established pattern.

examples/convolution/bench_example_convolution.py (2)

6-13: Verify the decorator mismatch with AI summary.
The AI summary states that bench_example_convolution uses requires_cuda_compute_version_eq(9, 0), but Line 7 shows requires_cuda_compute_version_le(8, 9). Additionally, the summary indicates bench_example_convolution_autotune is decorated with requires_cuda, but Line 12 shows no decorators.

16-21: LGTM! The main function and entry point are correctly implemented.
examples/flash_decoding/bench_example_flash_decoding.py (2)
6-13: LGTM! Both benchmark functions correctly delegate to the benchmarking utility. The decorator usage appropriately restricts hardware requirements for the GQA decode benchmark.

16-21: LGTM! The main function and entry point follow the standard pattern.
examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (2)
9-32: LGTM! All benchmark functions correctly delegate to the appropriate test functions with proper CUDA constraints where needed.

35-40: LGTM! The main function and entry point are correctly implemented.
examples/warp_specialize/bench_example_warp_specialize.py (2)
8-29: LGTM! All benchmark wrapper functions correctly use CUDA decorators and delegate to the appropriate example modules.

32-37: LGTM! The main function and entry point follow the established pattern correctly.
examples/dequantize_gemm/bench_example_dequantize_gemm.py (2)
10-43: LGTM: consistent benchmark wrappers.
The benchmark functions correctly delegate to process_func and apply appropriate CUDA guards. Compute version 9.0 requirements for Hopper-specific variants are appropriate.

45-50: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/blocksparse_attention/bench_example_blocksparse_attention.py (2)
1-7: LGTM: imports are complete.
All necessary imports are present.

34-39: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/attention_sink/bench_example_attention_sink.py (3)
1-7: LGTM: imports are complete.
All necessary imports, including tilelang.testing, are present.

10-64: LGTM: comprehensive attention sink benchmarks.
The benchmark suite covers MHA and GQA variants (forward and backward) with proper CUDA guards and compute version requirements. Sliding window variants correctly pass window_size=128.

66-71: LGTM: standard entry point.
The main function and script guard follow the established pattern.
examples/flash_attention/bench_example_flash_attention.py (2)
17-85: LGTM: comprehensive flash attention benchmarks.
The benchmark suite covers GQA and MHA variants (forward and backward passes, various memory layouts) with proper CUDA guards. Compute version 9.0 requirements for WGMMA-pipelined variants are appropriate.

87-92: LGTM: standard entry point.
The main function and script guard follow the established pattern.
```python
def bench_block_sparse_attn_triton():
    tilelang.tools.bench.process_func(block_sparse_attn_triton.main)


def bench_example_tilelang_block_sparse_attn():
    tilelang.tools.bench.process_func(example_tilelang_block_sparse_attn.main)


def bench_example_tilelang_sparse_gqa_decode_varlen_indice():
    tilelang.tools.bench.process_func(example_tilelang_sparse_gqa_decode_varlen_indice.main)


def bench_example_tilelang_sparse_gqa_decode_varlen_mask():
    tilelang.tools.bench.process_func(example_tilelang_sparse_gqa_decode_varlen_mask.main)


def bench_example_triton_sparse_gqa_decode_varlen_indice():
    tilelang.tools.bench.process_func(example_triton_sparse_gqa_decode_varlen_indice.main)


def bench_example_triton_sparse_gqa_decode_varlen_mask():
    tilelang.tools.bench.process_func(example_triton_sparse_gqa_decode_varlen_mask.main)
```
🧩 Analysis chain
Verify CUDA requirements for block-sparse attention benchmarks.
Unlike other benchmark files in this PR (e.g., bench_example_dequantize_gemm.py, bench_example_attention_sink.py), these functions lack @tilelang.testing.requires_cuda decorators. Since block-sparse attention typically requires GPU execution, confirm whether these benchmarks can run on CPU or if decorators should be added.
Run the following script to check if the imported modules have any CUDA dependencies:
🏁 Script executed:
```shell
#!/bin/bash
# Check for CUDA usage in the block sparse attention example modules
echo "=== Checking for CUDA/GPU usage in example modules ==="
for module in block_sparse_attn_triton example_tilelang_block_sparse_attn example_tilelang_sparse_gqa_decode_varlen_indice example_tilelang_sparse_gqa_decode_varlen_mask example_triton_sparse_gqa_decode_varlen_indice example_triton_sparse_gqa_decode_varlen_mask; do
  echo "--- Module: $module ---"
  fd "${module}.py" examples/blocksparse_attention --exec cat {} \; | head -50
done
```

Length of output: 9127
---
🏁 Script executed:
```shell
#!/bin/bash
# Check if requires_cuda decorator exists and how it's used in other benchmarks
echo "=== Checking for requires_cuda usage in benchmark files ==="
fd "bench_.*\.py$" examples --exec grep -l "requires_cuda" {} \;
echo -e "\n=== Checking bench_example_dequantize_gemm.py and bench_example_attention_sink.py ==="
fd "bench_example_dequantize_gemm.py|bench_example_attention_sink.py" examples --exec head -40 {} \;
echo -e "\n=== Checking what tilelang.tools.bench.process_func does ==="
rg -n "def process_func" --type py -A 10
Length of output: 4482
---
🏁 Script executed:
```shell
#!/bin/bash
# Check imports in the blocksparse benchmark file
head -20 examples/blocksparse_attention/bench_example_blocksparse_attention.py
```

Length of output: 713
Add @tilelang.testing.requires_cuda decorators to all benchmark functions.
The benchmark file is missing CUDA requirement decorators that are consistently used across other benchmark files in the repository. All six benchmark functions in this file call GPU-exclusive kernels (Triton and TileLang JIT-compiled functions), so they require CUDA to run.
Fixes needed:
- Add import tilelang.testing at the top (line 1)
- Add the @tilelang.testing.requires_cuda decorator to all six benchmark functions:
  - bench_block_sparse_attn_triton() (line 10)
  - bench_example_tilelang_block_sparse_attn() (line 14)
  - bench_example_tilelang_sparse_gqa_decode_varlen_indice() (line 18)
  - bench_example_tilelang_sparse_gqa_decode_varlen_mask() (line 22)
  - bench_example_triton_sparse_gqa_decode_varlen_indice() (line 26)
  - bench_example_triton_sparse_gqa_decode_varlen_mask() (line 30)
🤖 Prompt for AI Agents
In examples/blocksparse_attention/bench_example_blocksparse_attention.py around
lines 1 and 10-32, the file lacks the CUDA test decorator and import: add
"import tilelang.testing" at the top of the file, then prepend
"@tilelang.testing.requires_cuda" to each of the six benchmark functions at the
specified lines — bench_block_sparse_attn_triton() (line 10),
bench_example_tilelang_block_sparse_attn() (line 14),
bench_example_tilelang_sparse_gqa_decode_varlen_indice() (line 18),
bench_example_tilelang_sparse_gqa_decode_varlen_mask() (line 22),
bench_example_triton_sparse_gqa_decode_varlen_indice() (line 26), and
bench_example_triton_sparse_gqa_decode_varlen_mask() (line 30) — so each
GPU-only benchmark is skipped when CUDA is unavailable.
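Applied to the first wrapper above, the suggested fix would look roughly like this (a sketch; as a later comment notes, the decorator only takes effect under pytest, not when the script is run via bench.main()):

```python
# Sketch of the suggested fix for one of the six GPU-only wrappers.
import tilelang.testing
import tilelang.tools.bench
import block_sparse_attn_triton


@tilelang.testing.requires_cuda
def bench_block_sparse_attn_triton():
    # Skipped by pytest when no CUDA device is available.
    tilelang.tools.bench.process_func(block_sparse_attn_triton.main)
```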
```python
import tilelang.tools.bench
import example_deepgemm_fp8_2xAcc
```
Missing import for decorator module.
Lines 5-6 use @tilelang.testing.requires_cuda and @tilelang.testing.requires_cuda_compute_version_eq decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_deepgemm_fp8_2xAcc
```
🤖 Prompt for AI Agents
In examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py around lines 1
to 6, the decorators @tilelang.testing.requires_cuda and
@tilelang.testing.requires_cuda_compute_version_eq are used but tilelang.testing
is not imported; add an import for the testing module (e.g., import
tilelang.testing or from tilelang import testing) near the top of the file
before the decorator usage so the decorators resolve correctly.
```python
import tilelang.tools.bench
import example_mla_decode
```
Critical: Missing import for decorators.
The decorators on lines 5-6 reference tilelang.testing but the module is not imported, which will cause a NameError when the module loads.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_mla_decode
```
🤖 Prompt for AI Agents
In examples/deepseek_mla/bench_example_mla_decode.py around lines 1 to 2, the
decorators later reference tilelang.testing but the module isn't imported,
causing a NameError; add an import for the testing module (e.g., import
tilelang.testing) at the top of the file so the decorators can resolve
correctly.
```python
import tilelang.tools.bench
import fp8_lighting_indexer
import sparse_mla_bwd
import sparse_mla_fwd
import sparse_mla_fwd_pipelined
import topk_selector
```
Missing import for decorator module.
Lines 17-18, 23-24, and 29-30 use @tilelang.testing decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import fp8_lighting_indexer
 import sparse_mla_bwd
 import sparse_mla_fwd
 import sparse_mla_fwd_pipelined
 import topk_selector
```
🤖 Prompt for AI Agents
In examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py around lines 1-6
(decorators used at lines ~17-30), the module providing the @tilelang.testing
decorator isn't imported; add an import for the testing module (e.g., import
tilelang.testing) near the other top-level imports so the @tilelang.testing
decorators resolve.
```python
import tilelang.tools.bench
import block_sparse_attn_tilelang
```
Missing import for decorator.
The @tilelang.testing.requires_cuda decorator on line 5 requires importing tilelang.testing, which is currently missing.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import block_sparse_attn_tilelang
```
🤖 Prompt for AI Agents
In examples/seer_attention/bench_block_sparse_attn_tilelang.py around lines 1 to
3, the file uses the @tilelang.testing.requires_cuda decorator but does not
import tilelang.testing; add an import for tilelang.testing (e.g., import
tilelang.testing) near the top alongside the existing imports so the decorator
is resolved.
```python
@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version(9, 0)
def bench_example_sparse_tensorcore():
    tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```
🧩 Analysis chain
Verify runtime CUDA requirement enforcement.
The pytest decorators (@requires_cuda, @requires_cuda_compute_version) only add metadata for pytest runs. When executing this script standalone via bench.main(), these decorators won't skip execution if requirements aren't met—the benchmark will attempt to run and likely fail with an unclear error.
Run this script to check if bench.process_func or the target module has built-in CUDA validation:
🏁 Script executed:
```shell
#!/bin/bash
# Check if bench.process_func or example modules validate CUDA requirements at runtime

# Check bench.process_func implementation for CUDA checks
ast-grep --pattern $'def process_func($$$) {
  $$$
}' | head -100

# Check if the example module has runtime CUDA guards
rg -n --type=py -C3 'cuda.*available|torch.*cuda.*is_available|requires.*cuda' tilelang_example_sparse_tensorcore.py
```

Length of output: 77
🏁 Script executed:
```shell
#!/bin/bash
# Find relevant files: bench module and example modules
echo "=== Finding bench-related files ==="
fd -e py 'bench' --type f | head -20
echo ""
echo "=== Finding example_sparse_tensorcore file ==="
fd 'sparse' --type f | head -20
echo ""
echo "=== Searching for process_func definition ==="
rg -n 'def process_func' --type py -A 10 | head -40
```

Length of output: 3080
🏁 Script executed:
```shell
#!/bin/bash
# Find sparse_tensorcore specifically and check full process_func
echo "=== Finding sparse_tensorcore files ==="
fd 'sparse_tensorcore' --type f
echo ""
echo "=== Checking examples directory structure ==="
ls -la examples/ | grep -i sparse || echo "No sparse directories in examples"
echo ""
echo "=== Getting full process_func implementation ==="
rg -n 'def process_func' --type py -A 30 tilelang/tools/bench.py
echo ""
echo "=== Checking for CUDA validation in bench.py ==="
rg -n 'cuda|requires_cuda' --type py tilelang/tools/bench.py
```

Length of output: 1653
🏁 Script executed:
```shell
#!/bin/bash
# Check the actual example module and benchmark function for runtime CUDA validation
echo "=== Contents of bench_example_sparse_tensorcore.py ==="
cat examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
echo ""
echo "=== Checking tilelang_example_sparse_tensorcore.py for CUDA guards ==="
cat examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py | head -50
echo ""
echo "=== Search for CUDA runtime checks in example module ==="
rg -n 'cuda|device|requires' --type py examples/sparse_tensorcore/tilelang_example_sparse_tensorcore.py | head -20
```

Length of output: 2519
🏁 Script executed:
```shell
#!/bin/bash
# Check bench.main() for decorator/CUDA handling
echo "=== Search for bench.main() implementation ==="
rg -n 'def main' --type py tilelang/tools/bench.py -A 20
echo ""
echo "=== Check if bench module handles decorators ==="
rg -n 'requires_cuda|decorator|mark' --type py tilelang/tools/bench.py
```

Length of output: 462
Add runtime CUDA validation to benchmark execution.
Pytest decorators (@requires_cuda, @requires_cuda_compute_version) are metadata-only and don't enforce requirements when bench.main() executes the file via exec() (tilelang/tools/bench.py:66). When bench_example_sparse_tensorcore() runs, it bypasses decorator logic and calls tilelang_example_sparse_tensorcore.main(), which unconditionally creates CUDA tensors (lines 109-111). On incompatible hardware, this fails with a confusing traceback instead of a clear skip message.
Add runtime CUDA validation to bench_example_sparse_tensorcore() before calling process_func():
```python
@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version(9, 0)
def bench_example_sparse_tensorcore():
    if not torch.cuda.is_available():
        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA")
        return
    if torch.cuda.get_device_capability()[0] < 9:
        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires compute capability >= 9.0")
        return
    tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```

🤖 Prompt for AI Agents
In examples/sparse_tensorcore/bench_example_sparse_tensorcore.py around lines 6
to 9, the benchmark relies on pytest decorators only, but the test runner calls
the example via exec so decorators don't prevent execution on non-CUDA hardware;
add explicit runtime CUDA validation before calling
tilelang.tools.bench.process_func: check torch.cuda.is_available() and if false
print a clear "SKIPPED: {__name__}.bench_example_sparse_tensorcore requires
CUDA" message and return; then check torch.cuda.get_device_capability()[0]
(major) and if < 9 print "SKIPPED: {__name__}.bench_example_sparse_tensorcore
requires compute capability >= 9.0" and return; keep the existing decorators and
only call process_func after both checks pass.
```python
import tilelang.tools.bench
import example_topk
```
Critical: Missing import for decorators.
The decorator on line 5 references tilelang.testing but the module is not imported, which will cause a NameError when the module loads.
Apply this diff to add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_topk
```
🤖 Prompt for AI Agents
In examples/topk/bench_topk_tilelang.py around lines 1 to 2, the file imports
tilelang.tools.bench and example_topk but does not import tilelang.testing which
the decorator on line 5 references; add a top-level import for tilelang.testing
(e.g., import tilelang.testing) alongside the existing imports so the decorator
can resolve without a NameError.
```python
import tilelang.tools.bench
import example_warp_specialize_gemm_barrierpipe_stage2
import example_warp_specialize_gemm_copy_0_gemm_1
import example_warp_specialize_gemm_copy_1_gemm_0
import example_warp_specialize_gemm_softpipe_stage2
```
Missing import for decorator module.
Lines 8-9, 14-15, 20-21, and 26-27 use @tilelang.testing decorators, but tilelang.testing is not imported.
Add the missing import:
```diff
 import tilelang.tools.bench
+import tilelang.testing
 import example_warp_specialize_gemm_barrierpipe_stage2
 import example_warp_specialize_gemm_copy_0_gemm_1
 import example_warp_specialize_gemm_copy_1_gemm_0
 import example_warp_specialize_gemm_softpipe_stage2
```
🤖 Prompt for AI Agents
In examples/warp_specialize/bench_example_warp_specialize.py around lines 1 to
5, the module uses @tilelang.testing decorators later but never imports
tilelang.testing; add an import for the testing decorator (e.g., import
tilelang.testing or from tilelang import testing) near the top with the other
imports so the decorator references resolve.
Actionable comments posted: 2
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (15)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (9)
- examples/topk/bench_topk_tilelang.py
- examples/linear_attention/bench_linear_attn.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py
- examples/convolution/bench_example_convolution.py
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
- examples/flash_decoding/bench_example_flash_decoding.py
🧰 Additional context used
🧬 Code graph analysis (6)
examples/minference/bench_vs_sparse_attn.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_mla/bench_example_mla_decode.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/warp_specialize/bench_example_warp_specialize.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (6)
- examples/deepseek_v32/sparse_mla_bwd.py (2): sparse_mla_bwd (283-320), test_sparse_mla_bwd (334-384)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
- examples/deepseek_v32/topk_selector.py (1): test_topk_selector (188-245)
- examples/deepseek_v32/fp8_lighting_indexer.py (1): test_fp8_lighting_indexer (260-302)
- examples/deepseek_v32/sparse_mla_fwd.py (1): test_sparse_mla_fwd (253-299)
- examples/deepseek_v32/sparse_mla_fwd_pipelined.py (1): test_sparse_mla_fwd_pipelined (400-452)

examples/flash_attention/bench_example_flash_attention.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🔇 Additional comments (15)
examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (4)
1-6: Past review comment is no longer applicable.
The previous review flagged a missing tilelang.testing import for decorators, but the current code contains no decorators. This concern has been resolved.

9-26: LGTM! The benchmark wrapper functions follow a consistent pattern, delegating to process_func with the appropriate test functions. All target test functions have default parameters, so calling them without arguments is safe.

29-30: LGTM! The main function correctly delegates to the benchmarking framework, which will auto-discover and execute all bench_* functions defined in this module.

33-34: LGTM! Standard Python entry point pattern correctly implemented.
examples/warp_specialize/bench_example_warp_specialize.py (3)
1-5: LGTM! Past review issue resolved.
The imports are clean and correct. The previous review flagged missing tilelang.testing for decorators, but the current code has removed those decorators entirely, resolving the issue.

8-21: Correct usage of the benchmarking framework.
All four benchmark functions correctly follow the framework pattern: each wraps an example module's main() with process_func and uses the bench_* naming convention for auto-discovery.

24-29: LGTM! The main() function correctly delegates to the benchmarking framework, and the entry point follows standard Python conventions.

examples/minference/bench_vs_sparse_attn.py (4)

5-6: LGTM! The bench function correctly wraps the example module's main() using process_func, following the established pattern across the PR. The function reference is passed correctly as a callable without invoking it.

9-10: LGTM! The main() function correctly delegates to tilelang.tools.bench.main(), which will auto-discover and execute all bench_* functions in this module.

13-14: LGTM! Standard Python entry point idiom is correctly implemented.

1-2: Imports are correct and verified.
The example_vertical_slash_sparse_attn module exists in the same directory and its main() function accepts no required arguments (argv=None), confirming it can be called without parameters as intended by the benchmarking wrapper.

examples/deepseek_mla/bench_example_mla_decode.py (4)
1-2: LGTM! Imports are correct.
The imports are appropriate for the benchmarking wrapper pattern. Note that the previous review comment about missing tilelang.testing is no longer applicable since there are no decorators in the current code.

9-10: LGTM! Standard entry point pattern.
The main() function correctly delegates to bench.main(), which will auto-discover and execute all bench_* functions in this module.

13-14: LGTM! Standard Python entry point.
The if __name__ == "__main__" guard is correct and allows the script to be executed directly or imported as a module.

5-6: Verification confirms correct implementation.
example_mla_decode.main exists at line 283 in examples/deepseek_mla/example_mla_decode.py and is properly referenced. The function correctly passes the function reference to process_func, and the bench_* naming convention enables auto-discovery. No issues found.
Actionable comments posted: 2
♻️ Duplicate comments (2)
examples/convolution/bench_example_convolution.py (1)
1-3: Past review comment appears outdated.
The existing review claims that lines 6-7 use @tilelang.testing.requires_cuda decorators and require importing tilelang.testing. However, the current code shows no decorators on lines 6-7, just plain function definitions. The current imports are sufficient for the code as written.

examples/attention_sink/bench_example_attention_sink.py (1)

55-56: Same __name__ guard issue; fix in bench.py is preferred.
The past review correctly identifies that this guard will raise NameError when tilelang.tools.bench.main() re-executes the file via exec() with an undefined __name__. However, the suggested globals().get("__name__") workaround would need to be duplicated across all ~30 bench scripts.
The root cause is in tilelang/tools/bench.py lines 64-66, where the exec namespace is created empty. Fixing it there (as commented in my review) will resolve the issue for all bench scripts at once.
🧹 Nitpick comments (1)
tilelang/tools/bench.py (1)
63-66: Add defensive checks for frame introspection and file operations.
Several robustness issues:
- inspect.getsourcefile() can return None if the frame lacks source information, which would cause line 65 to fail.
- The file is opened without explicit encoding (the platform default will be used).
- sys._getframe(1) is CPython-specific (the underscore prefix indicates a private API).

While the exec usage is flagged by static analysis (S102), it's likely acceptable for a benchmarking tool in a controlled environment.
Consider adding defensive checks:

```diff
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
+    if test_file is None:
+        raise RuntimeError("Cannot determine source file for caller")
     module = {}
-    with open(test_file) as f:
+    with open(test_file, encoding="utf-8") as f:
         exec(f.read(), module)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tilelang/tools/bench.py (1)
- examples/attention_sink/bench_example_attention_sink.py (1): main (51-52)

examples/convolution/bench_example_convolution.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)

examples/attention_sink/bench_example_attention_sink.py (1)
- tilelang/tools/bench.py (2): process_func (27-58), main (61-70)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
33-34: try-except-pass detected, consider logging the exception
(S110)
33-33: Do not catch blind exception: Exception
(BLE001)
45-45: Do not catch blind exception: Exception
(BLE001)
66-66: Use of exec detected
(S102)
🔇 Additional comments (1)
examples/attention_sink/bench_example_attention_sink.py (1)
9-48: LGTM! Clean benchmark organization.
The benchmark functions follow a consistent pattern:
- Base configurations call process_func(module.main) directly
- Sliding window variants pass window_size=128 as a kwarg
- Clear naming convention distinguishes MHA/GQA, forward/backward, and different implementations

The pattern assumes the underlying main() functions accept an optional window_size parameter, which should be verified during testing.
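For illustration, the two wrapper styles described above would look roughly like the following sketch; the example_mha_sink_fwd_bhsd module name is assumed, while process_func and window_size=128 come from the review:

```python
# Hypothetical sketch of a base wrapper and its sliding-window variant.
import tilelang.tools.bench
import example_mha_sink_fwd_bhsd  # assumed example module name


def bench_example_mha_sink_fwd_bhsd():
    # Base configuration: run the example's main() with its defaults.
    tilelang.tools.bench.process_func(example_mha_sink_fwd_bhsd.main)


def bench_example_mha_sink_fwd_bhsd_sliding_window():
    # Sliding-window variant: forward window_size=128 to the example's main().
    tilelang.tools.bench.process_func(example_mha_sink_fwd_bhsd.main, window_size=128)
```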
```python
module = {}
with open(test_file) as f:
    exec(f.read(), module)
```
Critical: exec namespace breaks __name__ guard in bench scripts.
The empty module = {} dict used as the exec namespace will cause NameError when bench scripts contain if __name__ == "__main__": guards. Python's exec will auto-populate __builtins__ but not __name__, __file__, or other module-level variables.
This explains the past review comment on examples/attention_sink/bench_example_attention_sink.py lines 55-56. All bench scripts using this pattern will fail at runtime when tilelang.tools.bench.main() re-executes them.
Fix by pre-populating the exec namespace with standard module globals:
```diff
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
-    module = {}
+    module = {"__name__": "__benchmarked__", "__file__": test_file}
     with open(test_file) as f:
         exec(f.read(), module)
```

Note: Using "__benchmarked__" (or any value other than "__main__") ensures the guard skips execution during discovery, preventing infinite recursion.
🧰 Tools
🪛 Ruff (0.14.4)
66-66: Use of exec detected
(S102)
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 64–66, the current exec(namespace) uses
module = {} which doesn't set module-level globals like __name__ or __file__,
breaking scripts that use if __name__ == "__main__": guards; to fix,
pre-populate the exec namespace with standard module globals before exec, e.g.
set '__name__' to a non-main sentinel such as '__benchmarked__' (so discovery
doesn't trigger main), set '__file__' to the test file path (and optionally
'__package__' if needed), then exec the file into that namespace; replace module
= {} with this pre-populated dict so bench scripts run without NameError and
avoid re-running their main during discovery.
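To put the suggested namespace fix in context, the discovery-and-run flow in bench.main() could be sketched as follows; the bench_ prefix, exec-based loading, and the pre-populated globals come from this review, while the rest is assumed:

```python
# Hypothetical sketch of bench.main() with the pre-populated exec namespace applied.
import inspect
import sys


def main():
    # Locate the bench script that called us.
    test_file = inspect.getsourcefile(sys._getframe(1))
    # Pre-populate module globals so `if __name__ == "__main__":` guards are defined
    # (and evaluate to False, avoiding recursive re-entry into main()).
    module = {"__name__": "__benchmarked__", "__file__": test_file}
    with open(test_file) as f:
        exec(f.read(), module)
    # Run every bench_* function discovered in the executed file.
    for name, obj in module.items():
        if name.startswith("bench_") and callable(obj):
            obj()
```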
Actionable comments posted: 2
♻️ Duplicate comments (8)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
6-7: Add runtime CUDA validation before calling process_func.
As flagged in a previous review, this benchmark lacks runtime validation of CUDA requirements. When executed via bench.main(), the script bypasses pytest decorator checks and will attempt to run on incompatible hardware, resulting in unclear error messages rather than a clean skip notification.
Add explicit checks before calling process_func:

```diff
+import torch
+
 def bench_example_sparse_tensorcore():
+    if not torch.cuda.is_available():
+        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA")
+        return
+    if torch.cuda.get_device_capability()[0] < 9:
+        print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires compute capability >= 9.0")
+        return
     tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)
```

examples/gemm/bench_gemm.py (1)
24-25: Consider using standard __name__ check.
Same as the comment on examples/dynamic_shape/bench_example_dynamic.py: the standard if __name__ == "__main__": is more idiomatic.

examples/gemm/bench_example_gemm.py (2)

24-25: Consider using standard __name__ check.
Same as previous files: prefer if __name__ == "__main__": over globals().get("__name__") == "__main__".

1-25: Critical: Duplicate benchmark file detected.
This file is a duplicate of examples/gemm/bench_gemm.py in the same directory. Refer to the comment on that file for details.

examples/cast/bench_example_cast.py (1)

14-15: Consider using standard __name__ check.
Same suggestion as other files: use if __name__ == "__main__": for better readability.

examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (2)

9-10: Consider using standard __name__ check.
Same suggestion as other files.

1-6: Critical: Filename mismatch remains unresolved.
The past review correctly identified that the filename uses "splitk" but the imported module and function name use "streamk". This naming inconsistency should be resolved by renaming the file to bench_example_tilelang_gemm_streamk.py.

examples/flash_decoding/bench_example_flash_decoding.py (1)

14-15: Consider using standard __name__ check.
Same suggestion as other files: prefer the standard if __name__ == "__main__": idiom.
🧹 Nitpick comments (8)
examples/deepseek_mla/bench_example_mla_decode.py (1)
9-10: Use the idiomatic main guard.
The globals().get("__name__") pattern is unnecessarily complex. Use the standard Python idiom for better readability.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/warp_specialize/bench_example_warp_specialize.py (1)

24-25: Use the standard Python idiom for the main guard.
The globals().get("__name__") pattern is unnecessarily defensive and non-idiomatic. The __name__ variable is always defined in Python modules.
Apply this diff to use the standard idiom:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)

34-35: Use the standard Python idiom for the main guard.
The globals().get("__name__") pattern is unconventional and less clear than the standard Python idiom. The __name__ variable is always available at module level, so the .get() call adds unnecessary complexity.
Apply this diff to use the standard idiom:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (2)

1-3: Consider removing unused tilelang import.
The tilelang module is imported but not directly used in this script. Unless it's required for side effects or future decorator additions, consider removing it.

```diff
-import tilelang.tools.bench
-import tilelang
-import tilelang_example_sparse_tensorcore
+import tilelang.tools.bench
+import tilelang_example_sparse_tensorcore
```

10-11: Simplify main guard to use standard Python idiom.
The globals().get("__name__") pattern works but is unnecessarily indirect. The standard __name__ check is clearer and more idiomatic.

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)

9-10: Use standard Python idiom for main guard.
The pattern globals().get("__name__") is unnecessarily complex. The standard Python idiom is clearer and more widely recognized.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)

14-15: Simplify to standard Python idiom.
The use of globals().get("__name__") is unnecessarily verbose. Use the standard __name__ check instead.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```

examples/dynamic_shape/bench_example_dynamic.py (1)

9-10: Consider using standard __name__ check.
The globals().get("__name__") pattern is unnecessarily verbose. The standard Python idiom if __name__ == "__main__": is more readable and widely recognized.
Apply this diff:

```diff
-if globals().get("__name__") == "__main__":
+if __name__ == "__main__":
     tilelang.tools.bench.main()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm/bench_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
- examples/gemv/bench_example_gemv.py
- examples/elementwise/bench_example_elementwise.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/topk/bench_topk_tilelang.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/linear_attention/bench_linear_attn.py
- examples/gemm_fp8/bench_example_gemm_fp8.py
- examples/fusedmoe/bench_example_fusedmoe.py
- examples/minference/bench_vs_sparse_attn.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/convolution/bench_example_convolution.py
🧰 Additional context used
🧬 Code graph analysis (15)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/flash_decoding/bench_example_flash_decoding.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1)
tilelang/tools/bench.py (2)
process_func (27-58), main (61-70)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/analyze/bench_example_analyze.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/deepseek_nsa/bench_example_tilelang_nsa.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/gemm/bench_example_gemm.py (2)
examples/gemm/bench_gemm.py (4)
bench_example_gemm_autotune(8-9)bench_example_gemm_intrinsics(12-13)bench_example_gemm_schedule(16-17)bench_example_gemm(20-21)tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/gemm/bench_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(27-58)main(61-70)
🔇 Additional comments (17)
examples/deepseek_mla/bench_example_mla_decode.py (2)
1-2: Imports look correct.The necessary modules are imported. Note that the past review comment about missing
tilelang.testingimport appears to be outdated, as there are no decorators in the current code that would require it.
5-6: LGTM!The benchmark function correctly delegates to
process_funcand follows the established pattern used across other benchmark wrappers in this PR.examples/warp_specialize/bench_example_warp_specialize.py (2)
1-5: Imports look good.The required imports are present and correctly structured. The past review concern about missing
tilelang.testingappears to have been resolved by removing the decorators.
8-21: Benchmark functions follow the expected pattern.Each function correctly delegates to the imported module's
mainviaprocess_func. Using the defaultrepeat=10andwarmup=3parameters is reasonable for general benchmarking.examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (2)
1-2: Imports look correct.Both imports are properly used in the benchmarking wrapper.
5-6: Benchmark function correctly delegates to the benchmarking utility.The implementation properly passes the function object to
process_func, which will handle the repeated execution and timing.examples/analyze/bench_example_analyze.py (3)
1-3: LGTM!The imports are clean and correctly reference the benchmark utility and the example modules to be benchmarked.
6-11: LGTM!Both benchmark functions correctly wrap their respective example modules'
main()functions using the benchmarking utility pattern.
14-15: LGTM!The
globals().get("__name__")guard correctly handles both direct script execution and re-execution viaexec()in the benchmark harness.examples/flash_attention/bench_example_flash_attention.py (3)
1-14: LGTM!All imports are correctly referenced by the benchmark functions below. The import list comprehensively covers GQA/MHA variants for forward and backward passes.
17-66: LGTM!All 12 benchmark functions consistently follow the correct pattern, wrapping their respective example modules'
main()functions with the benchmarking utility.
69-70: LGTM!The script entry guard correctly uses the
globals().get("__name__")pattern to handle both direct execution and benchmark harness re-execution.examples/deepseek_nsa/bench_example_tilelang_nsa.py (2)
10-11: LGTM!The benchmark wrapper follows the same pattern as the first function and is consistent with the framework design.
6-7: No actionable issues found; code correctly implements the assumed contract.The wrapped function
example_tilelang_nsa_fwd.mainis properly imported (line 2) and correctly referenced in the benchmark wrapper. The main function is defined with no required parameters (line 135 of example_tilelang_nsa_fwd.py), and the call at line 7 of the bench file correctly passes it toprocess_funcwithout arguments.examples/attention_sink/bench_example_attention_sink.py (3)
51-52: LGTM! The critical__name__issue has been resolved.The entry point correctly uses
globals().get("__name__")instead of directly accessing__name__, which prevents theNameErrorwhen the bench harness execs this file in an empty namespace. This allows the script to run standalone while skipping the guard cleanly during harness execution.
9-48: LGTM! All targetmain()functions correctly accept thewindow_sizeparameter.All five target example modules'
main()functions have thewindow_size: Optional[int] = Noneparameter and properly handle it. The bench wrapper functions correctly delegate toprocess_func, and the_sliding_windowvariants appropriately passwindow_size=128to their target functions.
1-6: I need to examine the actual repository structure to verify whether this is part of the TileLang repository or a separate project.This file is part of the TileLang repository, a domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels. The test failure shows that
tilelangis not installed in the test environment, which is the expected reason the script failed to run—this is an environment setup issue, not a code problem.The imports in lines 1-6 are correct. tilelang.tools.bench is part of the standard tilelang package structure, and when tilelang is properly installed (as intended in the TileLang repository context), these imports will resolve correctly. The example modules imported in lines 2-6 are local to the examples directory and are designed to be imported when the benchmark script runs in the proper environment.
The review comment incorrectly frames this as an environment-dependent sys.path resolution issue requiring special invocation, when in fact the imports follow standard Python package conventions and will work correctly once tilelang is installed.
Likely an incorrect or invalid review comment.
import example_tilelang_nsa_fwd
import example_tilelang_nsa_decode
🧩 Analysis chain
Verify that the example modules are importable.
The relative imports assume the example modules are in the same directory or on PYTHONPATH. Ensure that these modules exist and can be imported when the script is run.
Run the following script to verify the modules exist:
🏁 Script executed:
#!/bin/bash
# Description: Verify the example modules exist in the expected location
# Check if the example modules exist
fd -t f "example_tilelang_nsa_fwd.py" examples/deepseek_nsa/
fd -t f "example_tilelang_nsa_decode.py" examples/deepseek_nsa/
Length of output: 102
---
🏁 Script executed:
```shell
#!/bin/bash
# Check package structure and verify imports
fd -t f "__init__.py" examples/deepseek_nsa/
fd -t f "__init__.py" examples/
# Check the actual bench_example file structure
head -20 examples/deepseek_nsa/bench_example_tilelang_nsa.py
Length of output: 400
---
🏁 Script executed:
```shell
#!/bin/bash
# Check for setup files or documentation about running examples
fd -t f "setup.py|pyproject.toml|README|run" examples/deepseek_nsa/ -o examples/
# Check if there's any sys.path manipulation in the script or parent modules
rg -n "sys.path" examples/deepseek_nsa/
Length of output: 137
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for setup/documentation files
find examples/deepseek_nsa -type f -name "*.py" | head -20
find examples -maxdepth 2 -type f \( -name "setup.py" -o -name "pyproject.toml" -o -name "README*" \)
# Check for sys.path manipulation
rg -n "sys.path" examples/
Length of output: 2643
---
🏁 Script executed:
```shell
#!/bin/bash
# Check how test_example_tilelang_nsa.py handles imports
head -10 examples/deepseek_nsa/test_example_tilelang_nsa.py
# Also check if any other benchmark files exist and how they handle imports
find examples/deepseek_nsa -name "bench_*.py" -type f
head -10 examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py 2>/dev/null || echo "File not found or error"
Length of output: 464
Fix bare imports to use relative syntax or add sys.path setup.
Lines 2-3 use bare absolute imports that will fail at runtime unless examples/deepseek_nsa/ is on sys.path. Since this directory is not a package (no __init__.py), use relative imports instead:
from . import example_tilelang_nsa_fwd
from . import example_tilelang_nsa_decode
Alternatively, add to the top of the file:
import sys
import os
sys.path.insert(0, os.path.dirname(__file__))
The test file (test_example_tilelang_nsa.py) in the same directory has the same issue.
🤖 Prompt for AI Agents
In examples/deepseek_nsa/bench_example_tilelang_nsa.py around lines 2-3 the bare
absolute imports will fail because the directory is not on sys.path and isn’t a
package; change to relative imports (from . import example_tilelang_nsa_fwd and
from . import example_tilelang_nsa_decode) or, if you prefer module-style
imports, add at the top: import sys, os and sys.path.insert(0,
os.path.dirname(__file__)) so the local modules can be found (apply the same fix
to test_example_tilelang_nsa.py).
@yyttt6 would you mind giving us a doc about this part?
Actionable comments posted: 2
♻️ Duplicate comments (1)
tilelang/tools/bench.py (1)
104-106: Critical: exec namespace breaks the __name__ guard in bench scripts.
The empty module = {} dict causes a NameError when bench scripts contain standard if __name__ == "__main__": guards. Python's exec populates __builtins__ but not __name__, __file__, or other module-level variables. This is why all benchmark scripts use the workaround globals().get("__name__") instead of the standard __name__ check.
Fix by pre-populating the exec namespace:
 def main():
     # Entry point — automatically run all bench_* functions in caller file.
     test_file = inspect.getsourcefile(sys._getframe(1))
-    module = {}
+    module = {"__name__": "__benchmarked__", "__file__": test_file, "__builtins__": __builtins__}
     with open(test_file) as f:
         exec(f.read(), module)
Note: Using "__benchmarked__" (not "__main__") prevents the guard from triggering during discovery, avoiding infinite recursion.
🧹 Nitpick comments (1)
tilelang/tools/bench.py (1)
48-50: Blind exception catching may mask bugs in benchmark functions.
While printing the traceback, catching all exceptions without distinguishing between transient errors (e.g., CUDA OOM) and persistent bugs (e.g., TypeError from incorrect arguments) makes debugging harder.
Consider limiting the exception scope or adding more detailed error classification in a future iteration.
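One possible shape for that classification, sketched here rather than taken from the PR; it assumes torch is importable in the harness and a recent PyTorch that exposes torch.cuda.OutOfMemoryError:

import traceback

import torch

def call_benchmark(func, *args, **kwargs):
    try:
        return func(*args, **kwargs)
    except torch.cuda.OutOfMemoryError:
        # Environmental/transient: report and move on to the next benchmark.
        print(f"SKIPPED (CUDA OOM): {getattr(func, '__module__', func)}")
    except TypeError:
        # Usually a harness bug (wrong arguments); surface it instead of recording a latency.
        traceback.print_exc()
        raise
    except Exception:
        # Anything else: keep the run going but keep the evidence.
        traceback.print_exc()
    return None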
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (28)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (14)
- examples/analyze/bench_example_analyze.py
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/minference/bench_vs_sparse_attn.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/gemm/bench_example_gemm.py
- examples/elementwise/bench_example_elementwise.py
- examples/linear_attention/bench_linear_attn.py
- examples/topk/bench_topk_tilelang.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
🧰 Additional context used
🧬 Code graph analysis (13)
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/flash_decoding/bench_example_flash_decoding.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/gemv/bench_example_gemv.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/convolution/bench_example_convolution.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/fusedmoe/bench_example_fusedmoe.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(30-68)main(101-112)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
36-37: try-except-pass detected, consider logging the exception
(S110)
36-36: Do not catch blind exception: Exception
(BLE001)
48-48: Do not catch blind exception: Exception
(BLE001)
106-106: Use of exec detected
(S102)
🔇 Additional comments (13)
examples/cast/bench_example_cast.py (1)
1-15: LGTM! Benchmark wrapper follows the standard pattern.The use of
globals().get("__name__")in the guard is the correct workaround for the exec namespace issue intilelang/tools/bench.py. Once that core issue is fixed, this can be simplified to the standardif __name__ == "__main__":pattern.examples/dynamic_shape/bench_example_dynamic.py (1)
1-10: LGTM! Standard benchmark wrapper implementation.Follows the established pattern correctly. The
globals().get("__name__")workaround is appropriate given the current state oftilelang/tools/bench.py.examples/gemv/bench_example_gemv.py (1)
1-10: LGTM! Consistent with other benchmark wrappers.Implementation is correct and follows the standard pattern.
examples/fusedmoe/bench_example_fusedmoe.py (1)
1-10: LGTM! Benchmark wrapper implemented correctly.Follows the established benchmarking pattern consistently.
examples/convolution/bench_example_convolution.py (1)
1-15: LGTM! Benchmark wrapper with multiple entry points.Both benchmark functions are correctly implemented. The
globals().get("__name__")guard is appropriate given the current bench.py implementation.examples/deepseek_mla/bench_example_mla_decode.py (1)
1-10: LGTM! Clean benchmark wrapper implementation.Correctly implements the benchmarking pattern with appropriate imports and guard.
examples/flash_decoding/bench_example_flash_decoding.py (1)
1-15: LGTM! Multiple benchmarks properly registered.Both
bench_example_gqa_decodeandbench_example_mha_inferenceare correctly implemented following the standard pattern.examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM! Consistent benchmark wrapper implementation.The file correctly integrates three FP8 GEMM examples with the benchmarking framework. Each wrapper function properly delegates to the corresponding module's
mainentry point, and the script guard uses the correctglobals().get("__name__")pattern for compatibility with the exec-based discovery inbench.main().examples/warp_specialize/bench_example_warp_specialize.py (1)
1-25: LGTM! Clean benchmark wrapper for warp specialization examples.The file correctly wraps four warp-specialized GEMM examples with the benchmarking harness. The implementation is consistent with the framework pattern and the entry point guard is correct.
examples/attention_sink/bench_example_attention_sink.py (1)
1-52: LGTM! Comprehensive attention sink benchmarking suite.The file correctly wraps multiple MHA/GQA attention sink variants, including sliding-window configurations. The use of
window_size=128as a keyword argument toprocess_funcis valid and provides useful benchmark variations. The implementation is consistent with the framework design.examples/flash_attention/bench_example_flash_attention.py (1)
1-70: LGTM! Extensive flash attention benchmark coverage.The file provides comprehensive benchmark wrappers for 13 flash attention variants covering GQA/MHA, forward/backward passes, different tensor layouts, WGMMA pipelining, and variable-length sequences. The implementation is consistent and correct.
examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1)
1-10: LGTM! Simple and correct benchmark wrapper.The file correctly wraps the block-sparse GEMM example with the benchmarking framework. The minimal implementation is appropriate for a single benchmark target.
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
1-35: LGTM! Block-sparse attention benchmarks correctly integrated.The file correctly wraps six block-sparse attention variants (Triton and TileLang implementations) with the benchmarking framework. The implementation follows the established pattern and is functionally correct. The
bench.process_funcexception handling will gracefully handle any GPU availability issues.
try:
    with suppress_output():
        for _ in range(warmup):
            func(*args, **kwargs)
except Exception:
    pass
Silently ignoring warmup failures may hide configuration issues.
The try-except-pass during warmup suppresses all exceptions, making it impossible to diagnose setup problems (missing dependencies, incorrect arguments, environment issues). If warmup fails, the actual timing runs will likely fail too, but the root cause will be unclear.
Consider logging warmup failures or at least emitting a warning:
try:
with suppress_output():
for _ in range(warmup):
func(*args, **kwargs)
- except Exception:
- pass
+ except Exception as e:
+ warnings.warn(
+ f"Warmup failed for {func.__module__}: {type(e).__name__}",
+ RuntimeWarning,
+ stacklevel=2,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| with suppress_output(): | |
| for _ in range(warmup): | |
| func(*args, **kwargs) | |
| except Exception: | |
| pass | |
| try: | |
| with suppress_output(): | |
| for _ in range(warmup): | |
| func(*args, **kwargs) | |
| except Exception as e: | |
| warnings.warn( | |
| f"Warmup failed for {func.__module__}: {type(e).__name__}", | |
| RuntimeWarning, | |
| stacklevel=2, | |
| ) |
🧰 Tools
🪛 Ruff (0.14.4)
36-37: try-except-pass detected, consider logging the exception
(S110)
36-36: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 32 to 37, the warmup loop currently
swallows all exceptions with a bare except/pass which hides configuration/setup
errors; change it to catch Exception as e and log a warning or error that
includes the exception details (use the module logger or Python's logging with
exc_info=True or logger.exception) so failures during warmup are visible, but
keep the suppressed output context if desired; do not remove the warmup loop —
just replace the silent pass with a logged message that includes the exception
and contextual info (function name, args) so the root cause can be diagnosed.
tilelang/tools/bench.py
Outdated
def analyze_records(records):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    name_col_width = max(len(r[0]) for r in records)
    safe_width = name_col_width + 20
    print("=" * safe_width)
    print(f"{'Function':<{name_col_width}} | Avg Latency (ms)")
    print("-" * safe_width)
    for name, lat in records:
        print(f"{name:<{name_col_width}} | {lat:>10.4f}")
    print("=" * safe_width)

    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")

    test_file = inspect.getsourcefile(sys._getframe(2))
    out_dir = os.path.dirname(test_file)
    out_path = os.path.join(out_dir, "bench_result.png")

    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()

    print(f"Saved Bar chart to {out_path}")
analyze_records will crash if no benchmarks are recorded.
Line 74 computes max(len(r[0]) for r in records) without checking if records is empty, causing a ValueError when all benchmarks fail or when the function is called with an empty list.
Apply this diff to handle empty records gracefully:
def analyze_records(records):
# Analyze the data and draw a chart
+ if not records:
+ warnings.warn("No benchmark records to analyze", RuntimeWarning, stacklevel=2)
+ return
records.sort(key=lambda x: x[1])
name_col_width = max(len(r[0]) for r in records)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def analyze_records(records):
    # Analyze the data and draw a chart
    if not records:
        warnings.warn("No benchmark records to analyze", RuntimeWarning, stacklevel=2)
        return
    records.sort(key=lambda x: x[1])
    name_col_width = max(len(r[0]) for r in records)
    safe_width = name_col_width + 20
    print("=" * safe_width)
    print(f"{'Function':<{name_col_width}} | Avg Latency (ms)")
    print("-" * safe_width)
    for name, lat in records:
        print(f"{name:<{name_col_width}} | {lat:>10.4f}")
    print("=" * safe_width)
    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    test_file = inspect.getsourcefile(sys._getframe(2))
    out_dir = os.path.dirname(test_file)
    out_path = os.path.join(out_dir, "bench_result.png")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()
    print(f"Saved Bar chart to {out_path}")
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 71 to 98, analyze_records assumes
records is non-empty (uses max(...) and indexing) which raises ValueError for
empty lists; add an early guard at the top like "if not records: print a clear
message (e.g., 'No benchmark records to analyze'), and return" so the function
exits before computing name_col_width, creating plots, or trying to save a file;
keep the rest of the function unchanged so normal behavior proceeds when records
are present.
/perf
Actionable comments posted: 3
♻️ Duplicate comments (4)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
1-11: Add runtime CUDA checks before running sparse_tensorcore benchmarkThis benchmark unconditionally calls
tilelang_example_sparse_tensorcore.main()viaprocess_func, so on machines without CUDA or with insufficient compute capability it will fail with a noisy traceback instead of a clear skip.Consider adding explicit runtime validation before invoking
process_func, e.g.:-import tilelang.tools.bench -import tilelang -import tilelang_example_sparse_tensorcore +import tilelang.tools.bench +import tilelang +import tilelang_example_sparse_tensorcore +import torch @@ -def bench_example_sparse_tensorcore(): - tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main) +def bench_example_sparse_tensorcore(): + if not torch.cuda.is_available(): + print(f"SKIPPED: {__name__}.bench_example_sparse_tensorcore requires CUDA") + return + major, _ = torch.cuda.get_device_capability() + if major < 9: + print( + f"SKIPPED: {__name__}.bench_example_sparse_tensorcore " + "requires compute capability >= 9.0" + ) + return + tilelang.tools.bench.process_func(tilelang_example_sparse_tensorcore.main)This keeps the benchmark harness simple while giving clear feedback when hardware requirements aren’t met.
tilelang/tools/bench.py (3)
33-40: Silent warmup failures can hide configuration issues.The warmup phase suppresses all exceptions without logging. If a benchmark has setup problems (missing dependencies, incorrect arguments, environment issues), users won't know why subsequent runs fail.
This issue was previously flagged in past reviews.
74-76: Missing guard for empty records.Line 76 sorts
recordswithout checking if it's empty. If all benchmarks fail or no benchmarks are registered, subsequent operations (especiallymax(len(r[0]) for r in records)if it were added, or chart generation) could fail.This issue was previously flagged in past reviews.
96-108: Fragile exec namespace relies on bench script workarounds.Line 100 initializes
module = {}for the exec namespace without populating standard module globals like__name__or__file__. This forces all bench scripts to useglobals().get("__name__")instead of the natural__name__idiom. While the workaround functions, fixing the namespace here would make bench scripts more maintainable.This issue was previously flagged in past reviews.
🧹 Nitpick comments (1)
examples/flash_attention/bench_example_flash_attention.py (1)
17-99: Consider adding docstrings for better maintainability.Each benchmark function could benefit from a brief docstring explaining:
- What example it benchmarks
- Why specific parameter values were chosen (e.g., for fair comparison or specific hardware constraints)
- Any notable differences from related benchmarks
Example:
def bench_example_gqa_fwd_bshd():
    """Benchmark GQA forward pass with batch-sequence-head-dimension layout.

    Uses batch=1, seq_len=1024, heads=16 for comparison with pipelined variant.
    """
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
- .github/workflows/pr-perfbench-bot.yml (1 hunks)
- examples/analyze/bench_example_analyze.py (1 hunks)
- examples/attention_sink/bench_example_attention_sink.py (1 hunks)
- examples/blocksparse_attention/bench_example_blocksparse_attention.py (1 hunks)
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py (1 hunks)
- examples/cast/bench_example_cast.py (1 hunks)
- examples/convolution/bench_example_convolution.py (1 hunks)
- examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1 hunks)
- examples/deepseek_mla/bench_example_mla_decode.py (1 hunks)
- examples/deepseek_nsa/bench_example_tilelang_nsa.py (1 hunks)
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py (1 hunks)
- examples/dequantize_gemm/bench_example_dequantize_gemm.py (1 hunks)
- examples/dynamic_shape/bench_example_dynamic.py (1 hunks)
- examples/elementwise/bench_example_elementwise.py (1 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
- examples/flash_decoding/bench_example_flash_decoding.py (1 hunks)
- examples/fusedmoe/bench_example_fusedmoe.py (1 hunks)
- examples/gemm/bench_example_gemm.py (1 hunks)
- examples/gemm_fp8/bench_example_gemm_fp8.py (1 hunks)
- examples/gemm_splitk/bench_example_gemm_splitk.py (1 hunks)
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py (1 hunks)
- examples/gemv/bench_example_gemv.py (1 hunks)
- examples/linear_attention/bench_linear_attn.py (1 hunks)
- examples/minference/bench_vs_sparse_attn.py (1 hunks)
- examples/seer_attention/bench_block_sparse_attn_tilelang.py (1 hunks)
- examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1 hunks)
- examples/topk/bench_topk_tilelang.py (1 hunks)
- examples/warp_specialize/bench_example_warp_specialize.py (1 hunks)
- log.txt (1 hunks)
- maint/scripts/ci_performance.py (1 hunks)
- tilelang/tools/bench.py (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- log.txt
🚧 Files skipped from review as they are similar to previous changes (16)
- examples/flash_decoding/bench_example_flash_decoding.py
- examples/analyze/bench_example_analyze.py
- examples/blocksparse_attention/bench_example_blocksparse_attention.py
- examples/convolution/bench_example_convolution.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/gemv/bench_example_gemv.py
- examples/fusedmoe/bench_example_fusedmoe.py
- examples/warp_specialize/bench_example_warp_specialize.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/gemm/bench_example_gemm.py
- examples/elementwise/bench_example_elementwise.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/topk/bench_topk_tilelang.py
- examples/gemm_splitk/bench_example_gemm_splitk.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
🧰 Additional context used
🧬 Code graph analysis (12)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_mla/bench_example_mla_decode.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/minference/bench_vs_sparse_attn.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/dynamic_shape/bench_example_dynamic.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
maint/scripts/ci_performance.py (2)
tilelang/env.py (1)
disable_cache(275-276)maint/scripts/performance.py (1)
run(22-69)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/dequantize_gemm/bench_example_dequantize_gemm.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/attention_sink/bench_example_attention_sink.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/cast/bench_example_cast.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/linear_attention/bench_linear_attn.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🪛 Ruff (0.14.4)
tilelang/tools/bench.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
51-51: Do not catch blind exception: Exception
(BLE001)
102-102: Use of exec detected
(S102)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Quick Lint
🔇 Additional comments (14)
examples/deepseek_mla/bench_example_mla_decode.py (1)
1-10: Wrapper is consistent with bench harnessThis follows the shared pattern (bench_* wrapper + bench.main entrypoint) and looks correct. Just ensure
example_mla_decode.maindoesn’t require additional arguments (e.g.,argv) beyond whatprocess_funcsupplies.examples/minference/bench_vs_sparse_attn.py (1)
1-10: Bench wrapper looks correct and consistentThe wrapper correctly forwards to
tilelang.tools.bench.process_funcwithargv=[]and wires upbench.main()as a script entrypoint. This is consistent with the pattern used by other example benchmarks.maint/scripts/ci_performance.py (1)
4-7: Handle key mismatches and empty results more robustlyA couple of edge cases can break this script and the CI job:
Assumes identical key sets between v1 and v2.
- Lines 34–36 iterate
data_v1.keys()and directly accessdata_v2[key], which will raiseKeyErrorif a benchmark exists in only one output (e.g., new bench added in PR, or one removed from main). Since the two subprocess calls run different Python environments, their outputs can diverge.Division by zero when v2 latency is zero.
- Line 35 computes
data_v1[key] / data_v2[key]with no guard; ifdata_v2[key] == 0, this raisesZeroDivisionError.Figure width can be zero when there are no valid rows.
- If parsing yields no entries (e.g., bench_all fails or format changes),
tableis empty ⇒dfempty ⇒fig_width = max(0, len(df) * 0.35)gives 0 (Line 48), which is invalid forplt.figure(figsize=(fig_width, 8)).Suggested changes
- Use the intersection of keys and guard against missing or zero entries:
table = [] -for key in data_v1.keys(): +common_keys = sorted(set(data_v1.keys()) & set(data_v2.keys())) +for key in common_keys: + if data_v2[key] == 0: + continue # avoid division by zero speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])
- Short-circuit when
tableis empty:with open("bench.md", "w") as f: f.write( tabulate(table, headers=headers, tablefmt="github", stralign="left", numalign="decimal")) f.write("\n") +if not table: + raise RuntimeError("No benchmark data parsed from bench_all output") + df = pd.DataFrame(table, columns=headers) df = df.sort_values("Speedup", ascending=False).reset_index(drop=True) -fig_width = max(0, len(df) * 0.35) +fig_width = max(4, len(df) * 0.35)Likely an incorrect or invalid review comment.
examples/dynamic_shape/bench_example_dynamic.py (1)
1-10: LGTM! Standard benchmark wrapper pattern.The file correctly follows the established benchmarking pattern: imports the bench framework, defines a
bench_*function delegating toprocess_func, and uses theglobals().get("__name__")guard for the main entry point.examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
1-10: LGTM! Clean benchmark wrapper.The implementation follows the standard pattern correctly. Note that the past review comment about missing
tilelang.testingimport appears to no longer apply, as the current code does not use any decorators.examples/cast/bench_example_cast.py (1)
1-21: LGTM! Benchmark wrappers are well-structured.Both benchmark functions properly delegate to their respective example modules with appropriate arguments. The implementation is clean and follows the established pattern.
examples/linear_attention/bench_linear_attn.py (1)
1-15: LGTM! Forward and backward benchmark wrappers.The file correctly implements benchmark wrappers for both forward and backward linear attention examples. The past review concern about missing imports does not apply to the current code.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM! Comprehensive FP8 GEMM benchmark coverage.The file provides benchmark wrappers for three FP8 GEMM variants (2xAcc, intrinsic, and standard), properly integrating them with the benchmarking framework.
examples/dequantize_gemm/bench_example_dequantize_gemm.py (1)
1-35: LGTM! Extensive dequantize GEMM benchmark suite.The file properly implements benchmark wrappers for six different dequantize GEMM variants, covering FP4, MxFP4, W4A8, and TMA variants. The implementation is consistent and correct.
tilelang/tools/bench.py (1)
111-144: LGTM! The bench_all() discovery mechanism is well-designed.
The function properly walks the examples directory, dynamically loads modules, discovers bench_* functions, and executes them. The sys.path management ensures imports resolve correctly.
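As a point of reference, the discovery pattern described above fits in a few lines. This is an illustrative sketch (the helper name and argument are invented), not the actual bench_all() implementation:

import importlib.util
import os
import sys

def discover_and_run(examples_root):
    for dirpath, _, filenames in os.walk(examples_root):
        for filename in sorted(filenames):
            if not (filename.startswith("bench_") and filename.endswith(".py")):
                continue
            path = os.path.join(dirpath, filename)
            sys.path.insert(0, dirpath)  # let the bench file import its sibling example modules
            try:
                spec = importlib.util.spec_from_file_location(filename[:-3], path)
                module = importlib.util.module_from_spec(spec)
                spec.loader.exec_module(module)
                for name in dir(module):
                    fn = getattr(module, name)
                    if name.startswith("bench_") and callable(fn):
                        fn()
            finally:
                sys.path.pop(0)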
1-52: LGTM! Comprehensive attention sink benchmark suite.The file provides excellent coverage of attention sink benchmarks, including MHA/GQA variants, forward/backward passes, different implementations (wgmma_pipelined), and sliding window configurations. The implementation correctly uses the benchmarking framework and the
globals().get("__name__")pattern.examples/flash_attention/bench_example_flash_attention.py (3)
1-14: LGTM! Imports are clean and complete.All required modules are imported and match the benchmark functions defined below.
56-77: Note: Identical parameters used for pipelined vs non-pipelined GQA benchmarks.Both
bench_example_gqa_fwd_bshd_wgmma_pipelinedandbench_example_gqa_fwd_bshduse identical parameters (batch=1, heads=16, seq_len=1024, dim=128, is_causal=False, groups=16, tune=False). This is likely intentional for fair comparison between implementations, but please confirm this is the desired behavior.
102-103: LGTM! Main guard correctly handles exec context.The
globals().get("__name__")pattern properly handles both direct script execution and the benchmark harness's exec-based replay mechanism.
- name: Read markdown table
  id: read_md
  run: |
    echo "content<<EOF" >> $GITHUB_OUTPUT
    cat bench.md >> $GITHUB_OUTPUT
    echo "EOF" >> $GITHUB_OUTPUT
- name: Upload PNG to GitHub and get URL
🧩 Analysis chain
Image upload step will not produce a valid embeddable URL
There are a couple of correctness issues in the new PNG upload/comment flow that will break the embedded image:
-
Filename mismatch between tree entry and URL (definite bug).
- Tree entry path:
perf_plot_${context.runId}.png(Line 101). - Raw URL path:
bench_${context.runId}.png(Line 108).
The URL points to a file that does not exist in the created tree, so the image link in the PR comment will 404.
- Tree entry path:
-
Insufficient token permissions for git APIs.
- Top-level workflow permissions set
contents: readonly, butgit.createBlob/git.createTreeneed write-level repo contents permissions. As written, the upload step will likely fail with a 403.
- Top-level workflow permissions set
-
ci_performance.py is executed from
main, not from the PR merge commit.- You
checkoutrefs/pull/.../merge, then install the merged version, but latergit checkout mainand never switch back before runningpython maint/scripts/ci_performance.py. That means the script version used at runtime is the one frommain, not the PR. The new bench.md/bench.png generation introduced in this PR won’t be exercised for open-PR/perfruns.
- You
Suggested fixes
- Make the tree path and URL consistent (choose one naming scheme). For example:
- const tree = await github.rest.git.createTree({
+ const tree = await github.rest.git.createTree({
owner: context.repo.owner,
repo: context.repo.repo,
tree: [{
- path: `perf_plot_${context.runId}.png`,
+ path: `bench_${context.runId}.png`,
mode: '100644',
type: 'blob',
sha: blob.data.sha
}]
});
- const url = `https://raw.githubusercontent.com/${context.repo.owner}/${context.repo.repo}/${tree.data.sha}/bench_${context.runId}.png`
+ const url = `https://raw.githubusercontent.com/${context.repo.owner}/${context.repo.repo}/${tree.data.sha}/bench_${context.runId}.png`- Relax or adjust workflow permissions so the job can create blobs/trees, e.g.:
permissions:
- contents: read
+ contents: write- Ensure
ci_performance.pyruns from the PR merge commit so the new bench.md/bench.png logic is used. For instance:- Move
git checkout mainearlier and then re-checkoutrefs/pull/.../mergebefore the “Run performance test” step, or - Run the script with
./tll/bin/pythonand import it from the installed package instead of the working tree, or - After “Install original version”, explicitly
git checkout refs/pull/${{ github.event.issue.number }}/mergebefore step “Run performance test”.
- Move
Also applies to: 81-109, 116-128
🏁 Script executed:
#!/bin/bash
# Locate and display the workflow file with line numbers
if [ -f ".github/workflows/pr-perfbench-bot.yml" ]; then
echo "=== Workflow file found ==="
cat -n ".github/workflows/pr-perfbench-bot.yml" | head -150
else
echo "Workflow file not found at expected path"
find . -name "pr-perfbench-bot.yml" -type f 2>/dev/null
fiLength of output: 5204
🏁 Script executed:
#!/bin/bash
# Search for git.createBlob and git.createTree calls to understand the upload step
rg -n "createBlob|createTree|perf_plot|bench_\$\{context" ".github/workflows/" -A 2 -B 2Length of output: 1189
🏁 Script executed:
#!/bin/bash
# Check for permissions block and git checkout sequences
rg -n "permissions:|git checkout|ci_performance.py" ".github/workflows/pr-perfbench-bot.yml" -B 1 -A 1Length of output: 273
Image upload step will not produce a valid embeddable URL and will fail with permissions error
All three correctness issues in the new PNG upload/comment flow are confirmed and must be fixed:
-
Filename mismatch between tree entry and URL (confirmed bug).
- Line 101: Tree entry path uses
perf_plot_${context.runId}.png - Line 108: Raw URL references
bench_${context.runId}.png - The URL points to a file that does not exist in the created tree, causing a 404 on the embedded image.
- Line 101: Tree entry path uses
-
Insufficient token permissions for git APIs (confirmed bug).
- Line 8-9: Workflow permissions set
contents: readonly - Lines 90 and 97: Calls to
github.rest.git.createBlobandgithub.rest.git.createTreerequire write-level permissions - The upload step will fail with a 403 error.
- Line 8-9: Workflow permissions set
-
ci_performance.py executes from
main, not the PR merge commit (confirmed bug).- Line 64:
git checkout mainswitches to main branch - Lines 73-74: Script runs from the main branch working directory via the
tlvenv - The new bench.md/bench.png generation logic introduced in this PR will not be exercised for
/perfruns.
- Line 64:
Required fixes:
-
Make the tree path and URL consistent (both should use
bench_prefix):- path: `perf_plot_${context.runId}.png`, + path: `bench_${context.runId}.png`,
-
Upgrade workflow permissions to allow blob/tree creation:
permissions: - contents: read + contents: write
-
Ensure
ci_performance.pyruns from the PR merge commit: after line 68, addgit checkout refs/pull/${{ github.event.issue.number }}/mergebefore the "Run performance test" step.
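A sketch of that third fix as a workflow step; the step name and exact placement are illustrative, and the fetch/checkout pair avoids assuming the merge ref is still available locally:

      - name: Re-checkout PR merge commit
        run: |
          git fetch origin refs/pull/${{ github.event.issue.number }}/merge
          git checkout FETCH_HEAD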
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
.github/workflows/pr-perfbench-bot.yml lines 75-81 and surrounding steps: the
image upload flow has three confirmed issues — the tree entry filename and the
constructed raw URL are inconsistent (use perf_plot_ vs bench_), the workflow
permissions only grant contents: read while createBlob/createTree need write,
and the perf script is run from main instead of the PR merge commit; to fix,
make the filename used when creating the blob/tree and the URL construction use
the same bench_${{ github.run_id }} (or context.runId) prefix, update workflow
permissions to grant contents: write for the job (or at least for the step that
calls git APIs), and before the "Run performance test" step add a checkout to
refs/pull/${{ github.event.issue.number }}/merge so ci_performance.py runs
against the PR merge commit.
def analyze_records(records, out_dir):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    headers = ["Functions", "Avg Latency (ms)"]
    print(
        tabulate(_RECORDS, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))

    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    out_path = os.path.join(out_dir, "bench_result.png")

    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()

    print(f"Saved Bar chart to {out_path}")
Inconsistent use of parameter vs global variable.
The function receives records as a parameter and sorts it (Line 76), but Line 79 prints _RECORDS (the global) instead of the records parameter. Lines 81-82 correctly use records. This inconsistency could cause the printed table to differ from the chart when records ≠ _RECORDS.
Apply this diff to use the parameter consistently:
records.sort(key=lambda x: x[1])
headers = ["Functions", "Avg Latency (ms)"]
print(
- tabulate(_RECORDS, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))
+ tabulate(records, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
def analyze_records(records, out_dir):
    # Analyze the data and draw a chart
    records.sort(key=lambda x: x[1])
    headers = ["Functions", "Avg Latency (ms)"]
    print(
        tabulate(records, headers=headers, tablefmt="github", stralign="left", numalign="decimal"))
    names = [r[0] for r in records]
    lats = [r[1] for r in records]
    plt.figure(figsize=(max(len(names) * 2.2, 6), 6))
    plt.bar(names, lats)
    plt.xlabel("Latency (ms)")
    plt.title("Benchmark Results")
    out_path = os.path.join(out_dir, "bench_result.png")
    plt.tight_layout()
    plt.savefig(out_path, dpi=200)
    plt.close()
    print(f"Saved Bar chart to {out_path}")
🤖 Prompt for AI Agents
In tilelang/tools/bench.py around lines 74 to 94, the tabulate call prints the
global _RECORDS instead of the function parameter records, causing a mismatch
between the printed table and the plotted chart; update the tabulate call to use
the local records variable (i.e., replace _RECORDS with records) so the printed
table and chart reflect the same input, keeping the existing sort on records and
other logic unchanged.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/pr-perfbench-bot.yml (1)
59-74: Run ci_performance.py from the PR branch venv, not the main branch venv.
This workflow will fail because it runs ci_performance.py from the main branch (line 64: git checkout main, line 73: source tl/bin/activate), but the main branch version doesn't generate bench.md or bench.png. The updated ci_performance.py in the PR branch calls tilelang.tools.bench.bench_all() and writes these files, whereas the main branch version runs performance.py and only prints metrics. When the workflow reaches line 75 ("Read markdown table"), the bench.md file won't exist, causing the workflow to fail.
Change line 73 from source tl/bin/activate to source tll/bin/activate so the performance script runs from the PR branch installation, which includes the updated benchmarking code.
🧹 Nitpick comments (1)
examples/flash_attention/bench_example_flash_attention.py (1)
1-108: Consider adding a module-level docstring.This file serves as a benchmark aggregator for flash attention examples. Adding a brief module-level docstring would improve maintainability and help developers understand its purpose, especially since this is part of a broader benchmarking framework across the project.
For example:
"""Benchmark aggregator for flash attention examples. This module registers benchmark functions for various flash attention implementations (GQA, MHA, forward, backward, etc.) that can be executed via tilelang.tools.bench.main(). """ import tilelang.tools.bench # ... rest of imports
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .github/workflows/pr-perfbench-bot.yml (2 hunks)
- examples/flash_attention/bench_example_flash_attention.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
Applied to files:
.github/workflows/pr-perfbench-bot.yml
🧬 Code graph analysis (1)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🔇 Additional comments (3)
.github/workflows/pr-perfbench-bot.yml (2)
8-9: ✓ Permissions upgraded correctlyLine 9 now grants
contents: write, which is necessary forgithub.rest.git.createBlobandgithub.rest.git.createTreecalls in the "Upload PNG" step (lines 90, 97). This resolves the earlier 403 error issue.
101-101: ✓ Filename consistency fixedBoth the tree entry path (line 101) and the raw URL construction (line 108) now use the same
bench_${context.runId}.pngprefix, resolving the earlier 404 mismatch issue.Also applies to: 108-108
examples/flash_attention/bench_example_flash_attention.py (1)
1-14: All imported example modules exist and are correctly referenced.Verification confirms that all 13 imported example modules exist in the
examples/flash_attentiondirectory and are correctly imported in the file. The import statements are valid and will resolve properly at runtime.
def bench_example_mha_bwd_bshd():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bshd.main,
        BATCH=1,
        H=16,
        N_CTX=512,
        D_HEAD=64,
        causal=False,
    )

def bench_example_mha_bwd_bhsd():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bhsd.main,
        BATCH=1,
        H=16,
        N_CTX=512,
        D_HEAD=64,
        causal=False,
    )

def bench_example_mha_bwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_mha_bwd_bshd_wgmma_pipelined.main,
        BATCH=1,
        H=32,
        N_CTX=256,
        D_HEAD=64,
        causal=False)

def bench_example_gqa_fwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd_wgmma_pipelined.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)

def bench_example_gqa_fwd_bshd():
    tilelang.tools.bench.process_func(
        example_gqa_fwd_bshd.main,
        batch=1,
        heads=16,
        seq_len=1024,
        dim=128,
        is_causal=False,
        groups=16,
        tune=False)

def bench_example_mha_fwd_bhsd_wgmma_pipelined():
    tilelang.tools.bench.process_func(example_mha_fwd_bhsd_wgmma_pipelined.main)

def bench_example_mha_fwd_bhsd():
    tilelang.tools.bench.process_func(example_mha_fwd_bhsd.main)

def bench_example_mha_fwd_bshd_wgmma_pipelined():
    tilelang.tools.bench.process_func(
        example_mha_fwd_bshd_wgmma_pipelined.main, batch=1, heads=32, seq_len=256)

def bench_example_mha_fwd_bshd():
    tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)

def bench_example_mha_fwd_varlen():
    tilelang.tools.bench.process_func(
        example_mha_fwd_varlen.main, batch=4, heads=16, seq_len=512, dim=64)
🧩 Analysis chain
Verify parameter names match the example module signatures.
The benchmark functions use inconsistent parameter naming conventions:
- MHA backward benchmarks (lines 29-58): UPPERCASE parameters (
BATCH,H,N_CTX,D_HEAD,causal) - GQA/MHA forward benchmarks (lines 61-104): lowercase parameters (
batch,heads,seq_len,dim, etc.)
This inconsistency was flagged in a previous review. Ensure these parameter names match the actual function signatures in each example module to avoid TypeError exceptions at runtime.
Run this script to verify parameter compatibility:
🏁 Script executed:
```shell
#!/bin/bash
# Description: Extract main() function signatures from example modules to verify parameter names

echo "=== MHA Backward Example Signatures (expecting UPPERCASE) ==="
for file in example_mha_bwd_bshd.py example_mha_bwd_bhsd.py example_mha_bwd_bshd_wgmma_pipelined.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done

echo -e "\n=== GQA Forward Example Signatures (expecting lowercase) ==="
for file in example_gqa_fwd_bshd.py example_gqa_fwd_bshd_wgmma_pipelined.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done

echo -e "\n=== MHA Forward Example Signatures (expecting lowercase) ==="
for file in example_mha_fwd_bhsd.py example_mha_fwd_bhsd_wgmma_pipelined.py example_mha_fwd_bshd.py example_mha_fwd_bshd_wgmma_pipelined.py example_mha_fwd_varlen.py; do
  echo "--- $file ---"
  ast-grep --pattern 'def main($$$): $$$' examples/flash_attention/$file | head -20
done
```

Length of output: 21933
I need to search for the `process_func()` implementation to understand how it processes the keyword arguments against these tensor-based function signatures. The web search didn't provide the specific implementation of `process_func`, so let me search the codebase directly for it:
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for process_func implementation in the tilelang tools module
rg -n "def process_func" --type=py -A 20
Length of output: 1300
Fix parameter mismatches between benchmark calls and example function signatures—multiple functions will fail at runtime.
The benchmark functions have critical parameter mismatches that will cause TypeError exceptions when called:
- GQA forward benchmarks (lines 61-82): Passing `batch`, `heads`, `seq_len`, `dim`, `is_causal`, `groups`, `tune`, but the actual signatures expect positional tensor arguments `Q`, `K`, `V`, `Output`
- MHA forward bhsd benchmarks (lines 93-99): Passing no arguments, but the actual signatures expect `Q`, `K`, `V`, `Output` tensor arguments
- MHA forward bshd benchmarks (lines 100-102): Passing `batch`, `seq_len`, but the actual signatures expect `Q`, `K`, `V`, `Output` tensor arguments
- MHA forward varlen benchmark (lines 104-105): Passing `batch`, `heads`, `seq_len`, `dim`, but the actual signature expects `Q_unpad`, `K_unpad`, `V_unpad`, `cu_seqlens_q`, `cu_seqlens_k`, `max_seqlen_q`, `Output_unpad`

The MHA backward benchmarks (lines 29-58) correctly match their function signatures with `BATCH`, `H`, `N_CTX`, `D_HEAD`, `causal` parameters.
🤖 Prompt for AI Agents
examples/flash_attention/bench_example_flash_attention.py lines 29-104: several
benchmark wrappers call example_* .main with keyword params that do not match
the example functions' signatures (GQA and MHA forward variants expect tensor
arguments like Q,K,V,Output or unpadded variants), causing runtime TypeError;
fix by replacing direct calls with small wrapper functions (or lambdas) that
construct and pass dummy tensors/sequence metadata with the correct shapes/types
expected by each example.main (e.g., for GQA/MHA forward build random
Q,K,V,Output tensors with matching batch/heads/seq_len/dim or for varlen build
Q_unpad/K_unpad/V_unpad plus cu_seqlens and max_seqlen), then pass those
wrappers into tilelang.tools.bench.process_func so the signatures align and
benchmarks run without TypeError.
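
For illustration, a minimal sketch of the wrapper approach described above. The tensor shapes, dtypes, BSHD layout, and the positional `(Q, K, V, Output)` call are assumptions taken from the analysis, not the example's confirmed API:

```python
import torch

import tilelang.tools.bench
import example_mha_fwd_bshd


def bench_example_mha_fwd_bshd():
    batch, heads, seq_len, dim = 1, 32, 256, 64
    # Dummy inputs for the assumed (Q, K, V, Output) signature, BSHD layout.
    q = torch.randn(batch, seq_len, heads, dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = torch.empty_like(q)

    def run():
        # Adapter: forwards the pre-built tensors so process_func can call it with no kwargs.
        example_mha_fwd_bshd.main(q, k, v, out)

    tilelang.tools.bench.process_func(run)
```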
|
/perf |
Actionable comments posted: 3
♻️ Duplicate comments (8)
.github/workflows/pr-perfbench-bot.yml (2)
59-74: Critical: Performance script executes from main branch, not PR merge commit. The workflow installs and runs
`ci_performance.py` from the main branch (Lines 64, 73-74), which means the new bench.md/bench.png generation logic introduced in this PR will not be exercised during `/perf` command runs. The workflow checks out the PR merge commit (Line 38), installs it into the `tl` venv (Lines 52-57), but then switches to main (Line 64) and runs the performance script from the `tl` venv. To fix, ensure
`ci_performance.py` runs from the PR merge commit. Add after Line 68:

```diff
       pip install .
+
+      - name: Switch back to PR merge commit
+        run: |
+          git checkout refs/pull/${{ github.event.issue.number }}/merge
```
81-109: Filename consistency issue resolved. The tree entry path (Line 101) and the raw URL construction (Line 108) now both use
`bench_${context.runId}.png`, fixing the previous mismatch that would have caused 404 errors on embedded images.

examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
6-7: CUDA requirement validation concern acknowledged. The benchmark delegates to a CUDA-requiring module without runtime checks. When executed via
`bench.main()`, pytest decorators are bypassed. However, this is a design limitation of the broader benchmarking framework that affects all CUDA-dependent benchmarks uniformly, not specific to this file.

examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
1-7: Add missing CUDA decorators and import. The past review comment remains unresolved. This file still lacks
`import tilelang.testing` and `@tilelang.testing.requires_cuda` decorators on all six benchmark functions (lines 10, 14, 18, 22, 28, 41), which call GPU-exclusive kernels.
tilelang/tools/bench.py (4)
100-102: Critical: exec namespace breaks__name__guard in bench scripts.This past review comment remains unresolved. The empty
module = {}dict causesNameErrorwhen bench scripts containif __name__ == "__main__":guards becauseexec()won't populate__name__,__file__, or other module-level variables.
35-40: Silently ignoring warmup failures hides configuration issues.This past review comment remains unresolved. The
try-except-passduring warmup suppresses all exceptions, making it impossible to diagnose setup problems (missing dependencies, incorrect arguments, environment issues).
79-79: Inconsistent use of parameter vs global variable.This past review comment remains unresolved. Line 79 prints
_RECORDS(the global) instead of therecordsparameter, causing the printed table to differ from the chart when the parameter differs from the global.
74-93: Handle empty records gracefully.While the explicit
max(len(r[0]) for r in records)mentioned in past reviews appears to have been removed,analyze_recordsshould still guard against empty records to prevent plotting errors and provide clearer feedback.Apply this diff:
def analyze_records(records, out_dir): # Analyze the data and draw a chart + if not records: + print("No benchmark records to analyze") + return records.sort(key=lambda x: x[1])
🧹 Nitpick comments (3)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
2-2: Remove unused import.The
tilelangmodule is imported but never used in this file.import tilelang.tools.bench -import tilelang import tilelang_example_sparse_tensorcoremaint/scripts/ci_performance.py (2)
46-49: Potential edge case: zero or negative figure width.Line 48 calculates
fig_width = max(0, len(df) * 0.35), which could be 0 if the DataFrame is empty or very small. A zero-width figure would fail or produce an unusable plot.Consider setting a minimum width:
-fig_width = max(0, len(df) * 0.35) +fig_width = max(6, len(df) * 0.35) plt.figure(figsize=(fig_width, 8))
31-37: Potential KeyError if benchmark keys differ between versions.Line 34 iterates over
data_v1.keys()and directly accessesdata_v2[key]without checking if the key exists. If a benchmark is added or removed between versions, this will raise aKeyError.Add a check to handle missing keys:
table = [] for key in data_v1.keys(): + if key not in data_v2: + continue speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])Or use a symmetric approach that handles keys in either version:
table = [] all_keys = set(data_v1.keys()) | set(data_v2.keys()) for key in all_keys: if key in data_v1 and key in data_v2: speedup = data_v1[key] / data_v2[key] table.append([key, data_v1[key], data_v2[key], speedup])
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (31)
.github/workflows/pr-perfbench-bot.yml(2 hunks)examples/analyze/bench_example_analyze.py(1 hunks)examples/attention_sink/bench_example_attention_sink.py(1 hunks)examples/blocksparse_attention/bench_example_blocksparse_attention.py(1 hunks)examples/blocksparse_gemm/bench_example_blocksparse_gemm.py(1 hunks)examples/cast/bench_example_cast.py(1 hunks)examples/convolution/bench_example_convolution.py(1 hunks)examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py(1 hunks)examples/deepseek_mla/bench_example_mla_decode.py(1 hunks)examples/deepseek_nsa/bench_example_tilelang_nsa.py(1 hunks)examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py(1 hunks)examples/dequantize_gemm/bench_example_dequantize_gemm.py(1 hunks)examples/dynamic_shape/bench_example_dynamic.py(1 hunks)examples/elementwise/bench_example_elementwise.py(1 hunks)examples/flash_attention/bench_example_flash_attention.py(1 hunks)examples/flash_decoding/bench_example_flash_decoding.py(1 hunks)examples/fusedmoe/bench_example_fusedmoe.py(1 hunks)examples/gemm/bench_example_gemm.py(1 hunks)examples/gemm_fp8/bench_example_gemm_fp8.py(1 hunks)examples/gemm_splitk/bench_example_gemm_splitk.py(1 hunks)examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py(1 hunks)examples/gemv/bench_example_gemv.py(1 hunks)examples/linear_attention/bench_linear_attn.py(1 hunks)examples/minference/bench_vs_sparse_attn.py(1 hunks)examples/seer_attention/bench_block_sparse_attn_tilelang.py(1 hunks)examples/sparse_tensorcore/bench_example_sparse_tensorcore.py(1 hunks)examples/topk/bench_topk_tilelang.py(1 hunks)examples/warp_specialize/bench_example_warp_specialize.py(1 hunks)log.txt(1 hunks)maint/scripts/ci_performance.py(1 hunks)tilelang/tools/bench.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (19)
- examples/flash_decoding/bench_example_flash_decoding.py
- examples/seer_attention/bench_block_sparse_attn_tilelang.py
- examples/elementwise/bench_example_elementwise.py
- examples/linear_attention/bench_linear_attn.py
- examples/fusedmoe/bench_example_fusedmoe.py
- log.txt
- examples/blocksparse_gemm/bench_example_blocksparse_gemm.py
- examples/dequantize_gemm/bench_example_dequantize_gemm.py
- examples/gemm_streamk/bench_example_tilelang_gemm_splitk.py
- examples/deepseek_nsa/bench_example_tilelang_nsa.py
- examples/deepseek_mla/bench_example_mla_decode.py
- examples/minference/bench_vs_sparse_attn.py
- examples/deepseek_v32/bench_tilelang_example_deepseek_v32.py
- examples/convolution/bench_example_convolution.py
- examples/dynamic_shape/bench_example_dynamic.py
- examples/analyze/bench_example_analyze.py
- examples/attention_sink/bench_example_attention_sink.py
- examples/cast/bench_example_cast.py
- examples/gemm/bench_example_gemm.py
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-10T13:29:29.347Z
Learnt from: XuehaiPan
Repo: tile-ai/tilelang PR: 973
File: .github/workflows/ci.yml:13-15
Timestamp: 2025-10-10T13:29:29.347Z
Learning: In .github/workflows/ci.yml for tilelang (GitHub Actions), actions/cachev4 and setup-python’s cache feature require GITHUB_TOKEN with actions: write to save caches; with a permissions block that only sets contents: read, unspecified actions permission becomes none, so caches will restore but not save.
Applied to files:
.github/workflows/pr-perfbench-bot.yml
🧬 Code graph analysis (10)
examples/topk/bench_topk_tilelang.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/warp_specialize/bench_example_warp_specialize.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/gemv/bench_example_gemv.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
maint/scripts/ci_performance.py (2)
tilelang/env.py (1)
disable_cache(275-276)maint/scripts/performance.py (1)
run(22-69)
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/gemm_splitk/bench_example_gemm_splitk.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/sparse_tensorcore/bench_example_sparse_tensorcore.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
examples/flash_attention/bench_example_flash_attention.py (1)
tilelang/tools/bench.py (2)
process_func(33-71)main(96-108)
🪛 Ruff (0.14.5)
tilelang/tools/bench.py
39-40: try-except-pass detected, consider logging the exception
(S110)
39-39: Do not catch blind exception: Exception
(BLE001)
51-51: Do not catch blind exception: Exception
(BLE001)
102-102: Use of exec detected
(S102)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Test for Python 3.12 with ROCm-6.3 (on self-hosted-amd)
- GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
- GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
🔇 Additional comments (12)
examples/deepseek_deepgemm/bench_example_deepgemm_fp8_2xAcc.py (1)
1-10: LGTM!The benchmark wrapper follows the established pattern correctly, delegating to
tilelang.tools.bench.process_funcand using the standard main guard..github/workflows/pr-perfbench-bot.yml (1)
8-9: Permissions upgrade successfully addresses blob/tree creation requirements.The change from
contents: readtocontents: writecorrectly enables the GitHub API calls at Lines 90 and 97 (createBlob,createTree).Based on learnings.
examples/topk/bench_topk_tilelang.py (1)
1-10: LGTM!The benchmark wrapper correctly follows the established pattern with appropriate imports and delegation to the bench framework.
maint/scripts/ci_performance.py (2)
4-9: LGTM!The new imports and cache disabling integrate cleanly with the benchmarking framework.
22-29: Thebench_all()function is correctly implemented and exists intilelang.tools.bench.Verification confirms the function is defined at line 111 of
tilelang/tools/bench.pyand is properly exported for use by the CI script.examples/gemm_splitk/bench_example_gemm_splitk.py (1)
1-15: LGTM!The benchmark wrapper correctly registers two split-k GEMM variants following the established pattern.
examples/gemm_fp8/bench_example_gemm_fp8.py (1)
1-20: LGTM!The benchmark wrapper correctly registers three FP8 GEMM variants following the established pattern, with clear naming that distinguishes the 2xAcc, intrinsic, and base implementations.
examples/gemv/bench_example_gemv.py (1)
5-6: Clarify confusingdo_benchparameter logic.The
do_bench=Falseparameter is indeed unique among all benchmark scripts and is accepted byexample_gemv.main. However, the parameter logic appears inverted: the condition at line 362 (if not do_bench:) runs extended benchmarking whendo_bench=False, which contradicts the parameter name's intuitive meaning.Confirm whether this inverted logic is intentional or if the condition should be
if do_bench:instead.examples/warp_specialize/bench_example_warp_specialize.py (1)
28-29: LGTM!Correctly uses
globals().get("__name__")to avoidNameErrorwhentilelang.tools.bench.main()re-executes this file viaexec().examples/blocksparse_attention/bench_example_blocksparse_attention.py (1)
54-55: LGTM!Correctly uses
globals().get("__name__")pattern.tilelang/tools/bench.py (1)
111-144: LGTM!The
bench_all()function correctly:
- Uses
importlib.utilfor proper module loading (preserves__file__, imports)- Walks the examples directory to discover bench scripts
- Adds directories to
sys.pathbefore loading modules- Discovers and invokes all
bench_*functionsexamples/flash_attention/bench_example_flash_attention.py (1)
107-108: LGTM!Correctly uses
globals().get("__name__")pattern.
| import tilelang.tools.bench
| import example_gqa_bwd
| import example_gqa_bwd_tma_reduce_varlen
| import example_gqa_bwd_wgmma_pipelined
| import example_gqa_fwd_bshd
| import example_gqa_fwd_bshd_wgmma_pipelined
| import example_mha_bwd_bshd
| import example_mha_bwd_bhsd
| import example_mha_bwd_bshd_wgmma_pipelined
| import example_mha_fwd_bhsd
| import example_mha_fwd_bhsd_wgmma_pipelined
| import example_mha_fwd_bshd
| import example_mha_fwd_bshd_wgmma_pipelined
| import example_mha_fwd_varlen
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Verify CUDA requirements for flash attention benchmarks.
All 13 benchmark functions invoke flash attention kernels (GQA and MHA variants), which typically require GPU execution. Similar to bench_example_blocksparse_attention.py, this file may need import tilelang.testing and @tilelang.testing.requires_cuda decorators on each benchmark function. Confirm whether these benchmarks can run on CPU or if decorators should be added.
Also applies to: 17-105
🏁 Script executed:
```shell
#!/bin/bash
# Check if flash attention examples have CUDA dependencies
echo "=== Checking for CUDA/GPU usage in flash attention modules ==="
for module in example_gqa_bwd example_gqa_bwd_tma_reduce_varlen example_gqa_bwd_wgmma_pipelined example_gqa_fwd_bshd example_gqa_fwd_bshd_wgmma_pipelined example_mha_bwd_bshd example_mha_bwd_bhsd example_mha_bwd_bshd_wgmma_pipelined example_mha_fwd_bhsd example_mha_fwd_bhsd_wgmma_pipelined example_mha_fwd_bshd example_mha_fwd_bshd_wgmma_pipelined example_mha_fwd_varlen; do
  echo "--- Module: ${module}.py ---"
  fd "${module}.py" examples/flash_attention --exec rg -nC2 "@tilelang.jit|@triton.jit|torch.*cuda|\.cuda\(\)" {} \; 2>/dev/null | head -20
done
```

Length of output: 4188
Now I need to examine the benchmark file directly to understand the structure and compare with similar files:
Benchmarks require CUDA decorators to prevent test failures on non-GPU systems.
All 13 benchmark functions invoke GPU-only flash attention kernels decorated with @tilelang.jit. These kernels will fail on CPU-only systems. Add import tilelang.testing and decorate each benchmark function with @tilelang.testing.requires_cuda to skip tests when CUDA is unavailable, following the pattern used in bench_example_blocksparse_attention.py.
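
For illustration, a minimal sketch of the suggested decorator pattern; the wrapped body mirrors one of the existing bench_* wrappers, and nothing here is the file's current code:

```python
import tilelang.testing
import tilelang.tools.bench
import example_mha_fwd_bshd


@tilelang.testing.requires_cuda
def bench_example_mha_fwd_bshd():
    # Skipped automatically on hosts without CUDA; unchanged otherwise.
    tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)
```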
🤖 Prompt for AI Agents
In examples/flash_attention/bench_example_flash_attention.py around lines 1 to
14, the benchmark module imports GPU-only kernels but lacks CUDA availability
checks; add "import tilelang.testing" at top and annotate each benchmark
function with "@tilelang.testing.requires_cuda" so tests are skipped on systems
without CUDA, mirroring bench_example_blocksparse_attention.py; ensure the
decorator is applied to all 13 benchmark functions that call @tilelang.jit
kernels.
| def bench_example_gqa_bwd_tma_reduce_varlen():
|     tilelang.tools.bench.process_func(example_gqa_bwd_tma_reduce_varlen.main)
|
|
| def bench_example_gqa_bwd():
|     tilelang.tools.bench.process_func(example_gqa_bwd.main)
|
|
| def bench_example_gqa_bwd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(example_gqa_bwd_wgmma_pipelined.main)
|
|
| def bench_example_mha_bwd_bshd():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bshd.main,
|         BATCH=1,
|         H=16,
|         N_CTX=512,
|         D_HEAD=64,
|         causal=False,
|     )
|
|
| def bench_example_mha_bwd_bhsd():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bhsd.main,
|         BATCH=1,
|         H=16,
|         N_CTX=512,
|         D_HEAD=64,
|         causal=False,
|     )
|
|
| def bench_example_mha_bwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_mha_bwd_bshd_wgmma_pipelined.main,
|         BATCH=1,
|         H=32,
|         N_CTX=256,
|         D_HEAD=64,
|         causal=False)
|
|
| def bench_example_gqa_fwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_gqa_fwd_bshd_wgmma_pipelined.main,
|         batch=1,
|         heads=16,
|         seq_len=1024,
|         dim=128,
|         is_causal=False,
|         groups=16,
|         tune=False)
|
|
| def bench_example_gqa_fwd_bshd():
|     tilelang.tools.bench.process_func(
|         example_gqa_fwd_bshd.main,
|         batch=1,
|         heads=16,
|         seq_len=1024,
|         dim=128,
|         is_causal=False,
|         groups=16,
|         tune=False)
|
|
| def bench_example_mha_fwd_bhsd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(example_mha_fwd_bhsd_wgmma_pipelined.main)
|
|
| def bench_example_mha_fwd_bhsd():
|     tilelang.tools.bench.process_func(example_mha_fwd_bhsd.main)
|
|
| def bench_example_mha_fwd_bshd_wgmma_pipelined():
|     tilelang.tools.bench.process_func(
|         example_mha_fwd_bshd_wgmma_pipelined.main, batch=1, heads=32, seq_len=256)
|
|
| def bench_example_mha_fwd_bshd():
|     tilelang.tools.bench.process_func(example_mha_fwd_bshd.main, batch=1, seq_len=256)
|
|
| def bench_example_mha_fwd_varlen():
|     tilelang.tools.bench.process_func(
|         example_mha_fwd_varlen.main, batch=4, heads=16, seq_len=512, dim=64)
🧩 Analysis chain
Verify parameter compatibility with example function signatures.
Past reviews flagged critical parameter mismatches between benchmark calls and example function signatures (GQA/MHA forward variants expecting tensor arguments vs. scalar kwargs). While marked as addressed in prior commits, confirm that all benchmark function calls match their target signatures to avoid TypeError exceptions at runtime.
🏁 Script executed:
```shell
#!/bin/bash
# Extract main() signatures from flash attention examples
echo "=== Checking function signatures ==="
for module in example_gqa_bwd example_gqa_bwd_tma_reduce_varlen example_gqa_bwd_wgmma_pipelined example_gqa_fwd_bshd example_gqa_fwd_bshd_wgmma_pipelined example_mha_bwd_bshd example_mha_bwd_bhsd example_mha_bwd_bshd_wgmma_pipelined example_mha_fwd_bhsd example_mha_fwd_bhsd_wgmma_pipelined example_mha_fwd_bshd example_mha_fwd_bshd_wgmma_pipelined example_mha_fwd_varlen; do
  echo "--- ${module}.py ---"
  fd "${module}.py" examples/flash_attention --exec ast-grep --pattern 'def main($$$): $$$' {} \; 2>/dev/null | head -5
done
```

Length of output: 6391
Critical parameter mismatches in forward function benchmark calls.
Verification confirms the original review comment is valid. All forward attention function calls (lines 65–105) pass scalar keyword arguments (batch, heads, seq_len, dim, etc.) or no arguments, but their main() signatures require tensor arguments:
- GQA/MHA forward functions (lines 65–96, 103–105): expect `Q: T.Tensor(...)`, `K: T.Tensor(...)`, `V: T.Tensor(...)`, `Output: T.Tensor(...)` but receive scalar kwargs or no args
- MHA forward varlen (lines 103–105): expects `Q_unpad: T.Tensor(...)`, `K_unpad: T.Tensor(...)`, `V_unpad: T.Tensor(...)`, `cu_seqlens_q: T.Tensor([...], "int32")` but receives `batch=4, heads=16, seq_len=512, dim=64`
These mismatches will cause TypeError exceptions at runtime when process_func attempts to invoke the target functions. Align all forward function calls with their tensor-based signatures.
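
As one possible shape of the fix, a hedged sketch of an adapter for the varlen case; the unpadded tensor shapes, dtypes, and the equal-length `cu_seqlens` construction are assumptions layered on the signature quoted above:

```python
import torch

import tilelang.tools.bench
import example_mha_fwd_varlen


def bench_example_mha_fwd_varlen():
    batch, heads, seq_len, dim = 4, 16, 512, 64
    total = batch * seq_len
    q = torch.randn(total, heads, dim, device="cuda", dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)
    out = torch.empty_like(q)
    # Equal-length sequences: cu_seqlens = [0, 512, 1024, 1536, 2048]
    cu_seqlens = torch.arange(0, total + 1, seq_len, device="cuda", dtype=torch.int32)

    def run():
        # Assumed order: Q_unpad, K_unpad, V_unpad, cu_seqlens_q, cu_seqlens_k, max_seqlen_q, Output_unpad.
        example_mha_fwd_varlen.main(q, k, v, cu_seqlens, cu_seqlens, seq_len, out)

    tilelang.tools.bench.process_func(run)
```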
🤖 Prompt for AI Agents
In examples/flash_attention/bench_example_flash_attention.py around lines 17 to
105, the benchmark helper calls for the forward GQA/MHA examples pass scalar
keyword args (batch, heads, seq_len, dim, etc.) or no args, but the referenced
main() functions expect Tensor inputs (Q, K, V, Output or unpadded Q/K/V and
cu_seqlens tensors). Replace the scalar kwargs with code that constructs and
passes appropriately-shaped T.Tensor inputs (matching the signatures: dense
Q/K/V/Output or unpadded tensors plus cu_seqlens) before calling process_func,
or wrap the main functions with small adapter lambdas that create those tensors
from the scalar parameters; ensure tensor dtypes and shapes match the expected
signatures to avoid TypeError at runtime.
| def bench_example_warp_specialize_gemm_barrierpipe_stage2():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_barrierpipe_stage2.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_copy_0_gemm_1():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_copy_0_gemm_1.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_copy_1_gemm_0():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_copy_1_gemm_0.main, M=1024, N=1024, K=1024)
|
|
| def bench_example_warp_specialize_gemm_softpipe_stage2():
|     tilelang.tools.bench.process_func(
|         example_warp_specialize_gemm_softpipe_stage2.main, M=1024, N=1024, K=1024)
🧩 Analysis chain
Verify CUDA requirements for warp-specialized GEMM benchmarks.
All four benchmark functions invoke warp-specialized GEMM kernels. Warp specialization is a CUDA-specific optimization, and similar benchmark files in this PR (e.g., bench_example_blocksparse_attention.py) were flagged for missing @tilelang.testing.requires_cuda decorators. Confirm whether these benchmarks can run on CPU or if decorators should be added.
🏁 Script executed:
```shell
#!/bin/bash
# Check if the warp specialize example modules have CUDA dependencies
echo "=== Checking for CUDA/GPU usage in warp specialize modules ==="
for module in example_warp_specialize_gemm_barrierpipe_stage2 example_warp_specialize_gemm_copy_0_gemm_1 example_warp_specialize_gemm_copy_1_gemm_0 example_warp_specialize_gemm_softpipe_stage2; do
  echo "--- Module: ${module}.py ---"
  fd "${module}.py" examples/warp_specialize --exec rg -n "@tilelang.jit|@triton.jit|torch.*cuda|\.cuda\(\)" {} \; 2>/dev/null || echo "Not found"
done
```

Length of output: 1309
---
🏁 Script executed:
```shell
#!/bin/bash
# Check if benchmark functions have CUDA decorators
echo "=== Checking benchmark functions for decorators ==="
rg -n -B2 "def bench_example_warp_specialize" examples/warp_specialize/bench_example_warp_specialize.py
Length of output: 332
---
🏁 Script executed:
```shell
#!/bin/bash
# Check blocksparse_attention benchmark file for decorator pattern
echo "=== Checking blocksparse_attention benchmark decorators ==="
rg -n -B2 "def bench_" examples/blocksparse_attention/bench_example_blocksparse_attention.py | head -30
Length of output: 464
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for @tilelang.testing.requires_cuda usage in codebase
echo "=== Searching for CUDA decorators in test/benchmark files ==="
rg -n "@tilelang.testing.requires_cuda|@tilelang.testing.requires_gpu" examples/ --type py -B1 | head -40
Length of output: 2695
---
🏁 Script executed:
```shell
#!/bin/bash
# Check warp_specialize test file for compute version requirements
echo "=== Checking warp_specialize test compute version requirements ==="
rg -n "@tilelang.testing.requires_cuda" examples/warp_specialize/test_example_warp_specialize.py -A1 | head -20
Length of output: 846
Add CUDA decorators to benchmark functions.
All four benchmark functions invoke CUDA-only warp-specialized GEMM kernels that require Ada GPU (compute capability 9.0). The corresponding test functions in test_example_warp_specialize.py have @tilelang.testing.requires_cuda and @tilelang.testing.requires_cuda_compute_version_eq(9, 0) decorators; the benchmarks must have the same decorators:
- `bench_example_warp_specialize_gemm_barrierpipe_stage2()` (line 8)
- `bench_example_warp_specialize_gemm_copy_0_gemm_1()` (line 13)
- `bench_example_warp_specialize_gemm_copy_1_gemm_0()` (line 18)
- `bench_example_warp_specialize_gemm_softpipe_stage2()` (line 23)
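
For illustration, a sketch of that decorator stack applied to one of the four wrappers; the decorator order and the import layout are assumptions, while the call itself mirrors the benched code above:

```python
import tilelang.testing
import tilelang.tools.bench
import example_warp_specialize_gemm_softpipe_stage2


@tilelang.testing.requires_cuda
@tilelang.testing.requires_cuda_compute_version_eq(9, 0)
def bench_example_warp_specialize_gemm_softpipe_stage2():
    # Skipped on hosts without CUDA or without the required compute capability.
    tilelang.tools.bench.process_func(
        example_warp_specialize_gemm_softpipe_stage2.main, M=1024, N=1024, K=1024)
```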
🤖 Prompt for AI Agents
In examples/warp_specialize/bench_example_warp_specialize.py around lines 8–25,
the four benchmark functions call CUDA-only Ada (compute capability 9.0)
warp-specialized GEMM kernels but lack the CUDA decorators; add
@tilelang.testing.requires_cuda and
@tilelang.testing.requires_cuda_compute_version_eq(9, 0) above each of these
functions (bench_example_warp_specialize_gemm_barrierpipe_stage2,
bench_example_warp_specialize_gemm_copy_0_gemm_1,
bench_example_warp_specialize_gemm_copy_1_gemm_0,
bench_example_warp_specialize_gemm_softpipe_stage2); if tilelang.testing is not
already imported in this file, add the appropriate import so the decorators
resolve.
|
/perf |
|
/perf |
|
/perf |
Summary by CodeRabbit
New Features
Chores