Adding scripts for creating and benchmarking simple elementwise ops in CuteDSL and Triton #167

jiannanWang · 2025-09-17T23:35:02Z

This PR serves as a starting point for benchmarking CuteDSL vs Triton to evaluate the potential benefit of CuteDSL. It mainly introduces three scripts:

create_cutedsl_ops.py: Creates four files implementing add, mul, abs, relu using CuteDSL.
create_triton_ops.py: Creates four files implementing the same ops in Triton.
benchmark_cutedsl_vs_triton.py: Loads the kernels implemented in both CuteDSL and Triton, then benchmarks the four elementwise ops across different tensor sizes.
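
For readers who want a feel for the shape of such a benchmark, here is a minimal sketch of a timing loop. It is not taken from benchmark_cutedsl_vs_triton.py: the names `time_op`, `REFERENCE_OPS`, and `run_benchmark` are hypothetical, and plain Python list operations stand in for real kernel launches so the sketch runs without a GPU. On a GPU, the timed region would also need a device synchronization (for example `torch.cuda.synchronize()`) before the clock is read.

```python
import statistics
import time


def time_op(fn, *args, warmup=3, iters=10):
    """Return the median wall-clock time of fn(*args) in milliseconds."""
    # Warm-up calls absorb one-time costs such as compilation.
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU stand-ins for the four ops (hypothetical, not the generated kernels).
# Unary ops ignore the second argument so all four share one signature.
REFERENCE_OPS = {
    "add": lambda x, y: [a + b for a, b in zip(x, y)],
    "mul": lambda x, y: [a * b for a, b in zip(x, y)],
    "abs": lambda x, y: [abs(a) for a in x],
    "relu": lambda x, y: [a if a > 0 else 0.0 for a in x],
}


def run_benchmark(sizes=(256, 1024, 4096)):
    """Time every op at every size; returns {(op_name, size): median_ms}."""
    results = {}
    for n in sizes:
        x = [float(i - n // 2) for i in range(n)]  # mix of negatives and positives
        y = [1.0] * n
        for name, op in REFERENCE_OPS.items():
            results[(name, n)] = time_op(op, x, y)
    return results
```

Calling `run_benchmark()` returns one median timing per (op, size) pair. The median is used here because it is less sensitive to occasional slow iterations than the mean.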

This PR also updates .gitignore and pyproject.toml to add dependencies for CuteDSL and the benchmark script.

Benchmark Results

To reproduce the results, run: uv run python BackendBench/scripts/benchmark_cutedsl_vs_triton.py

1. Implicitly compiled CuteDSL kernels vs Triton kernels

Initially, I benchmarked the implicitly compiled CuteDSL kernels against Triton kernels. CuteDSL was far slower: roughly 38 to 60 ms per call versus 0.007 to 0.030 ms for Triton. The root cause is that although the compiled CuteDSL program is cached, computing the cache key requires rebuilding the IR module so it can be compared against the cached entry, which introduces substantial overhead on each call.
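
The effect described above can be illustrated with a toy model. This is not CuteDSL's actual caching code: `build_ir`, `call_implicit`, and `precompile` are hypothetical names, and a loop of SHA-256 hashes stands in for the cost of constructing an IR module. The sketch only shows why a cache hit can still be expensive when the key itself is costly to compute.

```python
import hashlib

stats = {"builds": 0}  # counts how often the expensive "IR build" runs


def build_ir(op_name, dtype):
    """Stand-in for an expensive IR construction step (hypothetical)."""
    stats["builds"] += 1
    text = f"{op_name}:{dtype}"
    # Burn some CPU so the cost would be visible in timings.
    for _ in range(2000):
        text = hashlib.sha256(text.encode()).hexdigest()
    return text


def _relu(values):
    return [max(v, 0.0) for v in values]


_implicit_cache = {}


def call_implicit(op_name, dtype, x):
    # The key is derived from the IR, so the IR is rebuilt on every call,
    # even though the kernel itself is found in the cache.
    key = build_ir(op_name, dtype)
    kernel = _implicit_cache.setdefault(key, _relu)
    return kernel(x)


def precompile(op_name, dtype):
    # Build once up front, then return a callable that skips the key step.
    build_ir(op_name, dtype)
    return _relu
```

In this model, five calls to `call_implicit("relu", "float32", x)` trigger five IR builds, while `precompile("relu", "float32")` triggers one build no matter how many times the returned callable is invoked afterwards.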

========================================================================================================================
TABLE 1: CUTEDSL vs TRITON KERNEL BENCHMARK RESULTS
========================================================================================================================
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| Shape     | Elements   | relu_cutedsl   | relu_triton   | relu_speedup   | add_cutedsl   | add_triton   | add_speedup   | mul_cutedsl   | mul_triton   | mul_speedup   | abs_cutedsl   | abs_triton   | abs_speedup   |
+===========+============+================+===============+================+===============+==============+===============+===============+==============+===============+===============+==============+===============+
| 512x512   | 262,144    | 42.817 ms      | 0.007 ms      | 0.00x          | 39.164 ms     | 0.007 ms     | 0.00x         | 38.093 ms     | 0.007 ms     | 0.00x         | 38.301 ms     | 0.007 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| 1024x1024 | 1,048,576  | 40.839 ms      | 0.009 ms      | 0.00x          | 40.654 ms     | 0.012 ms     | 0.00x         | 60.473 ms     | 0.012 ms     | 0.00x         | 43.927 ms     | 0.010 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| 2048x2048 | 4,194,304  | 39.190 ms      | 0.021 ms      | 0.00x          | 40.550 ms     | 0.030 ms     | 0.00x         | 41.715 ms     | 0.030 ms     | 0.00x         | 39.835 ms     | 0.021 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
========================================================================================================================
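
Because the speedup columns in Table 1 round to 0.00x, the size of the gap is easier to see as a slowdown factor. The sketch below assumes that speedup is defined as Triton time divided by CuteDSL time. The PR does not state this definition, but it is consistent with the values in both tables. The helper names are hypothetical, and since the Triton timings are rounded to three decimals, the resulting factors are approximate.

```python
# Timings in milliseconds copied from the 512x512 row of Table 1,
# stored as (cutedsl_ms, triton_ms).
table1_512 = {
    "relu": (42.817, 0.007),
    "add": (39.164, 0.007),
    "mul": (38.093, 0.007),
    "abs": (38.301, 0.007),
}


def speedup(cutedsl_ms, triton_ms):
    # Assumed definition: values below 1.0 mean CuteDSL is slower.
    return triton_ms / cutedsl_ms


def slowdown(cutedsl_ms, triton_ms):
    # Inverse of speedup: how many times slower CuteDSL is than Triton.
    return cutedsl_ms / triton_ms


for op, (cutedsl_ms, triton_ms) in table1_512.items():
    print(
        f"{op}: speedup {speedup(cutedsl_ms, triton_ms):.2f}x, "
        f"slowdown {slowdown(cutedsl_ms, triton_ms):,.0f}x"
    )
```

For this row, the implicitly compiled CuteDSL path comes out several thousand times slower than Triton.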

2. Explicitly compiled (precompiled) CuteDSL kernels vs Triton kernels

Next, I benchmarked explicitly compiled (precompiled) CuteDSL kernels against Triton kernels. In this scenario, CuteDSL's performance was comparable to Triton's, though slightly slower in every case (0.86x to 0.97x).

========================================================================================================================
TABLE 2: PRECOMPILED CUTEDSL vs TRITON KERNEL BENCHMARK RESULTS
========================================================================================================================
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| Shape     | Elements   | relu_precompiled   | relu_triton   | relu_speedup   | add_precompiled   | add_triton   | add_speedup   | mul_precompiled   | mul_triton   | mul_speedup   | abs_precompiled   | abs_triton   | abs_speedup   |
+===========+============+====================+===============+================+===================+==============+===============+===================+==============+===============+===================+==============+===============+
| 512x512   | 262,144    | 0.007 ms           | 0.007 ms      | 0.96x          | 0.008 ms          | 0.007 ms     | 0.97x         | 0.008 ms          | 0.007 ms     | 0.96x         | 0.007 ms          | 0.007 ms     | 0.95x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| 1024x1024 | 1,048,576  | 0.010 ms           | 0.009 ms      | 0.91x          | 0.013 ms          | 0.012 ms     | 0.92x         | 0.013 ms          | 0.012 ms     | 0.94x         | 0.011 ms          | 0.010 ms     | 0.89x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| 2048x2048 | 4,194,304  | 0.023 ms           | 0.021 ms      | 0.90x          | 0.034 ms          | 0.030 ms     | 0.86x         | 0.035 ms          | 0.030 ms     | 0.86x         | 0.024 ms          | 0.021 ms     | 0.87x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
========================================================================================================================

Key takeaways

CuteDSL's cache implementation introduces significant performance overhead ([QST] CuteDSL Caching Overhead NVIDIA/cutlass#2643). One alternative is to manage a custom cache, as in Tri's implementation: https://github.com/Dao-AILab/quack/blob/main/quack/dense_gemm_sm90.py#L2192C5-L2236C6.
Even with precompiled kernels, CuteDSL does not outperform Triton for elementwise operations. This is expected, as CuteDSL is designed to provide more control to users, enabling them to write potentially more efficient programs, rather than delivering better performance out-of-the-box.
As a next step, I plan to check out GEMM, where CuteDSL may offer performance improvements.
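
The custom cache mentioned in the first takeaway could look roughly like the sketch below. It is an illustration of the general pattern only, not the quack implementation and not a CuteDSL API: `KernelCache` and `fake_compile` are hypothetical, and `fake_compile` returns plain Python callables in place of compiled kernels. The idea is to key the cache on a small tuple that is cheap to build and compare, so a cache hit costs one dictionary lookup.

```python
class KernelCache:
    """Cache compiled kernels under a cheap, hashable key."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # hypothetical expensive compile step
        self._kernels = {}
        self.compile_count = 0

    def get(self, op_name, dtype):
        key = (op_name, dtype)  # cheap to build and compare
        kernel = self._kernels.get(key)
        if kernel is None:
            # Cache miss: compile once and remember the result.
            kernel = self._compile_fn(op_name, dtype)
            self._kernels[key] = kernel
            self.compile_count += 1
        return kernel


def fake_compile(op_name, dtype):
    # Stand-in for a real compile call; dtype is unused in this toy version.
    ops = {
        "add": lambda x, y: [a + b for a, b in zip(x, y)],
        "mul": lambda x, y: [a * b for a, b in zip(x, y)],
    }
    return ops[op_name]


cache = KernelCache(fake_compile)
add_kernel = cache.get("add", "float32")   # compiles
add_again = cache.get("add", "float32")    # served from the cache
```

One design caveat: the key must include everything that changes the generated code (for example the dtype or any compile-time constants), otherwise the cache can return a kernel compiled for a different configuration.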

Benchmark elementwise op in cutedsl and triton

ace7a62

jiannanWang requested review from PaliC and msaroufim September 17, 2025 23:35

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) on Sep 18, 2025
