@jiannanWang jiannanWang commented Sep 17, 2025

This PR is a starting point for benchmarking CuteDSL against Triton to evaluate CuteDSL's potential benefit. It introduces three scripts:

  • create_cutedsl_ops.py: Creates four files implementing add, mul, abs, and relu in CuteDSL.
  • create_triton_ops.py: Creates four files implementing the same ops in Triton.
  • benchmark_cutedsl_vs_triton.py: Loads the CuteDSL and Triton kernels, then benchmarks the four elementwise ops across different tensor sizes.

This PR also updates .gitignore and pyproject.toml to add dependencies for CuteDSL and the benchmark script.
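For orientation, the dependency change might look roughly like the sketch below. The package names (nvidia-cutlass-dsl for CuteDSL, tabulate for the grid-style benchmark tables) are assumptions for illustration, not taken from the diff:

```toml
# Hypothetical sketch of the pyproject.toml change; see the PR diff for the
# actual package names and version pins.
[project.optional-dependencies]
cutedsl = ["nvidia-cutlass-dsl"]   # assumed CuteDSL distribution name
benchmark = ["tabulate"]           # assumed, for the grid-style tables below
```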

Benchmark Results

Run uv run python BackendBench/scripts/benchmark_cutedsl_vs_triton.py to reproduce the results below.

1. Implicitly compiled CuteDSL kernels vs Triton kernels

Initially, I benchmarked the implicitly compiled CuteDSL kernels against the Triton kernels. CuteDSL's performance was orders of magnitude worse. The root cause: although the compiled CuteDSL program is cached, computing the cache key requires rebuilding the IR module for comparison, which adds substantial overhead to every call.
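The cost of an expensive cache key can be illustrated without CuteDSL at all. In the plain-Python sketch below, expensive_key is a hypothetical stand-in that only mimics rebuilding an IR module for comparison; it shows that a cache whose key computation is costly stays slow even on cache hits:

```python
import time

def make_cache(key_fn):
    """A tiny memoization cache; key_fn runs on *every* lookup, hit or miss."""
    cache = {}
    def lookup(arg):
        k = key_fn(arg)
        if k not in cache:
            cache[k] = arg * 2  # stand-in for "compile the kernel"
        return cache[k]
    return lookup

def expensive_key(arg):
    # Stand-in for rebuilding the IR module just to compute the cache key.
    return sum(i for i in range(200_000)) + arg

def cheap_key(arg):
    # Stand-in for a cheap key (e.g. a precomputed hash).
    return arg

slow = make_cache(expensive_key)
fast = make_cache(cheap_key)
slow(1); fast(1)  # warm both caches

t0 = time.perf_counter()
for _ in range(100):
    slow(1)  # every hit still pays the key-construction cost
t_slow = time.perf_counter() - t0

t0 = time.perf_counter()
for _ in range(100):
    fast(1)
t_fast = time.perf_counter() - t0

print(t_slow > t_fast)  # the expensive-key cache dominates total time
```

This mirrors the pattern in Table 1: the kernel itself is fast once compiled, but the per-call key construction swamps it.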

========================================================================================================================
TABLE 1: CUTEDSL vs TRITON KERNEL BENCHMARK RESULTS
========================================================================================================================
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| Shape     | Elements   | relu_cutedsl   | relu_triton   | relu_speedup   | add_cutedsl   | add_triton   | add_speedup   | mul_cutedsl   | mul_triton   | mul_speedup   | abs_cutedsl   | abs_triton   | abs_speedup   |
+===========+============+================+===============+================+===============+==============+===============+===============+==============+===============+===============+==============+===============+
| 512x512   | 262,144    | 42.817 ms      | 0.007 ms      | 0.00x          | 39.164 ms     | 0.007 ms     | 0.00x         | 38.093 ms     | 0.007 ms     | 0.00x         | 38.301 ms     | 0.007 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| 1024x1024 | 1,048,576  | 40.839 ms      | 0.009 ms      | 0.00x          | 40.654 ms     | 0.012 ms     | 0.00x         | 60.473 ms     | 0.012 ms     | 0.00x         | 43.927 ms     | 0.010 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
| 2048x2048 | 4,194,304  | 39.190 ms      | 0.021 ms      | 0.00x          | 40.550 ms     | 0.030 ms     | 0.00x         | 41.715 ms     | 0.030 ms     | 0.00x         | 39.835 ms     | 0.021 ms     | 0.00x         |
+-----------+------------+----------------+---------------+----------------+---------------+--------------+---------------+---------------+--------------+---------------+---------------+--------------+---------------+
========================================================================================================================

2. Explicitly compiled (precompiled) CuteDSL Kernels vs Triton Kernels

Next, I benchmarked explicitly compiled (precompiled) CuteDSL kernels against Triton kernels. In this scenario, CuteDSL's performance was comparable to Triton.

========================================================================================================================
TABLE 2: PRECOMPILED CUTEDSL vs TRITON KERNEL BENCHMARK RESULTS
========================================================================================================================
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| Shape     | Elements   | relu_precompiled   | relu_triton   | relu_speedup   | add_precompiled   | add_triton   | add_speedup   | mul_precompiled   | mul_triton   | mul_speedup   | abs_precompiled   | abs_triton   | abs_speedup   |
+===========+============+====================+===============+================+===================+==============+===============+===================+==============+===============+===================+==============+===============+
| 512x512   | 262,144    | 0.007 ms           | 0.007 ms      | 0.96x          | 0.008 ms          | 0.007 ms     | 0.97x         | 0.008 ms          | 0.007 ms     | 0.96x         | 0.007 ms          | 0.007 ms     | 0.95x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| 1024x1024 | 1,048,576  | 0.010 ms           | 0.009 ms      | 0.91x          | 0.013 ms          | 0.012 ms     | 0.92x         | 0.013 ms          | 0.012 ms     | 0.94x         | 0.011 ms          | 0.010 ms     | 0.89x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
| 2048x2048 | 4,194,304  | 0.023 ms           | 0.021 ms      | 0.90x          | 0.034 ms          | 0.030 ms     | 0.86x         | 0.035 ms          | 0.030 ms     | 0.86x         | 0.024 ms          | 0.021 ms     | 0.87x         |
+-----------+------------+--------------------+---------------+----------------+-------------------+--------------+---------------+-------------------+--------------+---------------+-------------------+--------------+---------------+
========================================================================================================================
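On reading the speedup columns: assuming the script computes speedup as Triton time divided by CuteDSL time (an inference from the tables, not confirmed by the source), values below 1.00x mean Triton is faster. A minimal sketch:

```python
def speedup(triton_ms: float, cutedsl_ms: float) -> float:
    """Speedup of the CuteDSL kernel relative to Triton; > 1.0 means CuteDSL wins."""
    return triton_ms / cutedsl_ms

# 2048x2048 relu row from Table 2: 0.021 ms (Triton) vs 0.023 ms (precompiled CuteDSL)
print(f"{speedup(0.021, 0.023):.2f}x")
```

The printed value may differ slightly from the table (e.g. 0.91x vs 0.90x), since the table is presumably computed from unrounded timings.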

Key takeaways

  • Implicitly compiled CuteDSL kernels are orders of magnitude slower than Triton, because the cache lookup rebuilds the IR module on every call.
  • Precompiled CuteDSL kernels are within roughly 15% of Triton across all four ops and all tested shapes, so explicit compilation (or a cheaper cache key) is needed for CuteDSL to be competitive.
