Adding scripts for creating and benchmarking simple elementwise ops in CuteDSL and Triton #167
This PR serves as a starting point for benchmarking CuteDSL vs Triton to evaluate the potential benefit of CuteDSL. It mainly introduces three scripts:
create_cutedsl_ops.py: Creates four files implementing add, mul, abs, and relu using CuteDSL.
create_triton_ops.py: Creates four files implementing the same ops in Triton.
benchmark_cutedsl_vs_triton.py: Loads the kernels implemented in both CuteDSL and Triton, then benchmarks the performance of the four elementwise ops across different tensor sizes.
This PR also updates .gitignore and pyproject.toml to add dependencies for CuteDSL and the benchmark script.
Benchmark Results
To obtain the results, run:
uv run python BackendBench/scripts/benchmark_cutedsl_vs_triton.py
1. Implicitly compiled CuteDSL kernels vs Triton kernels
Initially, I benchmarked the implicitly compiled CuteDSL kernels against Triton kernels. CuteDSL's performance was significantly worse. The root cause is that, although the compiled CuteDSL program is cached, looking it up requires rebuilding the IR module so it can be compared against the cache key, and this rebuild introduces substantial overhead.
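The overhead described above can be illustrated with a toy model. This is a minimal sketch in plain Python, not the CuteDSL API: the ToyJit class, its _build_ir and precompile methods, and the string used as "IR" are all hypothetical stand-ins. It only shows why a cache keyed on the IR still pays the IR-construction cost on every call, while a handle compiled once up front does not.

```python
import hashlib


class ToyJit:
    """Toy stand-in for a JIT-wrapped kernel (hypothetical, not CuteDSL)."""

    def __init__(self, fn):
        self.fn = fn
        self.cache = {}     # cache key -> "compiled" callable
        self.ir_builds = 0  # number of times the IR was (re)built
        self.compiles = 0   # number of times compilation actually ran

    def _build_ir(self, n):
        # Stand-in for tracing the kernel into an IR module (the costly step).
        self.ir_builds += 1
        return f"module @{self.fn.__name__}(n={n})"

    def _compile(self, ir):
        # Stand-in for lowering the IR to an executable.
        self.compiles += 1
        return self.fn

    def __call__(self, xs):
        # Implicit path: the key is derived from the IR, so the IR is rebuilt
        # on every call even when the compiled program is already cached.
        ir = self._build_ir(len(xs))
        key = hashlib.sha256(ir.encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self._compile(ir)
        return self.cache[key](xs)

    def precompile(self, n):
        # Explicit path: build the IR and compile once, then hand back the
        # compiled callable so later calls skip the cache lookup entirely.
        return self._compile(self._build_ir(n))


def relu(xs):
    return [x if x > 0 else 0 for x in xs]


data = [-1.0, 2.0, -3.0, 4.0]

implicit = ToyJit(relu)
for _ in range(100):
    implicit(data)

explicit = ToyJit(relu)
compiled_relu = explicit.precompile(len(data))
for _ in range(100):
    compiled_relu(data)

print(implicit.ir_builds, implicit.compiles)  # 100 IR builds, 1 compile
print(explicit.ir_builds, explicit.compiles)  # 1 IR build, 1 compile
```

In this model both paths compile exactly once; the difference is only in how often the IR is rebuilt, which matches the root cause given above.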
2. Explicitly compiled (precompiled) CuteDSL kernels vs Triton kernels
Next, I benchmarked explicitly compiled (precompiled) CuteDSL kernels against Triton kernels. In this scenario, CuteDSL's performance was comparable to Triton.
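Comparisons like the two above depend on timing every implementation the same way at every tensor size. The following is a minimal CPU-only sketch of such a harness, not the PR's benchmark script: the bench helper, the two list-based add implementations, and the sizes are all hypothetical and chosen only so the example runs without a GPU. For real GPU kernels the timer must also account for asynchronous launches, for example by synchronizing the device or by using a GPU-aware helper such as triton.testing.do_bench.

```python
import statistics
import time


def bench(fn, args, warmup=3, iters=20):
    """Return the median wall-clock time of fn(*args) in microseconds."""
    # Warm-up calls keep one-time costs (e.g. compilation) out of the samples.
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - start) * 1e6)
    return statistics.median(samples)


def add_loop(a, b):
    # Index-based elementwise add.
    return [a[i] + b[i] for i in range(len(a))]


def add_zip(a, b):
    # zip-based elementwise add.
    return [x + y for x, y in zip(a, b)]


sizes = [2**10, 2**14]  # hypothetical sizes, not the ones used in the PR
results = {}
for n in sizes:
    a = [1.0] * n
    b = [2.0] * n
    results[n] = {
        "loop": bench(add_loop, (a, b)),
        "zip": bench(add_zip, (a, b)),
    }
    print(f"n={n}: loop={results[n]['loop']:.1f} us, zip={results[n]['zip']:.1f} us")
```

Because the warm-up calls run before any sample is taken, a one-time compilation cost is excluded, but a per-call overhead such as the cache-key rebuild in the implicit path would still appear in every sample.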
Key takeaways