Carver: A Tile-Structure Based Hint Recommendation Framework for Machine Learning Compilers

Carver is a lightweight framework for generating and ranking tile configurations (also known as tiling strategies, blocking schemes, or scheduling hints) for common GPU, CPU, and accelerator backends. It helps you explore efficient mappings of loops for operations such as matrix multiplication, elementwise transforms, and other reduction-oriented kernels.

Carver combines hardware architecture information, user-defined tile structures, and built-in heuristics to recommend tiling strategies (or "hints"). The recommended hints are easily adaptable to multiple backends, including TVM, Triton, and tilelang, as well as other domain-specific compilers.


Key Features

  • Unified Tiling Framework: Generate tile candidates for multiple backends under a unified API.
  • Architecture-Specific Modeling: Take architecture constraints into account (e.g., CUDA smem_cap, warp size, CPU cache structure) when generating hints; see the sketch after this list.
  • Flexible Templates: High-level templates (such as MatmulTemplate, GeneralReductionTemplate, and ElementwiseTemplate) let you concisely specify kernel structures.
  • Extensible: Easily add support for new backends and new operation templates.
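
For instance, the architecture object exposes these constraints as plain attributes. A minimal sketch (the attribute names follow the CUDA backend snippet shown later in this description):

from tilelang.carver.arch import CUDA

# Constraints consulted during hint generation
# (attributes as defined in the CUDA backend shown below).
arch = CUDA("nvidia/geforce-rtx-4090")
print(arch.smem_cap)          # shared memory capacity per block, in bytes
print(arch.warp_size)         # threads per warp
print(arch.compute_max_core)  # number of streaming multiprocessors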

Usage Examples

Basic Usage: General Reduction Template

Once tilelang is installed, you can import Carver and start creating templates:

from tilelang import carver
from tilelang.carver.arch import CUDA

# Instantiate a CUDA device object for an RTX 4090
arch = CUDA("nvidia/geforce-rtx-4090")

# Create a general reduction template for a loop nest:
# for i in Spatial(1024):
#     for j in Spatial(1024):
#         for k in Reduce(1024):
#             ...
carve_template = carver.GeneralReductionTemplate(
    structure="SSR",          
    shape=[1024, 1024, 1024], 
    dtype="float16",
).with_arch(arch)

# Generate top 20 tile candidates (aka scheduling hints)
hints = carve_template.recommend_hints(topk=20)
for hint in hints:
    print(hint)

Example Output (truncated):

{
  'block': [1, 128],
  'thread': [1, 128],
  'rstep': [64],
  ...
},
{
  'block': [2, 64],
  'thread': [2, 64],
  'rstep': [64],
  ...
},
...
{
  'block': [1, 16],
  'thread': [1, 16],
  'rstep': [512],
  'reduce_thread': [8],
  ...
}

A tile structure composed of S (spatial) and R (reduce) axes can express a wide range of cases. For example, the structure SS represents a 2D elementwise operation, while SSR can represent a general matrix multiplication.
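
For instance, a minimal sketch of the SS (2D elementwise) case, reusing the same API as above:

from tilelang import carver
from tilelang.carver.arch import CUDA

arch = CUDA("nvidia/geforce-rtx-4090")

# Two spatial loops, no reduction: a 2D elementwise operation.
elementwise = carver.GeneralReductionTemplate(
    structure="SS",
    shape=[1024, 1024],
    dtype="float16",
).with_arch(arch)

for hint in elementwise.recommend_hints(topk=5):
    print(hint)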

More specialized templates, such as MatmulTemplate, provide finer-grained information.

Matmul Template

Carver also provides a specialized MatmulTemplate for matrix multiplication (e.g., C = A * B), automatically inferring common tiling strategies (thread blocks, warps, use of tensor cores, etc.).

from tilelang import carver
from tilelang.carver.arch import CUDA

arch = CUDA("nvidia/geforce-rtx-4090")
carve_template = carver.MatmulTemplate(
    M=1024,
    N=1024,
    K=1024,
    in_dtype="float16",
    accum_dtype="float16",
    out_dtype="float16",
).with_arch(arch)

# Retrieve the (symbolic) function describing the matmul
func = carve_template.equivalent_function()
print("Equivalent Function:\n", func)

# Generate hints
hints = carve_template.recommend_hints(topk=20)
for hint in hints:
    print(hint)

Example Output:

{
  'block': [32, 64],
  'warp': [16, 32],
  'rstep': [128],
  'use_tc': True,
  ...
},
{
  'block': [64, 32],
  'warp': [32, 16],
  'rstep': [128],
  'use_tc': True,
  ...
},
...
{
  'block': [256, 32],
  'warp': [128, 16],
  'rstep': [32],
  'use_tc': True,
  ...
}

Supported Architectures

Carver currently provides out-of-the-box support for:

  • CUDA: e.g., arch = CUDA("nvidia/geforce-rtx-4090")
  • CDNA (AMD GPU-like backends)
  • CPU

Adding a new architecture is as simple as implementing a new subclass of TileDevice or providing a custom target that describes:

  • Shared/local memory capacity
  • Warp (or vector) size
  • Cache sizes
  • Tensor instructions available

Below is an illustrative snippet of the CUDA backend:

class CUDA(TileDevice):
    def __init__(self, target: Union[tvm.target.Target, str]):
        ...
        self.platform = "CUDA"
        # Device constraints
        self.smem_cap = device.max_shared_memory_per_block
        self.compute_max_core = device.multi_processor_count
        self.warp_size = device.warp_size
        ...
        self.transaction_size = [32, 128]  # bytes
        self.bandwidth = [750, 12080]     # MB/s, approximate
        self.available_tensor_instructions = None

    def get_avaliable_tensorintrin_shapes(self):
        self.available_tensor_instructions = (
            TensorInstruction("mma", [16, 16]),
            TensorInstruction("wmma", [16, 16]),
        )
        return [t.shape for t in self.available_tensor_instructions]

    def __repr__(self):
        return f"CUDA({self.target})"

Adapting Hints to Other Compilers

One of Carver’s main benefits is its adaptability. Here is an example for Triton:

Given a Carver hint like:

{
  'block': [32, 64],
  'warp': [16, 32],
  'rstep': [128],
  'use_tc': True,
  'vectorize': {'A_reindex': 8, 'B_reindex': 8}
}

You might interpret this in Triton as:

  • block_m = 32, block_n = 64, block_k = 128
  • Warp-level tiling: warp_m = 16, warp_n = 32
  • vectorize: load data with a vector width of 8
  • If use_tc is true, consider using Tensor Cores (e.g., via tl.dot in Triton) if supported.

This makes it quick to test multiple configurations without manual guesswork; a sketch of one possible translation follows.
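
Here is a minimal, hypothetical sketch of that translation. The helper hint_to_triton_config and the output key names are our own illustration, not part of Carver or Triton; only the input key names come from the hint shown above.

# Hypothetical helper: translate a Carver hint into Triton-style
# launch parameters. num_warps is derived from the block/warp split.
def hint_to_triton_config(hint):
    block_m, block_n = hint["block"]
    warp_m, warp_n = hint["warp"]
    (block_k,) = hint["rstep"]
    # One warp per warp-level tile inside the thread block:
    # (32 // 16) * (64 // 32) = 4 for the hint above.
    num_warps = (block_m // warp_m) * (block_n // warp_n)
    return {
        "BLOCK_M": block_m,
        "BLOCK_N": block_n,
        "BLOCK_K": block_k,
        "num_warps": num_warps,
        "use_tensor_cores": hint.get("use_tc", False),
    }

hint = {
    "block": [32, 64],
    "warp": [16, 32],
    "rstep": [128],
    "use_tc": True,
    "vectorize": {"A_reindex": 8, "B_reindex": 8},
}
print(hint_to_triton_config(hint))
# {'BLOCK_M': 32, 'BLOCK_N': 64, 'BLOCK_K': 128,
#  'num_warps': 4, 'use_tensor_cores': True}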

Supported Templates

Carver abstracts common loop patterns through templates:

  • GeneralReductionTemplate: For general Spatial-Spatial-Reduce (SSR) structures or similar.
  • MatmulTemplate: For standard matrix multiplication C = A * B.
  • GEMVTemplate: For y = Ax or y = xA style operations (see the sketch after this list).
  • ElementwiseTemplate: For elementwise transformations or pointwise ops.
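
A minimal sketch of GEMVTemplate, assuming its constructor mirrors MatmulTemplate's; the exact parameter names are an assumption and may differ in the actual API:

from tilelang import carver
from tilelang.carver.arch import CUDA

arch = CUDA("nvidia/geforce-rtx-4090")

# Assumption: GEMVTemplate follows MatmulTemplate's constructor style.
gemv_template = carver.GEMVTemplate(
    N=1024,
    K=1024,
    in_dtype="float16",
    accum_dtype="float16",
    out_dtype="float16",
).with_arch(arch)

for hint in gemv_template.recommend_hints(topk=10):
    print(hint)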

You can also create your own specialized templates if you have unique loop structures or constraints. For instance, you might define specialized templates for convolution, flash attention, etc.

TODO Items

  • Flash Attention and its variants: Support search-space generation for specialized attention kernels.
  • Adapt to tile language: Provide ready-made scheduling calls or wrappers for tilelang to streamline end-to-end integration.

LeiWang1999 changed the title from "[Carver] Introduce a tile-structure based cost model" to "[Carver] Introduce a tile-structure based cost model for auto tuning" on Feb 10, 2025.
LeiWang1999 merged commit 5e98a8b into tile-ai:main on Feb 10, 2025.