Fix/jit kernel use target #648
Conversation
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run the required checks before submitting. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes
Hello @meinie0826, I'm Gemini Code Assist[^1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily addresses a critical bug in the JITKernel's handling of compilation targets, ensuring that the specified target is correctly applied during kernel compilation. Beyond this fix, the changes introduce substantial improvements across the codebase, including a refined type system for kernel definitions, a simplified API for autotuning and JIT compilation, and the addition of several new high-performance kernels with accompanying benchmarks and documentation. The project's build system and licensing information have also been updated to reflect ongoing development and collaboration.
Highlights
- **Bug Fix:** Corrected an issue in the `JITKernel` class where the `target` parameter was converted but not assigned to `self.target`, preventing its proper utilization in subsequent compilations.
- **API Refinement & Type System Update:** Updated the `tilelang.language` API, transitioning from `T.Buffer` to more semantically precise types like `T.Tensor`, `T.SharedBuffer`, and `T.FragmentBuffer` in kernel definitions and macros. This enhances clarity and enables more advanced optimizations.
- **Autotuning and JIT Compilation Simplification:** Streamlined the usage of the `@autotune` and `@jit` decorators, allowing them to be applied directly to the outer kernel functions, simplifying the API for users.
- **New Kernel Implementations & Benchmarks:** Introduced new high-performance kernels including sparse matrix multiplication (`T.gemm_sp`), FP8 matrix multiplication, and optimized FlashMLA implementations. Accompanying benchmarks and examples demonstrate their performance and usage.
- **Build System & Documentation Enhancements:** Updated CMake configurations to allow non-forceful build type setting and enable compile command export. Added new Dockerfiles for CUDA 12.8 and ROCm. Significantly expanded documentation with new tutorials (e.g., FlashMLA, elementwise, GEMV) and an API reference, along with a copyright update.
- **Submodule Updates:** Updated the `cutlass` and `tvm` submodules to their latest commits, bringing in upstream improvements and fixes.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either `/gemini <command>` or `@gemini-code-assist <command>`. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | `/gemini review` | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | `/gemini summary` | Provides a summary of the current pull request in its current state. |
| Comment | `@gemini-code-assist` | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | `/gemini help` | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a `.gemini/` folder at the root of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.
You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
[^1]: Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution. ↩
Code Review
This pull request introduces a fix in `tilelang/jit/jit.py` to correctly utilize the compilation target. It also includes refactoring of autotuner and benchmark scripts for better maintainability, and updates the API from `T.Buffer` to `T.Tensor` for improved code clarity. I have provided feedback on minor inconsistencies with the new refactoring patterns to ensure codebase consistency.
This config-generation section can be improved for readability and conciseness. Consider using a dictionary to define the parameters and then generating the configurations with `itertools.product`. This approach is more scalable and easier to maintain.
Diff context (excerpt of the surrounding docstring):

```python
    M : int
        The dimension M of the matrix multiplication.
    N : int
        The dimension N of the matrix multiplication.
    K : int
        The dimension K of the matrix multiplication.

    Returns
    -------
    (best_latency, best_config, ref_latency)
        best_latency : float
            The best latency found among the tuned configurations.
        best_config : dict
```

Suggested config generation:

```python
iter_params = dict(
    block_M=[64, 128, 256],
    block_N=[64, 128, 256],
    block_K=[32, 64],
    num_stages=[0, 1, 2, 3],
    thread_num=[128, 256],
    policy=[T.GemmWarpPolicy.Square],
    enable_rasterization=[True, False],
)
return [{
    k: v for k, v in zip(iter_params, values)
} for values in itertools.product(*iter_params.values())]
```
This function uses the older pattern of a nested kernel function that is then decorated. This pull request introduces a cleaner pattern in other benchmark files where the main function is directly decorated with `@autotune` and `@jit`, and it returns the `T.prim_func` directly. Refactoring this to match the new pattern would improve consistency across the codebase.
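For readers unfamiliar with the newer pattern referenced above, the sketch below shows an outer kernel factory decorated directly with `@autotune` and `@jit`, returning the `T.prim_func`. It is illustrative only, not code from this repository: the function names (`get_configs`, `matmul`), tile sizes, import paths, and decorator keyword arguments (`configs`, `warmup`, `rep`, `out_idx`) are assumptions based on common TileLang usage and may differ between versions.

```python
import itertools

import tilelang
import tilelang.language as T
from tilelang.autotuner import autotune  # assumed import path


def get_configs():
    # Enumerate tuning configs as the Cartesian product of parameter choices.
    iter_params = dict(block_M=[64, 128], block_N=[64, 128], block_K=[32, 64], num_stages=[2, 3])
    return [dict(zip(iter_params, values)) for values in itertools.product(*iter_params.values())]


@autotune(configs=get_configs(), warmup=10, rep=10)  # keyword names are assumptions
@tilelang.jit(out_idx=[-1])                          # treat the last buffer (C) as the output
def matmul(M=4096, N=4096, K=4096, block_M=128, block_N=128, block_K=32,
           num_stages=2, dtype="float16", accum_dtype="float"):

    @T.prim_func
    def main(
            A: T.Tensor((M, K), dtype),
            B: T.Tensor((K, N), dtype),
            C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    # The prim_func is returned directly; no separately decorated inner wrapper is needed.
    return main
```

The point of the pattern is that tuning parameters live in the outer function's signature, so the autotuner can sweep them without an extra nesting level.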
`docs/deeplearning_operators/gemv.md` (outdated)
This documentation seems to be using the old `T.Buffer` API. This has been updated to `T.Tensor` across most of the codebase in this PR. To maintain consistency, please update these to `T.Tensor`.
```cpp
C_reg[0] = (C_reg[0] + (((float)((half_t*)buf_dyn_shmem)[tk_1]) * ((float)((half_t*)buf_dyn_shmem)[(((((int)threadIdx.x) * 128) + tk_1) + 128)])));
  }
  tl::fence_proxy_async();
  tl::mbarrier_arrive(_mbarrier[1]);
}
```

```python
@T.prim_func
def main(
        A: T.Tensor(A_shape, in_dtype),
        B: T.Tensor(B_shape, in_dtype),
        C: T.Tensor((M, N), out_dtype),
```
The type annotation here should be `SharedBuffer` and `FragmentBuffer` instead of `Buffer`.
```python
K_shared: T.SharedBuffer([block_N, dim], dtype),
acc_s: T.FragmentBuffer([block_M, block_N], accum_dtype),
k: T.int32,
bx: T.int32,
Q_shared: T.SharedBuffer([block_M, dim], dtype),
K_shared: T.SharedBuffer([block_N, dim], dtype),
acc_s: T.FragmentBuffer([block_M, block_N], accum_dtype),
```
The type annotation here should be `SharedBuffer` and `FragmentBuffer` instead of `Buffer`.
```python
acc_s_cast: T.FragmentBuffer([block_M, block_N], dtype),
acc_o: T.FragmentBuffer([block_M, dim], accum_dtype),
k: T.int32,
by: T.int32,
V: T.Tensor(shape, dtype),
V_shared: T.SharedBuffer([block_M, dim], dtype),
acc_s_cast: T.FragmentBuffer([block_M, block_N], dtype),
```
The type annotation here should be `FragmentBuffer` instead of `Buffer`.
```python
acc_s: T.FragmentBuffer([block_M, block_N], accum_dtype),
acc_s_cast: T.FragmentBuffer([block_M, block_N], dtype),
scores_max: T.FragmentBuffer([block_M], accum_dtype),
scores_max_prev: T.FragmentBuffer([block_M], accum_dtype),
scores_scale: T.FragmentBuffer([block_M], accum_dtype),
scores_sum: T.FragmentBuffer([block_M], accum_dtype),
```

Force-pushed from `c3e5300` to `fec9b93` (Compare).
I'm sorry, I forgot to pull the latest PR and pushed the wrong code. Gemini has generated many comments on it; don't worry.
refactor: Remove unused target attribute and adjust context management in JITKernel
- Removed the unused `target` attribute from the `JITKernel` class.
- Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.
@meinie0826 Thank you, and apologies for the inconvenience, as the main branch has been force-updated for compliance purposes. LGTM, merged :)
* fix: Copy Target to self.target
* refactor: Remove unused target attribute and adjust context management in JITKernel
  - Removed the unused `target` attribute from the `JITKernel` class.
  - Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
In the `JITKernel` class, the `target` is converted through the `Target` class, but it is not assigned to `self.target`, so it is not utilized in subsequent compilations.
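In other words, the converted target needs to be stored on the instance so that the later compile step reads it. Below is a conceptual sketch of the before/after behaviour, not the actual TileLang source: the class name `JITKernelSketch`, the constructor signature, and the `compile` body are simplified assumptions.

```python
from tvm.target import Target


class JITKernelSketch:
    """Illustration of the bug and fix; not the real tilelang.jit.JITKernel."""

    def __init__(self, func, target="cuda"):
        self.func = func
        # Buggy behaviour: the conversion result was dropped, e.g.
        #   Target(target)   # converted, but never assigned to self.target
        # Fixed behaviour: keep the converted Target on the instance.
        self.target = target if isinstance(target, Target) else Target(target)

    def compile(self):
        # With the fix, compilation runs under the target the caller requested
        # instead of silently falling back to a default one.
        with self.target:
            ...  # lowering and building would happen here
```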