@LeiWang1999
Member

This pull request adds enhancements and bug fixes across several components of the TileLang compiler and runtime. The changes improve error handling, make pass configuration more flexible, refine layout inference and loop partitioning, and simplify maintenance. The most important changes, by category:

Error Handling Improvements:

  • Added TILELANG_CHECK macros in both cuda/common.h and hip/common.h to standardize error checking for CUDA and HIP API calls. The macros capture and log errors with detailed information, which makes failures easier to debug.

  • Enhanced kernel launch error handling in tilelang/jit/adapter/wrapper.py by checking for CUDA errors after kernel execution. Errors are logged with function-specific details, and execution halts on failure.
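
A minimal sketch of the post-launch check described above, assuming the wrapper builds host code as strings. The helper name `emit_checked_launch` and the emitted error message format are hypothetical and not taken from this PR; only `cudaGetLastError` and `cudaGetErrorString` are real CUDA runtime calls.

```python
def emit_checked_launch(func_name: str, launch_stmt: str) -> str:
    """Return C source text: a kernel launch followed by an error check.

    Hypothetical helper for illustration; the real wrapper.py may
    structure its templates differently.
    """
    lines = [
        f"{launch_stmt};",
        "{",
        "    cudaError_t err = cudaGetLastError();",
        "    if (err != cudaSuccess) {",
        # Include the function name so the log identifies the failing kernel.
        f'        fprintf(stderr, "{func_name}: %s\\n", cudaGetErrorString(err));',
        "        return -1;  // halt on failure",
        "    }",
        "}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(emit_checked_launch("main_kernel", "main_kernel<<<grid, block>>>(A, B)"))
```

Putting the function name into the message is what makes the log "function-specific": when several kernels are launched from one host function, the failing one can be identified without a debugger.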

Layout and Loop Optimization:

  • Updated the LoopPartitioner class in loop_partition.cc to handle fragment buffers more effectively. New logic avoids replicating loop layouts for fragment buffers, improving performance for certain workloads.

  • Modified the InferLayout function in parallel.cc to prefer non-replicated buffers as the source for layout inference, which makes the inferred layout more accurate.
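
The buffer-selection idea can be sketched roughly as follows. This is a simplified Python model, not the C++ code in parallel.cc: the `FragmentInfo` type, the `replicate_extent` field, and the fallback to the first candidate are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FragmentInfo:
    """Hypothetical stand-in for a fragment buffer and its layout."""
    name: str
    replicate_extent: int  # assumption: 1 means the layout is not replicated


def pick_source_buffer(candidates: Sequence[FragmentInfo]) -> Optional[FragmentInfo]:
    """Prefer a non-replicated buffer as the source for layout inference."""
    if not candidates:
        return None
    for buf in candidates:
        if buf.replicate_extent == 1:
            return buf
    # Assumed fallback: no non-replicated buffer exists, so use the first one.
    return candidates[0]


if __name__ == "__main__":
    bufs = [FragmentInfo("acc_rep", 4), FragmentInfo("acc", 1)]
    print(pick_source_buffer(bufs).name)
```

The intuition is that a replicated layout maps one element to several threads, so it constrains the loop layout less tightly than a non-replicated one.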

Configuration and Flexibility Enhancements:

  • Introduced a new PassConfigKey class in tilelang/transform/pass_config.py to centralize and document configuration options for TileLang compiler passes. This includes options for enabling/disabling specific optimizations.

  • Updated tilelang/engine/phase.py so that functions such as allow_tma_and_warp_specialized and allow_vectorize accept a PassContext object, enabling more flexible configuration management.
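
A minimal sketch of how the two changes above might fit together, using stand-ins so that it runs without TVM. `FakePassContext` and the key strings are hypothetical; the real keys in tilelang/transform/pass_config.py and the real signature of `allow_vectorize` may differ.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional


class PassConfigKey(str, Enum):
    """Illustrative keys only; the actual names may differ."""
    DISABLE_VECTORIZE = "example.disable_vectorize"
    DISABLE_TMA_LOWER = "example.disable_tma_lower"


@dataclass
class FakePassContext:
    """Stand-in for tvm.transform.PassContext: just a config dictionary."""
    config: Dict[str, Any] = field(default_factory=dict)


# Assumed default used when the caller does not pass a context.
_DEFAULT_CTX = FakePassContext()


def allow_vectorize(pass_ctx: Optional[FakePassContext] = None) -> bool:
    """Vectorization is allowed unless the context explicitly disables it."""
    ctx = pass_ctx if pass_ctx is not None else _DEFAULT_CTX
    return not ctx.config.get(PassConfigKey.DISABLE_VECTORIZE.value, False)


if __name__ == "__main__":
    print(allow_vectorize())
    ctx = FakePassContext({PassConfigKey.DISABLE_VECTORIZE.value: True})
    print(allow_vectorize(ctx))
```

Accepting the context as an optional argument lets callers query a specific configuration explicitly while keeping the previous zero-argument call sites working.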

Codebase Simplification and Maintenance:

  • Replaced direct imports of tvm.transform.PassContext with a unified import in tilelang/transform/__init__.py, ensuring consistency and reducing redundancy.

  • Refactored _load_tile_lang_lib in tilelang/__init__.py to include PassConfigKey, aligning it with new configuration management practices.

Minor Fixes:

  • Fixed indentation in the PREDEF_HOST_FUNC template in tilelang/jit/adapter/wrapper.py to align with coding standards.

…i#441)

* Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
* Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
…g logic

* Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
* Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
* Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
* Updated pass configuration management to streamline vectorization control in the optimization process.
LeiWang1999 merged commit 9fd936c into tile-ai:main on Apr 29, 2025
3 checks passed
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request Jul 18, 2025
* [Enhancement] Improve layout inference accuracy in ParallelOp (tile-ai#441)

* Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
* Enhanced comments to clarify the rationale behind buffer selection in layout inference process.

* [Enhancement] Add error handling macros and refactor loop partitioning logic

* Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
* Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
* Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
* Updated pass configuration management to streamline vectorization control in the optimization process.

* lint fix

* remove debug print
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request Jul 20, 2025