@LeiWang1999

This pull request introduces a new cumulative sum (cumsum) operation in the TileLang framework, along with its implementation, integration, and testing. The changes include adding the CumSumOp operator, implementing the CUDA kernel for cumsum, updating the TileLang language interface, and adding comprehensive tests for both shared memory and fragment scopes.

New Feature: cumsum Operation

  • Operator Implementation:

    • Added CumSumOp class in src/op/reduce.h and its corresponding registration in src/op/reduce.cc. This operator supports cumulative summation along a specified dimension, with an option for reverse mode. It includes methods for lowering to target-specific code and layout inference.
  • CUDA Kernel:

    • Implemented the CumSum2D template in src/tl_templates/cuda/reduce.h for efficient cumulative summation on 2D data. It supports both forward and reverse modes and handles different thread configurations.
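For readers unfamiliar with the semantics, the following is a minimal pure-Python sketch of what a 2D cumulative sum along a chosen dimension computes, including reverse mode. It is only a reference model, not the CumSumOp lowering or the CumSum2D CUDA template; the names `cumsum_1d` and `cumsum_2d` are hypothetical, and reverse mode is assumed to mean "accumulate from the last element toward the first" (equivalent to flip, scan, flip).

```python
from itertools import accumulate
from typing import List

Matrix = List[List[float]]


def cumsum_1d(values: List[float], reverse: bool = False) -> List[float]:
    """Inclusive cumulative sum of a 1D sequence (hypothetical helper)."""
    if reverse:
        # Assumed reverse semantics: out[i] = sum(values[i:]).
        return list(accumulate(reversed(values)))[::-1]
    # Forward semantics: out[i] = sum(values[:i + 1]).
    return list(accumulate(values))


def cumsum_2d(tile: Matrix, dim: int = 0, reverse: bool = False) -> Matrix:
    """Reference cumulative sum over a 2D tile along `dim` (0 or 1)."""
    if dim not in (0, 1):
        raise ValueError("dim must be 0 or 1 for a 2D tile")
    if dim == 1:
        # Scan each row independently.
        return [cumsum_1d(row, reverse) for row in tile]
    # dim == 0: scan each column, then transpose back to row-major order.
    scanned_cols = [cumsum_1d(list(col), reverse) for col in zip(*tile)]
    return [list(row) for row in zip(*scanned_cols)]


tile = [[1, 2, 3], [4, 5, 6]]
print(cumsum_2d(tile, dim=1))                # [[1, 3, 6], [4, 9, 15]]
print(cumsum_2d(tile, dim=0, reverse=True))  # [[5, 7, 9], [4, 5, 6]]
```

A reference like this is useful mainly as an oracle when checking a GPU kernel's output on small tiles.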

Integration into TileLang

  • Language Extension:
    • Added a cumsum macro and function to the TileLang language interface in tilelang/language/reduce.py, allowing users to invoke the cumsum operation on buffers. The implementation supports both shared memory and fragment scopes.
    • Registered cumsum in the TileLang language module (tilelang/language/__init__.py).

Testing

  • New Tests for cumsum:

    • Added test_tilelang_language_cumsum.py to validate the cumsum operation for both shared memory and fragment scopes. It includes tests for different data types, dimensions, and reverse modes.
  • Improved Debugging:

    • Enhanced the torch_assert_close utility in tilelang/utils/tensor.py to include detailed mismatch information, including the left-hand side (LHS) and right-hand side (RHS) tensors.
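As an illustration of the kind of mismatch reporting described above, here is a small pure-Python sketch of a closeness check that prints the offending elements along with both operands. It is not the code in tilelang/utils/tensor.py: the function name `assert_close_verbose`, its parameters, and the message layout are all assumptions, and it works on flat lists rather than tensors. The tolerance rule `|lhs - rhs| <= atol + rtol * |rhs|` is the one used by `torch.allclose`.

```python
from typing import List, Sequence


def assert_close_verbose(
    lhs: Sequence[float],
    rhs: Sequence[float],
    rtol: float = 1e-2,
    atol: float = 1e-2,
    max_report: int = 5,
) -> None:
    """Raise AssertionError with detailed mismatch info (hypothetical helper)."""
    if len(lhs) != len(rhs):
        raise AssertionError(f"length mismatch: {len(lhs)} vs {len(rhs)}")

    # Collect every index whose values differ by more than the tolerance.
    mismatches = [
        (i, a, b)
        for i, (a, b) in enumerate(zip(lhs, rhs))
        if abs(a - b) > atol + rtol * abs(b)
    ]
    if not mismatches:
        return

    lines: List[str] = [
        f"{len(mismatches)} of {len(lhs)} elements mismatch "
        f"(rtol={rtol}, atol={atol})"
    ]
    # Report only the first few mismatches to keep the message readable.
    for i, a, b in mismatches[:max_report]:
        lines.append(f"  index {i}: lhs={a} rhs={b} abs_diff={abs(a - b)}")
    # Include both operands so the failure can be inspected directly.
    lines.append(f"LHS: {list(lhs)}")
    lines.append(f"RHS: {list(rhs)}")
    raise AssertionError("\n".join(lines))


assert_close_verbose([1.0, 3.0, 6.0], [1.0, 3.001, 6.0])  # passes silently
```

Printing both operands in the failure message makes it easier to tell whether a kernel is wrong everywhere (for example, scanning along the wrong dimension) or only at tile boundaries.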

* Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
* Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
* Created tests for CumSum functionality in shared memory and fragment contexts.
* Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
* Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
@LeiWang1999 LeiWang1999 merged commit 5e86968 into tile-ai:main Apr 22, 2025
2 of 3 checks passed
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request Jul 18, 2025
* [Feature] Implement CumSum operation in TileLang

* lint fix
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request Jul 20, 2025