[TIR] Implement API for padded layout transformations #12720
Conversation
For producer blocks that iterate over the pre-transformation shape, rewrite to iterate over the post-transformation shape, with `tir::if_then_else` to handle writing to indices corresponding to padding/non-padding.
Unless specifically testing opaque blocks, all unit tests for the transform layout scheduling primitive now operate on non-opaque blocks.
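As a hedged sketch of the intended usage (the buffer shapes, block name, and pad value below are illustrative, not taken from the PR's tests), transforming the write buffer of a producer block with a `pad_value` should produce a producer loop over the padded post-transformation shape, guarded by `tir::if_then_else`:

```python
import tvm
from tvm import te, tir

# Illustrative producer: block "B" writes a 14-element buffer.
A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

# Transform B's layout from (14,) to a padded (4, 4) layout.  Per the PR
# description, the producer loop is expected to be rewritten over the
# post-transform (4, 4) shape, with tir.if_then_else selecting between the
# computed value and the pad value 0.0 for the two padded elements.
sch.transform_layout(sch.get_block("B"), ("write", 0),
                     lambda i: [i // 4, i % 4], pad_value=0.0)
print(sch.mod.script())
```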
If an IndexMap or Callable, the transformation is the
value to be present in the padding in terms of the
transformed index.
The C++ side only accepts `Optional[PrimExpr]`; it seems this is not supported?
Good point. I had been thinking of it as the `(const Array<Var>&, const Array<PrimExpr>&)` call signature on the TE side for the transformation, and was avoiding introducing additional structures. I had forgotten that the TIR schedule accepts an `IndexMap` for the transformation, and agree that the C++ side would be better expressed as an `Optional<IndexMap>` instead.
Updates have been made to pass `Optional<IndexMap> pad_value` throughout the C++ API, mimicking how `IndexMap index_map` is passed, along with a unit test to validate the functionality.
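For reference, a hypothetical snippet of the Python-side call (setup reused from the earlier sketch), showing `pad_value` given as a callable over the transformed indices; per the docstring quoted at the top of this thread, scalars, `PrimExpr`s, callables, and `IndexMap`s are all accepted, and they are normalized into the single `Optional<IndexMap> pad_value` argument on the C++ side:

```python
import tvm
from tvm import te, tir

A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))
block = sch.get_block("B")

# pad_value as a callable over the *transformed* indices (io, ii); like the
# scalar form (pad_value=0.0), it is converted to an IndexMap on the Python
# side before reaching the C++ implementation.
sch.transform_layout(
    block,
    ("write", 0),
    lambda i: [i // 4, i % 4],
    pad_value=lambda io, ii: tir.const(0, "float32"),
)
```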
std::vector<WriteInfo> write_info_;
std::vector<For> active_loops_;
std::unordered_map<const VarNode*, std::pair<size_t, size_t>> loop_depth_lookup_;
std::unordered_map<const VarNode*, PrimExpr> active_let_bindings_;
Optional<BlockRealize> innermost_block_realize_{NullOpt};
Document these fields.
Thank you, and documentation has been added here for the member variables, along with how they are used when collecting `WriteInfo`.
#include "../../../arith/ir_mutator_with_analyzer.h"
#include "../utils.h"

namespace tvm {
namespace tir {

class LayoutTransformPlanner : private StmtExprVisitor {
Document the high-level algorithm.
Thank you, and documentation has been added here for the general algorithm and for when each handling of padding may be used. It specifically calls attention to how `pad_value` interacts with input buffers: correctness depends on the calling scope providing the specified `pad_value`.
The previous name `LayoutTransformPlanner` didn't follow the pattern of `TransformLayoutWriter`, so it has been renamed to `TransformLayoutPlanner`.
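As a rough illustration of that caveat (shapes and names again made up), when the transformed buffer is only read by the block, the schedule cannot itself write the padding, so the declared `pad_value` is a promise that the calling scope must uphold:

```python
import tvm
from tvm import te, tir

# A is an external input that block "B" only reads.
A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

# Transforming the layout of the *read* buffer: the schedule cannot insert
# writes of the pad value here, so the transformation only records that the
# padded elements of the (4, 4) layout hold 0.0.  Correctness relies on the
# calling scope actually providing that value in the padding.
sch.transform_layout(sch.get_block("B"), ("read", 0),
                     lambda i: [i // 4, i % 4], pad_value=0.0)
```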
Looks like the final failing unit test is due to an incorrect mapping in `shared_32x16_to_ldmatrix_32x16_layout`:
@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):


 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)
cc @masahi
Thank you for tagging @masahi; I had forgotten to do so. I think I have it set up correctly, based on NVIDIA documentation and similarity to the (16, 32) shape, but I couldn't verify it definitively.
Hmm, I think the original mapping is correct; this is from p. 34 of the slides: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf
Sorry, I don't remember the details.
Ah sorry, I was talking about `shared_16x16_to_ldmatrix_32x8_layout`. I need to remember how I came up with `shared_32x16_to_ldmatrix_32x16_layout`; I think it is used for int8 MMA.
Even if the index map is incorrect, it doesn't affect the correctness of tensorized MMA, since the index map is only used for pattern-matching purposes...
Thank you for looking into it! I wasn't able to find any tests that explicitly validate the transform (e.g. use the transform to generate data in a specific layout, then pass it through the MMA), as all the tests either started with transformed data, only used the 16x16 shape, or replaced everything with the tensor intrinsic.
I had put together this standalone test to convince myself of it. The main issue with the current index map is that it doesn't map to unique locations (512 input indices map to 128 output indices). This only arose as an issue in this PR because it generates the inverse in order to determine whether/where padding is required.
The previous version mapped the 512 input indices of a `(32, 16)` array to only 128 output indices. This wasn't caught before, because the bijectivity assertion was only triggered for TE schedules.
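To make the non-bijectivity concrete, here is a small standalone check (not the test referenced above); the second output coordinate `(j // 8) * 8 + (i // 16) * 4 + i % 4` is assumed from the surrounding TVM code, and only the `thread_id` expression differs between the old and new versions:

```python
def old_layout(i, j):
    thread_id = (i % 4) + 4 * (j % 8)
    return thread_id, (j // 8) * 8 + (i // 16) * 4 + i % 4

def new_layout(i, j):
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    return thread_id, (j // 8) * 8 + (i // 16) * 4 + i % 4

inputs = [(i, j) for i in range(32) for j in range(16)]
print(len({old_layout(i, j) for i, j in inputs}))  # 128 -> not invertible
print(len({new_layout(i, j) for i, j in inputs}))  # 512 -> one-to-one, invertible
```

With the old `thread_id`, the output pair depends only on `(i % 4, j % 8, i // 16, j // 8)`, which is why exactly 4 × 8 × 2 × 2 = 128 distinct locations appear for the 512 inputs.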
Force-pushed from ed2b141 to efb25ac.
Overall LGTM, just some comments
)

try:
    iter(mapping)
What's the use case for this? According to the docs, the mapping function should return a list; the documentation might also need an update.
This was to allow the mapping function to return a single `PrimExpr`, or something that the FFI can convert into a `PrimExpr`. Since it wouldn't make sense for the pad value to provide multiple outputs, I found myself frequently writing `lambda i, j: i + j` instead of `lambda i, j: [i + j]`. I figured that since I was frequently making that mistake, later users would likely make it as well, so it would be best to support that functionality.
Good call on the documentation; I'll update the documentation for `from_func` and `from_func_with_separators` accordingly.
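A minimal sketch of that normalization (the helper name below is illustrative; in TVM the `iter(mapping)` check sits in the `IndexMap.from_func*` machinery): a mapping or pad-value function may return either a list of expressions or a single object convertible to `PrimExpr`, and a non-iterable result is simply wrapped in a list:

```python
import tvm
from tvm import tir

i, j = tir.Var("i", "int32"), tir.Var("j", "int32")

def normalize(result):
    """Mirror the iter(mapping) check: wrap a lone expression in a list."""
    try:
        iter(result)
    except TypeError:
        result = [result]
    return [tvm.runtime.convert(r) for r in result]

print(normalize(i + j))    # what `lambda i, j: i + j` returns  -> [i + j]
print(normalize([i + j]))  # what `lambda i, j: [i + j]` returns -> [i + j]
```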
@@ -2479,6 +2480,31 @@ def transform_layout(
            primitive will be called in addition to the
            TransformLayout primitive.

        pad_value: Optional[Union[int, float, PrimExpr, IndexMap, Callable]]
Document the assumption when `pad_value` is an `IndexMap`. I remember that in the RFC we assumed it should contain no `BufferLoad` from any buffer other than the current buffer.
Thank you, and the docstring has been updated. I've also added two unit tests: one that validates that an error is raised when the pad value loads from a different buffer, and one that specifies the intended behavior for a pad value that loads from the transformed buffer. The latter is currently marked with `pytest.mark.xfail`, as the support isn't implemented yet.
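A rough sketch of what those two tests might look like (buffer names, shapes, and the exact error type are placeholders rather than copies from the PR):

```python
import pytest
import tvm
from tvm import te, tir

def _make_schedule():
    A = te.placeholder((14,), name="A", dtype="float32")
    B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
    return tir.Schedule(te.create_prim_func([A, B]))

def test_pad_value_may_not_load_other_buffer():
    # Loading the pad value from an unrelated buffer should be rejected.
    sch = _make_schedule()
    other = tir.decl_buffer((4, 4), "float32", name="other")
    with pytest.raises(tvm.TVMError):
        sch.transform_layout(
            sch.get_block("B"), ("write", 0), lambda i: [i // 4, i % 4],
            pad_value=lambda io, ii: tir.BufferLoad(other, [io, ii]),
        )

@pytest.mark.xfail(reason="pad_value loading from the transformed buffer is not supported yet")
def test_pad_value_may_load_transformed_buffer():
    # Intended behavior: the pad value may refer to the transformed buffer itself.
    sch = _make_schedule()
    block = sch.get_block("B")
    out_buffer = sch.get(block).writes[0].buffer  # the buffer being transformed
    sch.transform_layout(
        block, ("write", 0), lambda i: [i // 4, i % 4],
        pad_value=lambda io, ii: tir.BufferLoad(out_buffer, [io, ii]),
    )
```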
@Hzfengsy Can you review/verify that the requested changes (use non-opaque blocks in unit tests) have been made? I think that's the only item remaining on the PR.
Implementation of API in `tvm.tir.schedule` for layout transformations with padding, as part of apache#12261, item "Insert pad value into generated TIR, using `tir::if_then_else`, `builtin::assume`, and `builtin::undef`". Following the RFC discussion in apache/tvm-rfcs#77 (comment) and apache/tvm-rfcs#77 (comment), this commit preferentially rewrites the loops that surround a padded transformation where possible, in order to express padding in terms of `tir::if_then_else`.
cc @Hzfengsy @junrushao1994