[TIR] Implement API for padded layout transformations #12720
Conversation
For producer blocks that iterate over the pre-transformation shape, rewrite to iterate over the post-transformation shape, with `tir::if_then_else` to handle writing to indices corresponding to padding/non-padding.
Unless specifically testing opaque blocks, all unit tests for the transform layout scheduling primitive now operate on non-opaque blocks.
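As a hedged sketch of the intended usage (the buffer shapes, block name, and pad value below are illustrative, not taken from the PR's tests), transforming the write buffer of a producer block with a `pad_value` should produce a producer loop over the padded post-transformation shape, guarded by `tir::if_then_else`:

```python
import tvm
from tvm import te, tir

# Illustrative producer: block "B" writes a 14-element buffer.
A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

# Transform B's layout from (14,) to a padded (4, 4) layout.  Per the PR
# description, the producer loop is expected to be rewritten over the
# post-transform (4, 4) shape, with tir.if_then_else selecting between the
# computed value and the pad value 0.0 for the two padded elements.
sch.transform_layout(sch.get_block("B"), ("write", 0),
                     lambda i: [i // 4, i % 4], pad_value=0.0)
print(sch.mod.script())
```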
If an IndexMap or Callable, the transformation is the
value to be present in the padding in terms of the
transformed index.
The C++ side only accepts `Optional[PrimExpr]`; it seems this is not supported?
Good point. I had been thinking of it as the `(const Array<Var>&, const Array<PrimExpr>&)` call signature on the TE side for the transformation, and was avoiding introducing additional structures. I had forgotten that the TIR schedule accepts an `IndexMap` for the transformation, and agree that the C++ side would be better expressed as an `Optional<IndexMap>` instead.
Updates have been made to pass `Optional<IndexMap> pad_value` throughout the C++ API, mimicking how `IndexMap index_map` is passed, along with a unit test to validate the functionality.
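For reference, a hypothetical snippet of the Python-side call (setup reused from the earlier sketch), showing `pad_value` given as a callable over the transformed indices; per the docstring quoted at the top of this thread, scalars, `PrimExpr`s, callables, and `IndexMap`s are all accepted, and they are normalized into the single `Optional<IndexMap> pad_value` argument on the C++ side:

```python
import tvm
from tvm import te, tir

A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))
block = sch.get_block("B")

# pad_value as a callable over the *transformed* indices (io, ii); like the
# scalar form (pad_value=0.0), it is converted to an IndexMap on the Python
# side before reaching the C++ implementation.
sch.transform_layout(
    block,
    ("write", 0),
    lambda i: [i // 4, i % 4],
    pad_value=lambda io, ii: tir.const(0, "float32"),
)
```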
std::vector<WriteInfo> write_info_;
std::vector<For> active_loops_;
std::unordered_map<const VarNode*, std::pair<size_t, size_t>> loop_depth_lookup_;
std::unordered_map<const VarNode*, PrimExpr> active_let_bindings_;
Optional<BlockRealize> innermost_block_realize_{NullOpt};
Document these fields.
Thank you, and documentation has been added here for the member variables, along with how they are used when collecting `WriteInfo`.
#include "../../../arith/ir_mutator_with_analyzer.h"
#include "../utils.h"

namespace tvm {
namespace tir {

class LayoutTransformPlanner : private StmtExprVisitor {
Document the high-level algorithm.
Thank you, and documentation has been added here for the general algorithm and for when each handling of padding may be used. It specifically calls attention to how `pad_value` interacts with input buffers: correctness depends on the calling scope providing the specified `pad_value`.
The previous name `LayoutTransformPlanner` didn't follow the pattern of `TransformLayoutWriter`, so it has been renamed to `TransformLayoutPlanner`.
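As a rough illustration of that caveat (shapes and names again made up), when the transformed buffer is only read by the block, the schedule cannot itself write the padding, so the declared `pad_value` is a promise that the calling scope must uphold:

```python
import tvm
from tvm import te, tir

# A is an external input that block "B" only reads.
A = te.placeholder((14,), name="A", dtype="float32")
B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
sch = tir.Schedule(te.create_prim_func([A, B]))

# Transforming the layout of the *read* buffer: the schedule cannot insert
# writes of the pad value here, so the transformation only records that the
# padded elements of the (4, 4) layout hold 0.0.  Correctness relies on the
# calling scope actually providing that value in the padding.
sch.transform_layout(sch.get_block("B"), ("read", 0),
                     lambda i: [i // 4, i % 4], pad_value=0.0)
```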
Looks like the final failing unit test is due to an incorrect mapping in `shared_32x16_to_ldmatrix_32x16_layout`:
@@ -36,7 +36,7 @@ def shared_16x32_to_ldmatrix_32x16_layout(i, j):


 def shared_32x16_to_ldmatrix_32x16_layout(i, j):
-    thread_id = (i % 4) + 4 * (j % 8)
+    thread_id = (i % 16) // 4 + 4 * (j % 8)
cc @masahi
Thank you for tagging @masahi; I had forgotten to do so. I think I have it set up correctly, based on NVIDIA documentation and similarity to the (16, 32) shape, but I couldn't verify it definitively.
Hmm, I think the original mapping is correct; this is from p. 34 of the slides: https://developer.download.nvidia.com/video/gputechconf/gtc/2020/presentations/s21745-developing-cuda-kernels-to-push-tensor-cores-to-the-absolute-limit-on-nvidia-a100.pdf
Sorry, I don't remember the details.
Ah sorry, I was talking about `shared_16x16_to_ldmatrix_32x8_layout`. I need to remember how I came up with `shared_32x16_to_ldmatrix_32x16_layout`; I think it is used for int8 MMA.
Even if the index map is incorrect, it doesn't affect the correctness of tensorized MMA, since the index map is only used for pattern-matching purposes...
Thank you for looking into it! I wasn't able to find any tests that explicitly validate the transform (e.g. use the transform to generate data in a specific layout, then pass it through the MMA), as all the tests either started with transformed data, only used the 16x16 shape, or replaced everything with the tensor intrinsic.
I had put together this standalone test to convince myself of it. The main issue with the current index map is that it doesn't map to unique locations (512 input indices map to 128 output indices). This only arose as an issue in this PR because it generates the inverse in order to determine whether/where padding is required.
The previous version mapped the 512 input indices of a `(32, 16)` array to only 128 output indices. This wasn't caught before, because the bijectivity assertion was only triggered for TE schedules.
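To make the non-bijectivity concrete, here is a small standalone check (not the test referenced above); the second output coordinate `(j // 8) * 8 + (i // 16) * 4 + i % 4` is assumed from the surrounding TVM code, and only the `thread_id` expression differs between the old and new versions:

```python
def old_layout(i, j):
    thread_id = (i % 4) + 4 * (j % 8)
    return thread_id, (j // 8) * 8 + (i // 16) * 4 + i % 4

def new_layout(i, j):
    thread_id = (i % 16) // 4 + 4 * (j % 8)
    return thread_id, (j // 8) * 8 + (i // 16) * 4 + i % 4

inputs = [(i, j) for i in range(32) for j in range(16)]
print(len({old_layout(i, j) for i, j in inputs}))  # 128 -> not invertible
print(len({new_layout(i, j) for i, j in inputs}))  # 512 -> one-to-one, invertible
```

With the old `thread_id`, the output pair depends only on `(i % 4, j % 8, i // 16, j // 8)`, which is why exactly 4 × 8 × 2 × 2 = 128 distinct locations appear for the 512 inputs.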
Force-pushed from ed2b141 to efb25ac.
Overall LGTM, just some comments
)

try:
    iter(mapping)
What's the use case for this? According to the docs, the mapping function should return a list; the documentation might also need an update.
This was to allow the mapping function to return a single `PrimExpr`, or something that the FFI can convert into a `PrimExpr`. Since it wouldn't make sense for the pad value to provide multiple outputs, I found myself frequently writing `lambda i, j: i + j` instead of `lambda i, j: [i + j]`. I figured that since I was frequently making that mistake, later users would likely make it as well, so it would be best to support that functionality.
Good call on the documentation; I'll update the documentation for `from_func` and `from_func_with_separators` accordingly.
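A minimal sketch of that normalization (the helper name below is illustrative; in TVM the `iter(mapping)` check sits in the `IndexMap.from_func*` machinery): a mapping or pad-value function may return either a list of expressions or a single object convertible to `PrimExpr`, and a non-iterable result is simply wrapped in a list:

```python
import tvm
from tvm import tir

i, j = tir.Var("i", "int32"), tir.Var("j", "int32")

def normalize(result):
    """Mirror the iter(mapping) check: wrap a lone expression in a list."""
    try:
        iter(result)
    except TypeError:
        result = [result]
    return [tvm.runtime.convert(r) for r in result]

print(normalize(i + j))    # what `lambda i, j: i + j` returns  -> [i + j]
print(normalize([i + j]))  # what `lambda i, j: [i + j]` returns -> [i + j]
```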
@@ -2479,6 +2480,31 @@ def transform_layout(
            primitive will be called in addition to the
            TransformLayout primitive.

        pad_value: Optional[Union[int, float, PrimExpr, IndexMap, Callable]]
Document the assumption when `pad_value` is an `IndexMap`. I remember that in the RFC we assumed it should contain no `BufferLoad` from any buffer other than the current buffer.
Thank you, and the docstring has been updated. I've also added two unit tests: one that validates that an error is raised when the pad value loads from a different buffer, and one that specifies the intended behavior for a pad value that loads from the transformed buffer. The latter is currently marked with `pytest.mark.xfail`, as the support isn't implemented yet.
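A rough sketch of what those two tests might look like (buffer names, shapes, and the exact error type are placeholders rather than copies from the PR):

```python
import pytest
import tvm
from tvm import te, tir

def _make_schedule():
    A = te.placeholder((14,), name="A", dtype="float32")
    B = te.compute((14,), lambda i: A[i] * 2.0, name="B")
    return tir.Schedule(te.create_prim_func([A, B]))

def test_pad_value_may_not_load_other_buffer():
    # Loading the pad value from an unrelated buffer should be rejected.
    sch = _make_schedule()
    other = tir.decl_buffer((4, 4), "float32", name="other")
    with pytest.raises(tvm.TVMError):
        sch.transform_layout(
            sch.get_block("B"), ("write", 0), lambda i: [i // 4, i % 4],
            pad_value=lambda io, ii: tir.BufferLoad(other, [io, ii]),
        )

@pytest.mark.xfail(reason="pad_value loading from the transformed buffer is not supported yet")
def test_pad_value_may_load_transformed_buffer():
    # Intended behavior: the pad value may refer to the transformed buffer itself.
    sch = _make_schedule()
    block = sch.get_block("B")
    out_buffer = sch.get(block).writes[0].buffer  # the buffer being transformed
    sch.transform_layout(
        block, ("write", 0), lambda i: [i // 4, i % 4],
        pad_value=lambda io, ii: tir.BufferLoad(out_buffer, [io, ii]),
    )
```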
@Hzfengsy Can you review/verify that the requested changes (use non-opaque blocks in unit tests) have been made? I think that's the only item remaining on the PR.
Implementation of API in `tvm.tir.schedule` for layout transformations with padding, as part of apache#12261, item "Insert pad value into generated TIR, using `tir::if_then_else`, `builtin::assume`, and `builtin::undef`". Following the RFC discussion in apache/tvm-rfcs#77 (comment) and apache/tvm-rfcs#77 (comment), this commit preferentially rewrites the loops that surround a padded transformation where possible, in order to express padding in terms of `tir::if_then_else`.
cc @Hzfengsy @junrushao1994