Add support for tiling LinalgExt::UnPackOp. #10905
# Idea

The main difficulty is incomplete tiles. Since all the dimensions are orthogonal, discussing the 1-D unpack case is enough. The core idea is to make the input slice consist of complete tiles. As a result, the tiled unpack creates a larger unpacked tile, and a `tensor.extract_slice` op is needed to shift and truncate the output.

# Example

Take Nn_to_N as an example, with N=32, n=8, and tiling_size=15. The coordinates of the second tile (i.e., `result[15..29]`) are `[(1, 7), (2, 0), (2, 1) ... (3, 4), (3, 5)]`. The first row and the last row are incomplete in terms of inputs. It is impossible to represent an unpack op using these coordinates, because the input has a higher rank and the coordinate computation involves mod and ceilDiv, which is very tricky.

To represent the unpack op, we have to complete the rows, i.e., the input coordinates start at `(1, 0)` and end at `(3, 7)`. The tiled unpack then produces (3 * n) elements because there are 3 rows in total. Following it with a `tensor.extract_slice` op gives the actual result.

The PR relaxes a condition in the tiling algorithm because two operations are generated when tiling an unpack op. Since the tiling implementation uses a filter, the filter must be applied to all generated ops. Otherwise, the tiling runs into an infinite loop (because the filter is not applied to the tiled unpack op).
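The index arithmetic behind the example can be sketched in plain C++. This is only an illustration of the 1-D case under the assumptions stated in the description; `UnpackTile1D` and `computeUnpackTile1D` are hypothetical names and are not part of IREE or MLIR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type (not an IREE/MLIR type): describes how one 1-D
// output tile of an Nn_to_N unpack maps onto complete input rows.
struct UnpackTile1D {
  int64_t firstRow;    // first input row touched by the output tile
  int64_t numRows;     // number of complete rows the tiled unpack reads
  int64_t paddedSize;  // elements produced by the tiled unpack (numRows * n)
  int64_t sliceOffset; // offset of the trailing extract_slice
  int64_t sliceSize;   // size of the trailing extract_slice
};

// Computes the row range and slice parameters for the output range
// [offset, offset + size), where n is the inner tile size.
inline UnpackTile1D computeUnpackTile1D(int64_t offset, int64_t size,
                                        int64_t n) {
  assert(offset >= 0 && size > 0 && n > 0);
  const int64_t firstRow = offset / n;            // row of the first element
  const int64_t lastRow = (offset + size - 1) / n; // row of the last element
  const int64_t numRows = lastRow - firstRow + 1;
  // The slice shifts past the unused head of the first row and truncates
  // the unused tail of the last row.
  return {firstRow, numRows, numRows * n, offset % n, size};
}
```

For a tile of size 15 starting at offset 15 with n = 8, this gives rows 1 through 3, a 24-element unpacked tile, and a slice at offset 7 of size 15.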
@chelini this is the implementation of tiling the unpack op, PTAL.
MaheshRavishankar
left a comment
I'll review it in a bit, but FYI, the tiling implementation here is only used for testing. The actual tiling used in IREE is implemented upstream as `scf::tileUsingSCFForOp`.
I did not consider outer_dim_perms in this PR. I'll think about it and update it later. It should be a minor change like #10907.
I added support for outer_dim_perms as well. Writing the lit test made me fairly confident it is correct: the dimensions map to each other correctly. :-)
struct TiledOp {
  /// Tiled op.
  Operation *op;
  /// Tiled operations that are created during tilng.
nit: tilng -> tiling
UnPackOp::getTiledImplementation(OpBuilder &builder,
                                 ArrayRef<OpFoldResult> offsets,
                                 ArrayRef<OpFoldResult> sizes) {
  if (!hasTensorSemantics())
Why do we need to be at the tensor level? Is it because the op needs an output tensor?
The output of the tiled unpack is larger than the tiling size because we have to handle incomplete tiles. Restricting it to tensors addresses the issue. A few reasons led me to add the restriction:
- The upstream version will be in the tensor dialect, so having it work at the tensor level SGTM.
- At the memref level we would need a larger buffer to store the temporary result, and then copy the data from the temporary buffer to the destination. We cannot reuse the destination buffer because the tiled op produces more data than the destination holds. Vectorization could potentially remove the extra buffer, because everything would be a vector type and live in registers.
- In IREE's pipeline, we apply tiling and vectorization before bufferization, which makes things easier. Having it work on tensors is good enough for IREE.
We are able to tile at the memref level, but that introduces a buffer allocation, and users would need to understand that vectorization could remove it. That is too much detail for users. Since there is no need on the IREE side, I restrict it to the tensor level for now. I'm happy to extend it if the need arises. I'll add a comment for it!
Thanks!
Thanks a lot for the PR!
MaheshRavishankar
left a comment
This is OK for now, but upstream TilingInterface might need some changes too?
llvm-external-projects/iree-dialects/lib/Dialect/LinalgExt/IR/LinalgExtOps.cpp
MaheshRavishankar
left a comment
Actually, I clicked approve by mistake. Requesting changes so that the `arith.constant ... : index` and `affine.apply` ops are not materialized.
I'm not sure whether the upstream version needs changes. My prototype works e2e in #10823, and in that PR I don't modify upstream code. This PR enables the tiling within the LinalgExt scope. I'll take a look at it when connecting everything together.
Some operations need to generate multiple operations when implementing the tiling interface. A concrete example is in IREE; see iree-org/iree#10905 for more details.
Reviewed By: mravishankar
Differential Revision: https://reviews.llvm.org/D137300