Skip to content

Conversation

@hanhanW
Copy link
Contributor

@hanhanW hanhanW commented Oct 25, 2022

Idea

The main issue is about incomplete tile. Since all the dimensions are orthogonal, discussing 1-d unpack case is enough. The core idea is to make the input slice have complete tiles. In this case, a larger unpacked tile will be created. We'll need an extract_slice op to shift and truncate the output.

Example

Let's take Nn_to_N as an example. Say that N=32, n=8, and tiling_size=15. The coordinates of second tile (i.e., result[15..31]) are [(1, 7), (2, 0,), (2, 1) ... (3, 6), (3, 7)]. The first row and the last row are incomplete in terms of inputs. It's impossible to represent an unpack op using the coordinates. Because the input has higher rank and the math computation of coordinate is using mod and ceilDiv. That's very tricky.

To represent the unpack op, we have to complete the rows. I.e., the input coordinates would start with (1, 0); end with (3, 7). In this context, the tiled unpack produces a (3 * n) elements because there are 3 rows in total. Follow by a tensor.extract_slice op, we can get the actual result.

The PR relaxes the condition in tiling algorithm because two operations are generated when tiling a unpack op. Since the tiling implementation is using filter, all the generated ops should apply the filter. Otherwise, it runs into infinite loops. (Because the filter is not applied to tiled unpack op.)

\# Idea

The main issue is about incomplete tile. Since all the dimensions are orthogonal, discussing 1-d unpack case is enough. The core idea is to make the input slice have complete tiles. In this case, a larger unpacked tile will be created. We'll need an extract_slice op to shift and truncate the output.

\# Example

Let's take Nn_to_N as an example. Say that N=32, n=8, and tiling_size=15. The coordinates of second tile (i.e., `result[15..31]`) are `[(1, 7), (2, 0,), (2, 1) ... (3, 6), (3, 7)]`. The first row and the last row are incomplete in terms of inputs. It's impossible to represent an unpack op using the coordinates. Because the input has higher rank and the math computation of coordinate is using mod and ceilDiv. That's very tricky.

To represent the unpack op, we have to complete the rows. I.e., the input coordinates would start with `(1, 0)`; end with `(3, 7)`. In this context, the tiled unpack produces a (3 * n) elements because there are 3 rows in total. Follow by a tensor.extract_slice op, we can get the actual result.

The PR relaxes the condition in tiling algorithm because two operations
are generated when tiling a unpack op. Since the tiling implementation
is using filter, all the generated ops should apply the filter.
Otherwise, it runs into infinite loops. (Because the filter is not
applied to tiled unpack op.)
@hanhanW
Copy link
Contributor Author

hanhanW commented Oct 25, 2022

@chelini this is the implementation of tiling unpack op, PTAL.

Copy link
Collaborator

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ill review it in a bit, but FYI, the Tiling implementation here is only used for testing. The actual tiling used in IREE is implemented upstream as scf::tileUsingSCFForOp.

@hanhanW
Copy link
Contributor Author

hanhanW commented Oct 25, 2022

I did not consider outer_dim_perms in this PR, I'll think about it and update it later. It should be a minor change like #10907

@hanhanW
Copy link
Contributor Author

hanhanW commented Oct 25, 2022

I added the support for outer_dim_perms as well. I'm quite sure they are correct when writing the lit test. The dimensions map to each other correctly. :-)

struct TiledOp {
/// Tiled op.
Operation *op;
/// Tiled operations that are created during tilng.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: tilng -> tiling

UnPackOp::getTiledImplementation(OpBuilder &builder,
ArrayRef<OpFoldResult> offsets,
ArrayRef<OpFoldResult> sizes) {
if (!hasTensorSemantics())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to be at the tensor level? It is because the op needs an output tensor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output of tiled unpack is larger than tiling size because we have to handle incomplete tile. Restricting it at tensor addresses the issue. There are a couple reasons lead me to adding the restriction.

  1. The upstream version will be at tensor dialect, so having it works at tensor level SGTM.
  2. We'll need a larger buffer to store temp result, and copy the data from temp buffer to destination. We can not reuse the destination buffer because the producing output has more data. Vectorization could potentially address the extra buffer issue. Because everything is vector type, they are going to be stored in registers.
  3. In IREE's pipeline, we apply the tiling optimization and the vectorization before bufferization. That makes the world easier. Having it works at tensors is good enough for IREE.

We're able to do tiling at memref level, but that introduces buffer allocation. It requires the users to understand that vectorization could remove the allocation. That's too many details for users. Since there are no needs on IREE side, I restrict it at tensor level for now. I'm happy to extend it if there are needs. I'll add a comment for it!

Copy link
Contributor

@chelini chelini Oct 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@chelini
Copy link
Contributor

chelini commented Oct 26, 2022

@chelini this is the implementation of tiling unpack op, PTAL.

Thanks a lot for the PR!

Copy link
Collaborator

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is OK for now, but upstream TilingInterface might need some changes too?

Copy link
Collaborator

@MaheshRavishankar MaheshRavishankar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually clicked approve by mistake. Requesting changes for not materializing the arith.constant .. : index and affine.apply ops.

@hanhanW
Copy link
Contributor Author

hanhanW commented Oct 26, 2022

This is OK for now, but upstream TilingInterface might need some changes too?

I'm not pretty sure if upstream version need some changes or not. My prototype works e2e in #10823 In that PR, I don't modify upstream codes. This PR is enabling the tiling within LinalgExt scope. I'll take look at it when connecting things altogether.

@hanhanW hanhanW merged commit 72446a0 into iree-org:main Oct 27, 2022
@hanhanW hanhanW deleted the tile-unpack-2 branch October 27, 2022 17:36
PhaneeshB pushed a commit to PhaneeshB/iree that referenced this pull request Oct 31, 2022
# Idea

The main issue is about incomplete tile. Since all the dimensions are
orthogonal, discussing 1-d unpack case is enough. The core idea is to
make the input slice have complete tiles. In this case, a larger
unpacked tile will be created. We'll need an extract_slice op to shift
and truncate the output.

# Example

Let's take Nn_to_N as an example. Say that N=32, n=8, and
tiling_size=15. The coordinates of second tile (i.e., `result[15..31]`)
are `[(1, 7), (2, 0,), (2, 1) ... (3, 6), (3, 7)]`. The first row and
the last row are incomplete in terms of inputs. It's impossible to
represent an unpack op using the coordinates. Because the input has
higher rank and the math computation of coordinate is using mod and
ceilDiv. That's very tricky.

To represent the unpack op, we have to complete the rows. I.e., the
input coordinates would start with `(1, 0)`; end with `(3, 7)`. In this
context, the tiled unpack produces a (3 * n) elements because there are
3 rows in total. Follow by a tensor.extract_slice op, we can get the
actual result.

The PR relaxes the condition in tiling algorithm because two operations
are generated when tiling a unpack op. Since the tiling implementation
is using filter, all the generated ops should apply the filter.
Otherwise, it runs into infinite loops. (Because the filter is not
applied to tiled unpack op.)
hanhanW added a commit to hanhanW/iree that referenced this pull request Nov 1, 2022
# Idea

The main issue is about incomplete tile. Since all the dimensions are
orthogonal, discussing 1-d unpack case is enough. The core idea is to
make the input slice have complete tiles. In this case, a larger
unpacked tile will be created. We'll need an extract_slice op to shift
and truncate the output.

# Example

Let's take Nn_to_N as an example. Say that N=32, n=8, and
tiling_size=15. The coordinates of second tile (i.e., `result[15..31]`)
are `[(1, 7), (2, 0,), (2, 1) ... (3, 6), (3, 7)]`. The first row and
the last row are incomplete in terms of inputs. It's impossible to
represent an unpack op using the coordinates. Because the input has
higher rank and the math computation of coordinate is using mod and
ceilDiv. That's very tricky.

To represent the unpack op, we have to complete the rows. I.e., the
input coordinates would start with `(1, 0)`; end with `(3, 7)`. In this
context, the tiled unpack produces a (3 * n) elements because there are
3 rows in total. Follow by a tensor.extract_slice op, we can get the
actual result.

The PR relaxes the condition in tiling algorithm because two operations
are generated when tiling a unpack op. Since the tiling implementation
is using filter, all the generated ops should apply the filter.
Otherwise, it runs into infinite loops. (Because the filter is not
applied to tiled unpack op.)
hanhanW added a commit to llvm/llvm-project that referenced this pull request Nov 4, 2022
Some operations need to generate multiple operations when implementing
the tiling interface. Here is a sound example in IREE, see
iree-org/iree#10905 for more details.

Reviewed By: mravishankar

Differential Revision: https://reviews.llvm.org/D137300
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants