[Discuss] Relax Layout Transformation #277
Thank you everyone for the discussion yesterday and for bringing up very important points. I'll summarize them below.

Q1. [@Hzfengsy] Would layout planning interfere with fusion? One approach to tackle this would be to pick forward or backward flow based on user choice. Technically, the infra to flow layout rewrites should work regardless of the chosen direction of flow, but I do not have a phase-ordering example where that would be necessary. Can you elaborate on the use case you had in mind where layout planning and fusion would interfere?

Q2. [@Hzfengsy] Can we expose an API at the Relax level to easily transform layouts of PrimFuncs and operators?

Q3. [@slyubomirsky] Can you share an example of how layout rewrites would flow through operators?

Q4. [@jinhongyii] How will you handle layout rewrite flow through the pad operation? |
Given the broad interest, let me add clarification about how the Relax layout planner would work with BYOC. Although there could be better approaches with future improvements to BYOC, in this post I will explain with the current BYOC approach, to focus on delivering the main idea around the interaction between the layout planner and BYOC. For each external library/codegen, the BYOC interface manages a set of specifications about its supported operators and their constraints. Largely, offloading to BYOC happens in several steps.

To best use BYOC, Relay expects users to convert layouts, preferably before the offloading steps above.

In case you want to offload only a part of the graph (although the graph is fully offload-able), you can customize the annotation by selectively adding/removing it. Please note that the annotated operator nodes will be ignored by the internal pipeline, which delegates their optimization & codegen to external components. By doing so, you may offload one part of the graph to BYOC while still leveraging optimization passes like fusion, metaschedule tuning, etc. for the rest. |
Would you be able to use |
@slyubomirsky Yeah, that seems possible. Actually, that is one of the potential improvements discussed offline with @YuchenJin and @psrivas2 as well. |
@psrivas2 I thought that flowing layout through pad would result in different behavior, but it turns out I was wrong. So Q4 is no longer a problem. |
What's the status of this? |
@spectrometerHBH we are implementing this. We will be sending out PRs starting next week for review. |
I'm interested in tackling this problem. I've worked on two BYOC backends recently (DNNL and CUTLASS); both want layouts that are likely different from the input mod (NCHWc or NHWC).
|
@masahi that would be awesome! @sunggg also proposed a direction here. Would be great to get alignment on the approach.
Yes, I have a pending PR locally that adds the layout_transform op. It does not add legalization support though. Let me send it out today. |
Motivation & Goals
Tensor data layout describes how data is laid out in memory. It determines the memory access pattern and can significantly impact performance and memory efficiency. Global layout planning is thus an important optimization for achieving performance on various hardware backends. The goal of this document is to present the design of global layout planning in Relax.
Terminology
To classify operators from a layout perspective, TVM has defined the following terminology. Note that we do not use these terms in this document, but they can be used in discussions.

- Layout agnostic: `relu`, `pow`, etc. These operators are not affected, either functionally or performance-wise, by data layouts.
- Lightly-layout sensitive: `pad`, `concatenate`, reduce ops like `sum`, etc. These operators have some attributes that are functionally affected if we do a layout transformation before them. Performance-wise, however, the difference is not significant. For these operators, it is beneficial to simply adapt to the previous operator's output data layout.
- Heavily-layout sensitive: `conv2d`, `conv2d_transpose`, etc. These operators are heavily affected, both functionally and performance-wise, by data layouts, and they have data layout as an op attribute. For these operators, the performance benefit of a layout transformation outweighs the runtime cost of performing it. Thus, it is beneficial to modify the input data layouts for these operators (if the current data layout is not performant), while the rest of the layout-agnostic and lightly-layout-sensitive operators adapt to the layout governed by the output of the heavily-layout-sensitive operators.

We introduce and clarify the following terms used in this doc.

- Cancellation: when a layout rewrite is immediately followed by its inverse (e.g., `NCHW->NHWC` followed by `NHWC->NCHW`), we can safely eliminate both.

Prior Art
Relay has the ConvertLayout pass, which first applies the user-specified layout to the operations in the Relay graph. For example, the user may specify that they want all convolution operations to have `NHWC` layout. The `ConvertLayout` pass then transforms the IRModule so that user-desired layouts are honored, rewriting the Relay graph with a minimal number of data layout transforms.

It achieves this by requiring each Relay operator to have an `InferCorrectLayout` property, which the pass uses for layout inference. Given the original input layouts and the new input layouts, the `InferCorrectLayout` property tells how the operator needs to be modified to accommodate the new input layouts, and what the new output layouts should be. Layout transforms are inserted where new input layouts differ from incoming layouts. This step is done operator by operator in sequence: the ConvertLayout pass keeps passing the new layouts to the next operator's property, eventually modifying the whole graph operator by operator.

Relay has a long list of `InferCorrectLayout` methods attached to operators. The logic for a number of operators is quite complex, error-prone, and a huge burden to maintain. In some cases, Relay operators cannot be modified to accommodate tiling layout transforms.

Relay also performs the AlterOpLayout pass to conditionally apply additional layout transformations (e.g., the winograd transformation) given the operator, target, etc. As its main usage is more like a secondary optimization applied on top of global layout planning, we will have a separate design doc to account for it.
Relay Example
Let's take a look at an example. The Relay graph below performs a convolution followed by a reduction across the `H` axis.

If we modify the input and output layouts of the convolution operation, `layout_transform` operations are inserted into the graph to maintain correctness. At the same time, the `InferCorrectLayout` property of the `sum` operation modifies the `axis` attribute of the operation from `axis = [2]` to `axis = [1]` to accommodate changing the layout from `NCHW` to `NHWC`.

Global Layout Planning in Relax
In Relax, we perform layout planning in a very different way compared to Relay. In Relay, layout rewrites happen at the Relay operator level (i.e., `relay.nn.split`, `relay.sum`, etc.). In Relax, we apply layout rewrites after lowering Relax operators to TIR. Access to the underlying compute in TIR for an operation makes it much easier to flow layout constraints in Relax.

Here is a reduction operation in `NCHW` layout.

To transform the above operation from `NCHW` to `NHWC` layout, we can simply transform the buffer and block layouts using the `tir.schedule.transform_layout` and `tir.schedule.transform_block_layout` primitives.

Reduce operation in `NHWC` layout.
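As a plain-Python illustration (not the TVM API), a layout rewrite such as `lambda n, c, h, w: (n, h, w, c)` can be viewed as re-keying every buffer element through the index map; the `transform_layout` helper below is purely illustrative:

```python
from itertools import product

# Hedged sketch in plain Python (not the TVM API): a buffer is modeled as a
# dict from index tuples to values, and a layout rewrite re-keys each
# element through the index map.
def transform_layout(buf, shape, index_map):
    # re-key every element of the buffer through the index map
    return {index_map(*idx): buf[idx] for idx in product(*map(range, shape))}

shape = (1, 2, 2, 2)
nchw = {idx: idx for idx in product(*map(range, shape))}   # value == old index
nhwc = transform_layout(nchw, shape, lambda n, c, h, w: (n, h, w, c))
assert nhwc[(0, 1, 1, 0)] == (0, 0, 1, 1)   # element moved, value preserved
```

The real primitives additionally rewrite the loop nest and block iteration domain; this sketch only shows the data movement an index map implies.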
layout.In an ideal scenario, layout planning would figure out the most efficient layout for all operations to minimize the end-to-end cost of graph. However, we are not there yet. For now, we need help either from user or auto tuning system like Metaschedule to identify layout-critical operations in the graph and choose the best layout for them greedily. This would result in a graph where layouts for some operations are frozen (i.e., layouts must not be modified by other passes), and layout rewrite operations added to operands and results edges of these layout-critical operations. Next, an optimization pass would flow these layout rewrites in the graph to reduce the overall cost of such copy operations.
Thus, in Relax we break down the layout planning into two sub problems, which are described separately in the document below.
Tune Layout-Critical Operations & Freeze Layouts
This can be done in conjunction with tuning frameworks like Metaschedule, or with a pass that allows the user to annotate layouts of specific operations. A `HoistLayoutRewritePass` (discussed in the appendix) would be added as part of this proposal to pull the layout rewrites out into separate TIR functions and mark the layout-critical TIR functions with a "frozen layout" attribute.

Minimize Layout Rewrite Cost in Graph
The minimize layout rewrite cost problem can be formulated as follows:

Given a directed acyclic graph G(V, E), where each v $\in$ V is an operation and each e $\in$ E is a data value flowing through G, a subset of operations is marked as having frozen layouts, i.e., the operand & result layouts of those operations cannot be modified. The graph contains layout rewrite operations to satisfy the layout constraints of operations, constants, and graph inputs and outputs. The goal of layout planning is to minimize the cost of the layout rewrite operations in the graph.
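As a toy instance of this formulation (unit cost per rewrite, illustrative op names, not the Relax API), the objective is simply the number of layout rewrite operations remaining in the graph:

```python
# Hedged toy model of the objective above, assuming unit cost per layout
# rewrite. A graph is modeled as a list of op names; all names are
# illustrative, not Relax operators.
def rewrite_cost(graph):
    # count the layout rewrite operations still present in the graph
    return sum(op.startswith("layout_rewrite") for op in graph)

before = ["conv2d#frozen", "layout_rewrite#to_NCHW", "add",
          "layout_rewrite#to_NCHWc", "conv2d#frozen"]
after = ["conv2d#frozen", "add", "conv2d#frozen"]  # rewrites flowed & folded
assert rewrite_cost(before) == 2 and rewrite_cost(after) == 0
```

In practice the cost would be weighted by tensor size and target-specific copy cost rather than a unit count.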
Layout Representation
The layout rewrites in Relax can be represented in either of the following two ways.

PrimFunc Representation. Layout rewrites can be represented as TIR functions in the IRModule. This has the benefit of easy serialization/deserialization of these constraints without introducing any new structure in the IR. For example, the following is a layout rewrite in Relax. Note that not all TIR functions that do a spatial transformation of a buffer are candidates that can flow through operations and be optimized away; only the subset of such PrimFuncs whose transformation (or its inverse) can be represented by a compact IndexMap representation can flow and be optimized. IndexMap is a nice compact representation for expressing layout rewrites, and it also helps to use the same layout representation used by the `tir.schedule.transform_layout` API.
Compact Layout Representation: While the PrimFunc representation is very useful for representing layout rewrites in general, it could be difficult to flow/fuse them in their general form. To flow layout rewrites through an operation, a compact representation of the layout rewrite is very useful, as it can be applied directly to the operand/result buffers. The compact representation can (1) make it easier to fuse/fold/cancel layout rewrites, and (2) be easily transformed into `tir.schedule` primitives to actually modify the layouts of TIR buffers. We propose to represent the compact form of layout rewrites through Relax operations.

- `relax.layout_transform(input: Tensor, index_map: lambda)`: Maps the input to a new iteration space. The `index_map` defines the mapping function. This is a pure layout transform, i.e., `index_map` is a bijective function.
- `relax.pad(input: Tensor, pad_width: Tensor, pad_value: scalar)`: Inserts `pad_value` at the given `pad_width` locations. `pad_width` is an integer tensor with shape `[n, 2]`, where `n` is the rank of `input`. For each dimension `d` of `input`, `pad_width[d, 0]` indicates how many values to add before the contents of `input` in that dimension, and `pad_width[d, 1]` indicates how many values to add after the contents of `input` in that dimension.
- `relax.crop(input: Tensor, start_indices, slice_sizes, cropped_value)`: Crops the tensor as specified by `start_indices` and `slice_sizes`. The optional `cropped_value` is a hint to the compiler about the values stored in the `input` tensor regions that were cropped away. This is useful information if the compiler intends to cancel this `relax.crop` with a following `relax.pad`.
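To illustrate these compact ops and the folding they enable, here is a hedged plain-Python sketch (1-D lists and Python lambdas stand in for tensors and index maps; none of this is the Relax API):

```python
from itertools import product

# Hedged sketch: layout_transform is modeled as a bijective index map on
# index tuples; pad/crop are modeled on 1-D lists for brevity.
def compose(f, g):
    """Fold two layout_transform index maps into one."""
    return lambda *idx: g(*f(*idx))

to_nhwc = lambda n, c, h, w: (n, h, w, c)   # NCHW -> NHWC
to_nchw = lambda n, h, w, c: (n, c, h, w)   # NHWC -> NCHW (inverse)

# Folding a rewrite with its inverse yields the identity map.
folded = compose(to_nhwc, to_nchw)
assert all(folded(*i) == i
           for i in product(range(2), range(3), range(2), range(2)))

def pad(xs, before, after, pad_value):
    return [pad_value] * before + list(xs) + [pad_value] * after

def crop(xs, start, size):
    return xs[start:start + size]

# A crop can always be folded into a prior pad that it undoes.
x = [1, 2, 3]
assert crop(pad(x, 1, 2, 0), 1, len(x)) == x
```

The identity check here is by enumeration over a concrete shape; a real pass would fold the index maps symbolically.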
It should be easy to round-trip between the PrimFunc representation and the compact Relax operation representation of layout rewrites. There are some valid questions/concerns that we aim to address on the choice of the compact representation of layout rewrites:
Since `relax.layout_transform` only supports bijective transformations, would general PrimFunc layout rewrites (with implicit padding/cropping) have to be broken into smaller primitive TIR blocks in the input IRModule (i.e., does the layout planner expect prior passes to break general PrimFunc layout rewrites into smaller TIR blocks)?

Answer: No. The input IRModule to the layout planning pass can use the general PrimFunc representation. We can break such PrimFunc blocks into primitive pad/crop and bijective layout rewrites upon conversion to the compact form, so prior passes are not constrained to produce only bijective layout rewrite/pad/crop TIR blocks. Furthermore, layout rewrite TIR blocks within a PrimFunc could be annotated with compact representations (perhaps using the `IndexMap` representation); `HoistLayoutRewritePass` could then use these annotations instead of recovering the compact form from the TIR block.

Is it necessary to break the layout rewrites into these primitive Relax operations (bijective transformation, pad, and crop) for easier cancellation? Can we instead fuse two layout rewrites in the PrimFunc representation and prove that the result is the identity?

Answer: Probably not. If we had an analysis in TIR that could prove a PrimFunc is the identity for most of our use cases, that would lower the need for these primitive representations. A compact representation (a `relax.layout_transform` that supports non-bijective rewrites) would still be needed, but we would not need to break general layout rewrites down into these primitives.

Example
To minimize the cost of layout rewrites in the graph, a pass can flow them across operations until they can be fused into a constant, cancel out, or fuse with other layout rewrites. For instance, the following graph (which we will use as a running example) is a `conv-->add-->conv` graph. The convolution operations have "frozen layouts" (marked in orange). The layout rewrite operations are marked in green.

The number of layout rewrites can be reduced by flowing the `to NCHWc` layout rewrite across the `add` operation from result to operands, and then simplifying the graph by folding adjacent layout rewrites (in this case into an identity) and fusing the layout rewrite into a constant.

Folding/Fusing Layout Rewrites
The layout rewrites can be reduced through the following mechanisms:

- Adjacent `relax.layout_transform` operations are folded by composing their index maps.
- A `relax.pad` operation is folded with a `relax.crop` operation. Folding these two ops is a bit subtle and must obey the following rules, assuming the indices being padded and cropped in the two operations are the same:
  - A `relax.crop` operation can always be folded into a prior `relax.pad` operation.
  - A `relax.pad` operation can only be folded into a prior `relax.crop` operation if all the cropped values are `T.undef` or the same as the values being padded.
- A layout rewrite is folded into a constant, the way the `to NCHWc` layout rewrite operation was folded into the `bias` constant in the above example.

Direction of Flowing Layout Rewrites
In order to facilitate folding layout rewrites into other layout rewrites and constants, we intend to be able to flow layout rewrites across operations. In the example above, the layout rewrite after the `add` operation was flowed through it (from result to operands), allowing it to be folded into an inverse layout rewrite and a constant.

The flow of layout rewrites presents us with a design choice on the direction of flow. Two options were considered: flowing rewrites from results to operands (F1) and from operands to results (F2). One argument for F1 is that flowing through operations with multiple outputs (e.g., `split`) would not mess up the access patterns of the output buffers.

Acknowledging that neither of the two choices is strictly better than the other, the arguments for F1 seem more useful for a large class of operations. For that reason, we prefer F1.
Flowing Layout Rewrite through an Operation
For generic support of flowing layout rewrites, we propose a flow mechanism for each abstraction level. With these two approaches, we believe our approach can support any `IRModule`, which may contain a mixture of Relax operators and TIR functions, within the compilation pipeline.

Flowing Layout Rewrite through an Operation at TIR-level
This section answers the question: how do we generate layout rewrites on the operand buffers of an operation when flowing a layout rewrite through it?
An analysis of the TIR `block`(s) where the result buffer is written can help us identify the layout rewrites for the operands.

Let's say the TIR PrimFunc has one result (`output`) and multiple operands (`arg0, arg1, …`). A layout rewrite `lambda N, C, H, W: (N, H, W, C // 4, C % 4)` is applied on the `output` buffer. Our goal is to identify the layout rewrites on all of the operands.

- Find the block `B` that writes to `output`.
- Look at the `T.read` and `T.write` signatures on `B`. For the `output` buffer, the access would be `output[ax0, ax1, ax2, ax3]`.
- Using the mapping `N = ax0, C = ax1, H = ax2, W = ax3`, apply the layout rewrite to each of the buffers in the `T.read` accesses.

In the presence of temporary buffer allocations in a TIR PrimFunc, we might have to flow the layout rewrites through multiple blocks up to the operands.
The following examples show how this is done for broad classes of operations.
Elementwise Operations.
In the above PrimFunc, let's say the output has a layout rewrite `lambda N, C, H, W: (N, C // 4, H, W, C % 4)`. We want to flow this rewrite from `output` to `input`. The `output` buffer is written in the block `compute`. The signature of the block has information on the buffers it reads and writes: here it writes to buffer `output[i0_1, i1_1, i2_1, i3_1]` and reads buffer `input[i0_1, i1_1, i2_1, i3_1]`. Since the accesses to these two buffers are identical, the layout rewrites are identical too. Thus, layout planning modifies both the `input` and `output` buffers with the same layout rewrite as it flows the rewrite operation through the `relu` op.

Broadcast Operations.
Let's say the output has a layout rewrite operation `lambda N, C, H, W: (N, C // 4, H, W, C % 4)`. Our goal is to flow the layout rewrite operation from result to operands. We know the `output` buffer is written in the block `T_add`. The signature has the access patterns for the `output[ax0, ax1, ax2, ax3]`, `input[ax0, ax1, ax2, ax3]`, and `bias[ax1, 0, 0]` buffers. Mapping the axes (`C == ax1`) and applying the layout rewrite, we get the layout rewrites for the `input` and `bias` buffers:

- `input`: `lambda N, C, H, W: (N, C // 4, H, W, C % 4)`
- `bias`: `lambda i, j, k: (i // 4, 0, 0, i % 4)`
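A quick plain-Python check (illustrative, not the Relax API) that the derived `bias` rewrite preserves the broadcast values:

```python
# Hedged sketch: bias has shape (C, 1, 1) and is read as bias[c, 0, 0];
# after the rewrite lambda i, j, k: (i // 4, 0, 0, i % 4) it is stored with
# shape (C // 4, 1, 1, 4) and read as bias[c // 4, 0, 0, c % 4].
C = 8
bias = [[[10.0 * c]] for c in range(C)]                     # bias[c][0][0]
tiled = [[[[bias[co * 4 + ci][0][0] for ci in range(4)]]]   # rewritten layout
         for co in range(C // 4)]
for c in range(C):
    assert tiled[c // 4][0][0][c % 4] == bias[c][0][0]
```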
Reduction Operations.
Let's say the output has a layout rewrite operation `lambda N, C: (N, C // 4, C % 4)`. Our goal is to flow the layout rewrite operation from result to operand. We know the `output` buffer is written in the block `rxplaceholder_red`. The signature has the access patterns for `output[ax0, ax1]` and `input[ax0, ax1, k1, k2]`. Mapping the axes (`C == ax1`) and applying the layout rewrite, the layout rewrite for the `input` buffer will be `lambda i, j, k, l: (i, j // 4, k, l, j % 4)`.
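A quick plain-Python check (illustrative, not the TVM API) that the derived input rewrite preserves the reduction result:

```python
# Hedged sketch: reducing over h, w gives the same sums whether the input
# is enumerated in (n, c, h, w) order or through the rewritten
# (n, c // 4, h, w, c % 4) layout. All values are synthetic.
N, C, H, W = 1, 8, 2, 2
val = lambda n, c, h, w: n + 10 * c + 100 * h + 1000 * w

# output[n][c] = sum over h, w in the original layout
out = [[sum(val(n, c, h, w) for h in range(H) for w in range(W))
        for c in range(C)] for n in range(N)]

# same reduction, iterating the input through the rewritten layout
out_tiled = [[sum(val(n, co * 4 + ci, h, w) for h in range(H) for w in range(W))
              for co in range(C // 4) for ci in range(4)] for n in range(N)]
assert out == out_tiled
```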
Fused Operations.
Fused operations have the added complication of a series of blocks. We can use dependency analysis on buffers, via the block signatures, to propagate the layout rewrites from result to operands.
Flowing Layout Rewrite through an Operation at Graph-level
Oftentimes, layout rewrites need to flow at the graph level before lowering to the TIR level. BYOC is a good example: (1) BYOC may have certain layout constraints, and (2) BYOC codegen works at the graph level by converting each Relax op to its BYOC equivalent. Relay solves this problem by introducing `InferCorrectLayout` to provide manual guidance. Unfortunately, it has been the source of many tricky issues (e.g., apache/tvm#10118, apache/tvm#10156, apache/tvm#12007).

To overcome this problem, we propose the following graph-level flow mechanism that peeks at the `PrimFunc` implementation while leveraging the powerful TIR-level analysis:

1. Look up the `PrimFunc` implementation for an operator.
2. Transform the `PrimFunc` to conform to the given layout.
3. Convert the transformed `PrimFunc` back to the corresponding operator.

Step 3 might be tricky with the current `PrimFunc` design, since we lose convenient access to op-level info (e.g., operator name, attributes, etc.) at the TIR level; to be clear, a `PrimFunc` implicitly embeds this information, but it may not be easy to extract. Based on our investigation, it seems possible to extend `PrimFunc` to maintain the op-level information in an accessible way. It also reveals that we might be able to maintain this information when flowing transpose-like layout rewrites through such functions. Note that the BYOC use case only requires transpose-like layout rewrites (e.g., `NCHW` to `NHWC`), which are easier to support than layout rewrites with implicit tiling/padding (e.g., `NCHW` to `NCHWc`).

Fallback: Layout Rewrite Transformation Callback
Although the above sections cover most scenarios, in some cases the user might want explicit control over how a Relax operator or PrimFunc should be modified when flowing layout rewrites through it. For example, the PrimFunc could have an opaque computation, making it hard to figure out the operand layout rewrites from the result layout rewrite. For such cases, an easy way to register a callback function on an operation would be provided. Similar to the `FTVMConvertLayout` property in Relay, it would allow user-defined alterations to the operation when flowing layout rewrites. When registered, the callback is used instead of the analyses mentioned in the previous sections.

Advantage of Relax Layout Planning over Relay
The approach described in this doc has multiple advantages over layout planning in Relay.

- The `InferCorrectLayout` property currently spans many lines of code with complex logic that is error-prone and a huge burden to maintain. In the new layout planning pass, none of this code is needed, which is a huge win. The strategy described above is much more robust and can handle a broad class of `PrimFunc`s, even ones that do not correspond to operations in the Relax operator system.
- Relay uses strings to represent layouts, where `H` represents height, `W` represents width, etc. This results in code that can easily break. Relax layout planning does not use strings for layout representation, thus avoiding the pitfalls of Relay while staying robust against general layout transformations.
- Relay operator developers must implement `InferCorrectLayout` for their new operators, which is not necessarily easy to figure out. Since Relax layout planning does not rely on this, operator registration becomes much easier.
- In Relay, `ConvertLayout` and `OpStrategy` are tightly coupled: starting from user annotation in `ConvertLayout`, during lowering `OpStrategy` finds the right implementation by using the layout information. As `OpStrategy` is a complicated, inflexible component that lives outside the pass infra, it makes layout optimization hard to debug and customize on many occasions. In Relax, on the other hand, the end-to-end flow can live in the pass infra, which is easier to debug and customize. You can perform layout planning at any stage in the pipeline before codegen, even with a partially lowered `IRModule`.

Appendix
Hoist Layout Rewrites Pass
This pass inspects a TIR function and lifts any layout rewrite blocks, before and after the computation, into separate TIR functions. It can either use an analysis of the blocks in the PrimFunc or use explicit block annotations to identify layout rewrite blocks within a PrimFunc.
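On a toy representation (a "PrimFunc" modeled as a list of named blocks; all names are illustrative, not the actual TIR data structures), the hoisting idea looks like:

```python
# Hedged toy model of the pass: blocks tagged "layout_rewrite" are lifted
# out of the function, leaving the core computation behind.
def hoist_layout_rewrites(func):
    rewrites = [blk for blk in func if blk[0] == "layout_rewrite"]
    core = [blk for blk in func if blk[0] != "layout_rewrite"]
    return rewrites, core

matmul = [("layout_rewrite", "A -> A_packed"), ("compute", "matmul")]
rewrites, core = hoist_layout_rewrites(matmul)
assert rewrites == [("layout_rewrite", "A -> A_packed")]
assert core == [("compute", "matmul")]
```

The real pass additionally has to thread the intermediate buffers between the lifted functions and attach the "frozen layout" attribute described earlier.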
For example, the `layout_rewrite` block in the `matmul` PrimFunc below would be lifted out into a separate PrimFunc after this pass.