[relay] Relay annotation and partitioning for external compilers #4570
Conversation
@zhiics when I run test_pass_partition_graph.py, I get a warning which says:
Is this expected? Or do I need to configure something in cmake?
@masahi Sorry, I mistakenly changed the name of "ccompiler" directly, so it could not find the registry and fell back to CPU. It is fixed now. Please try it again.
OK, fixed now. I have two questions regarding how this PR would interact with other transformation passes:
Yes, this is possible. I just added a unit test for this. Please check it out.
Yes, you should be able to customize your optimization pass pipeline. For example:

# Given a Relay module, mod
with relay.quantize.qconfig(...):
    q_mod = relay.quantize.quantize(mod, params)

seq1 = relay.transform.Sequential([relay.transform.xx, relay.transform.yy])
with relay.build_config(...):
    mod = seq1(q_mod)
    mod = relay.build_extern_compiler(mod, "ccompiler")

with relay.build_config(...):
    graph, lib, params = relay.build(mod, "llvm")
@zhiics great, thanks.
@zhiics When I run the mobilenet test and dump the annotated graph, batch norm is not converted to the dnnl one.
It is probably because batch norm is decomposed during SimplifyInference. Is batch norm supposed to still be around at external codegen time? I think it should be. I don't hit the "else if (IsOp(call, "nn.batch_norm"))" condition here during the mobilenet test. I want to see an example of matching a pattern like Conv + BN + Relu and translating it to dnnl's fused op. It would be great if you could do that in this or a future PR; otherwise I'll do it as a practice and send a PR :)
The BN op should be turned off in the DNNL codegen because we don't support multiple outputs yet.
The BN op will not be decomposed if it is marked as an external op.
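For illustration, a minimal sketch of how a backend-specific annotator could control which ops are marked external; the class name and whitelist contents below are hypothetical and may differ from the DNNL annotator used in the tests:

```python
from tvm import relay
from tvm.relay.expr_functor import ExprMutator
from tvm.relay.op.annotation import compiler_begin, compiler_end

class WhitelistAnnotator(ExprMutator):
    """Wrap calls to whitelisted ops with compiler_begin/compiler_end so
    they are kept intact and offloaded to the given external compiler."""

    def __init__(self, ops, compiler):
        super().__init__()
        self.ops = ops            # e.g. {"nn.conv2d", "nn.batch_norm", "nn.relu"}
        self.compiler = compiler  # e.g. "dnnl"

    def visit_call(self, call):
        op_name = getattr(call.op, "name", None)
        if op_name in self.ops:
            # Open the external region at each argument ...
            args = [compiler_begin(self.visit(arg), self.compiler)
                    for arg in call.args]
            new_call = relay.Call(call.op, args, call.attrs, call.type_args)
            # ... and close it right after the call.
            return compiler_end(new_call, self.compiler)
        return super().visit_call(call)
```

With nn.batch_norm in the set, the op would survive to the external codegen; leaving it out lets SimplifyInference decompose it as usual, which matches both use cases discussed in this thread.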
If BN is not decomposed, why am I not seeing dnnl_bn generated?
It is not marked as external because
@comaniac ok thanks, I'll study the code.
BTW, the
Ideally, whether to keep batch norm or not should be configurable. For my use case, I actually want batch norm to be decomposed and its multiplication folded into the previous conv, so that I don't have to worry about supporting batch norm on my end. But I expect others have a special op for handling fused Conv + BN + Relu. UPDATE: I realized this is already achieved by allowing each backend to have its own annotate_compiler registration.
Yes, you are right. This PR provides an interface (annotation) for specialized codegens to do subgraph pattern matching for fusion or similar optimizations. However, we also realize that this may be painful for developers, and it might be better to provide a more convenient pattern-matching facility for widely used cases (e.g., Conv+BN+ReLU as you mentioned). Our follow-up plan is to focus more on the graph partitioning algorithm, and this is one of the cases we want to cover.
For others who might be interested in this PR, here is a before/after of partitioning in test_multi_node_compiler().
Before:
After:
It would be great to think a bit more about the API. Personally, I don't like the fact that we are adding new build functions (build_extern_compiler). We should advocate the pipeline API more and avoid the general interface. Given that this is not our final form, I would suggest we only keep the pipeline API, e.g. mod = relay.transform.AnnotateExternCompiler("dnnl")(mod). Similarly, the term register_annotate_compiler seems to be quite arbitrary. Is it mainly built for dnnl? Should it have a dnnl namespace instead? Should we just make use of set_attr for now?
So, to summarize my concerns: extern compilation support is definitely an important feature, and how we design the related API will have quite a big impact on our users. While I think the related PRs have the technical solution to deliver "a solution" for our problem, it would be really nice if we could bring a few API design options (e.g., the choice of build_extern_compiler vs AnnotateCompiler, the choice of register_annotate_compiler, their signatures) as an RFC discussion; this way we can make sure we design the easiest-to-use APIs for our developers. How about opening a new discuss thread that contains those choices and trying to get people's opinions on them?
As discussed in https://discuss.tvm.ai/t/rfc-naming-and-api-signature-for-external-compilation/5292, let's remove the annotation template for now. Instead, we only focus on partitioning and expect developers to annotate the program first. Some example annotators that leverage the pass manager are added to help developers write their own annotator. Follow-up PRs will be sent to help with annotation, which may need more discussion and thought on how to combine with OpStrategy.
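For instance, once a module carries compiler_begin/compiler_end annotations (whether inserted by hand or by a custom annotator pass), the new partitioning pass composes with the rest of the pipeline through the pass manager. A rough sketch in the style of the earlier example; the pass selection and opt_level are illustrative, and relay.build_config is the older name for what newer TVM versions expose as tvm.transform.PassContext:

```python
from tvm import relay

# mod is assumed to already carry compiler_begin/compiler_end annotations.
seq = relay.transform.Sequential([
    relay.transform.InferType(),
    relay.transform.PartitionGraph(),  # split annotated regions into external functions
])
with relay.build_config(opt_level=3):
    mod = seq(mod)
```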
LGTM
@tqchen We removed the annotation part; are there any other outstanding concerns?
…che#4570)
* [relay] Relay annotation and partitioning for codegen
* Add fusion unit test
* fix comments
* Update include/tvm/relay/attrs/annotation.h (Co-Authored-By: 雾雨魔理沙 <lolisa@marisa.moe>)
* rebase
* remove annotation helper
* rebase again
Co-authored-by: Cody Yu <comaniac0422@gmail.com>
Co-authored-by: 雾雨魔理沙 <lolisa@marisa.moe>
This is the partitioning part of #4258
Updated
Here, we only focus on partitioning and expect developers to annotate the program first. We provide some examples to help developers write their own annotators using the pass manager directly from the Python frontend.
How to use
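A rough end-to-end sketch of the flow, hand-annotating a toy graph with the compiler_begin/compiler_end annotations and the "ccompiler" backend used in the unit tests; the shapes and expressions are illustrative only, and module/namespace names such as tvm.IRModule may differ slightly across TVM versions:

```python
import numpy as np
import tvm
from tvm import relay
from tvm.relay.op.annotation import compiler_begin, compiler_end

# A toy graph where the add is marked as a region for the external compiler.
x = relay.var("x", shape=(2, 2))
y = relay.var("y", shape=(2, 2))
add = relay.add(compiler_begin(x, "ccompiler"),
                compiler_begin(y, "ccompiler"))
out = relay.subtract(compiler_end(add, "ccompiler"),
                     relay.const(np.ones((2, 2), dtype="float32")))
func = relay.Function([x, y], out)
mod = tvm.IRModule.from_expr(func)

# Partition the annotated regions into external functions, then build as usual.
mod = relay.transform.InferType()(mod)
mod = relay.transform.PartitionGraph()(mod)
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, "llvm")
```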
Follow-up PRs will be sent to:
provide a convenient annotation framework, which may need more discussion and thought on how to combine with OpStrategy
add tutorials to help developers add external codegen/runtime and create their own build pipeline
add more comprehensive operator support for dnnl
CC @tqchen @jroesch @icemelon9 @u99127 @masahi @soiferj @comaniac