
[relay] Relay annotation and partitioning for external compilers #4570

Merged

merged 7 commits into apache:master from ext_annotation_partition on Jan 14, 2020

Conversation

zhiics
Member

@zhiics zhiics commented Dec 23, 2019

This is the partitioning part of #4258

Updated

Here, we only focus on partitioning and expect developers to annotate the program first. We provide some examples to help developers write their own annotators using the pass manager directly from the Python frontend (a minimal sketch follows the usage example below; see the unit tests for the full versions).

How to use

mod, params = relay.testing.mobilenet.get_workload(batch_size=1, dtype='float32')
# see the unit tests for examples of the annotator
mod = MyAnnotator(arg1, arg2, ...)(mod)
mod = relay.transform.PartitionGraph()(mod)
graph, lib, params = relay.build(mod, target=target, params=params)
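
For reference, here is a minimal sketch of such an annotator. It is illustrative only: the MyAnnotator name, its (compiler, op_name) parameters, and the "ccompiler" target in the usage line are assumptions for the example, not APIs added by this PR; the real annotators live in the unit tests.

from tvm import relay
from tvm.relay.expr_functor import ExprMutator
from tvm.relay.op.annotation import compiler_begin, compiler_end

@relay.transform.function_pass(opt_level=0)
class MyAnnotator:
    # Hypothetical annotator: offload every call to `op_name` to `compiler`.
    def __init__(self, compiler, op_name):
        self.compiler = compiler
        self.op_name = op_name

    def transform_function(self, func, mod, ctx):
        outer = self

        class Annotator(ExprMutator):
            def visit_call(self, call):
                if call.op == relay.op.get(outer.op_name):
                    # Mark each argument as entering the external region ...
                    args = [compiler_begin(self.visit(arg), outer.compiler)
                            for arg in call.args]
                    new_call = relay.Call(call.op, args, call.attrs)
                    # ... and the result as leaving it, so PartitionGraph can lift
                    # the region into a function with Compiler/ExternalSymbol attrs.
                    return compiler_end(new_call, outer.compiler)
                return super().visit_call(call)

        return Annotator().visit(func)

# Usage in the pipeline above, e.g.:
# mod = MyAnnotator("ccompiler", "add")(mod)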

Followup PRs will be sent to:

  • provide a convenient annotation framework, which may need more discussion and thought on how to combine it with OpStrategy;

  • provide tutorials that help developers add an external codegen/runtime and create their own build pipeline;

  • add more comprehensive operator support for dnnl.

CC @tqchen @jroesch @icemelon9 @u99127 @masahi @soiferj @comaniac

@masahi
Member

masahi commented Dec 25, 2019

@zhiics when I run test_pass_partition_graph.py, I get a warning which says

add in ccompiler is not registered. Fallback to CPU
multiply in ccompiler is not registered. Fallback to CPU
add in ccompiler is not registered. Fallback to CPU
subtract in ccompiler is not registered. Fallback to CPU

Is this expected? Or do I need to configure something in cmake?

@zhiics
Member Author

zhiics commented Dec 25, 2019

@masahi Sorry, I mistakenly changed the name of "ccompiler" directly, so it could not find the registry and fell back to CPU. It is fixed now. Please try it again.

@masahi
Member

masahi commented Dec 25, 2019

ok, fixed now.

I have two questions regarding how this PR would interact with other transformation passes:

  • Obviously operator fusion should be disabled for the subgraph that would be sent to external tools, but for other nodes that run on CPU I want operator fusion to work as usual. Is that possible?
  • I want subgraphs I get from this pass to be already quantized, so the quantization pass should happen before this pass. Are quantization and this pass going to work without any issues?

@zhiics
Member Author

zhiics commented Dec 25, 2019

@masahi

Obviously operator fusion should be disabled for the subgraph that would be sent to external tools, but for other nodes that run on CPU I want operator fusion to work as usual. Is that possible?

Yes, this is possible. I just added a unit test for this. Please check it out.
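
For context, the flow is roughly the following (using the hypothetical MyAnnotator from the description above): the partitioned regions become Primitive=1 functions carrying Compiler/ExternalSymbol attributes, which FuseOps skips, while the remaining host-side operators are fused as usual during relay.build.

mod = MyAnnotator("ccompiler", "add")(mod)     # hypothetical annotator
mod = relay.transform.PartitionGraph()(mod)    # external regions -> Primitive functions
with relay.build_config(opt_level=3):          # FuseOps still fuses the host-side ops
    graph, lib, params = relay.build(mod, target="llvm", params=params)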

I want subgraphs I get from this pass to be already quantized, so the quantization pass should happen before this pass. Are quantization and this pass going to work without any issues?

Yes, you should be able to customize your optimization pass pipeline. For example:

# Given a Relay module `mod` and its parameters `params`
with relay.quantize.qconfig(...):
    q_mod = relay.quantize.quantize(mod, params)

seq = relay.transform.Sequential([relay.transform.xx, relay.transform.yy])
with relay.build_config(...):
    mod = seq(q_mod)

mod = relay.build_extern_compiler(mod, "ccompiler")

with relay.build_config(...):
    graph, lib, params = relay.build(mod, "llvm")

@masahi
Member

masahi commented Dec 26, 2019

@zhiics great, thanks.

@masahi
Member

masahi commented Dec 27, 2019

@zhiics When I run the mobilenet test and dump the annotated graph, batch norm is not converted to a dnnl one.

  %150 = fn (%dnnl_input75: Tensor[(1, 1024, 7, 7), float32], %dnnl_input76: Tensor[(1024, 1, 3, 3), float32], Compiler="dnnl", ExternalSymbol="dnnl_4", Primitive=1) -> Tensor[(1, 1024, 7, 7), float32] {
    nn.conv2d(%dnnl_input75, %dnnl_input76, padding=[1, 1], groups=1024, channels=1024, kernel_size=[3, 3]) /* ty=Tensor[(1, 1024, 7, 7), float32] */
  };
  %151 = %150(%149, %separable_conv_block_13_depthwise_conv1_weight) /* ty=Tensor[(1, 1024, 7, 7), float32] */;
  %152 = nn.batch_norm(%151, %separable_conv_block_13_bn1_gamma, %separable_conv_block_13_bn1_beta, %separable_conv_block_13_bn1_moving_mean, %separable_conv_block_13_bn1_moving_var) /* ty=(Tensor[(1, 1024, 7, 7), float32], Tensor[(1024), float32], Tensor[(1024), float32]) */;
  %153 = %152.0;
  %154 = fn (%dnnl_input77: Tensor[(1, 1024, 7, 7), float32], Compiler="dnnl", ExternalSymbol="dnnl_3", Primitive=1) -> Tensor[(1, 1024, 7, 7), float32] {
    nn.relu(%dnnl_input77) /* ty=Tensor[(1, 1024, 7, 7), float32] */
  };

It is probably because batch norm is decomposed during SimplifyInference. Is batch norm supposed to still be around at external codegen time? I think it should be. I don't hit the "else if (IsOp(call, "nn.batch_norm"))" condition here during the mobilenet test.

I want to see an example of matching a pattern like Conv + BN + Relu and translating it to dnnl's fused op. It would be great if you could do that in this or a future PR; otherwise I'll do it as a practice and send a PR :)

@comaniac
Contributor

@zhiics When I run the mobilenet test and dump the annotated graph, batch norm is not converted to dnnl one.


The BN op is turned off in the DNNL codegen because we haven't supported multiple outputs yet.

It is probably because batch norm is decomposed during SimplifyInference. Is batch norm supposed to be around during external codegen time? I think it should be. I don't hit "else if (IsOp(call, "nn.batch_norm")" condition here during mobilenet test.

The BN op will not be decomposed if it is marked as an external op.

I want to see an example of matching a pattern like Conv + BN + Relu and translate them to dnnl's fused op. It would be great if you could do that in this or future PR, otherwise I'll do it as a practice and send a PR :)

Op fusion is not considered in this PR. It should be follow-up work after this PR is merged.

@masahi
Member

masahi commented Dec 27, 2019

If BN is not decomposed, why am I not seeing dnnl_bn generated?

@comaniac
Contributor

comaniac commented Dec 27, 2019

If BN is not decomposed, why am I not seeing dnnl_bn generated?

It is not marked as external because batch_norm is always set to False in https://github.com/apache/incubator-tvm/pull/4570/files#diff-245f59aa1968d2efdb8357cfbdf149c9R40-R44
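
For illustration only, flipping that switch would amount to something like the sketch below. The register_annotate_compiler helper is the one this PR adds, but the exact import path and signature shown here are assumptions, so treat this as a shape rather than working code:

from tvm.relay.op import register_annotate_compiler  # import path is an assumption

# Assumed signature: returning True marks nn.batch_norm as supported by the
# "dnnl" external compiler, instead of the hard-coded False linked above.
@register_annotate_compiler("nn.batch_norm", "dnnl")
def batch_norm(attrs, args):
    return True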

@masahi
Member

masahi commented Dec 27, 2019

@comaniac ok thanks, I'll study the code.

@comaniac
Contributor

BTW, batch_norm is also set not to be decomposed here: https://github.com/apache/incubator-tvm/blob/master/src/relay/pass/fuse_ops.cc#L243. This is not a general solution, so we also plan to improve it later.

@masahi
Member

masahi commented Dec 27, 2019

Ideally, whether to keep batch norm or not should be configurable. Actually, for my use case, I want batch norm to be decomposed and its multiplication folded into the previous conv, so that I don't have to worry about supporting batch norm on my end. But I expect others to have a special op for handling fused Conv + BN + Relu.

UPDATE: I realized this is already achieved by allowing each backend to have its own annotate_compiler registration.
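
For that use case, a rough sketch of the pipeline (standard Relay passes; MyAnnotator is the hypothetical annotator from earlier in the thread, and the op choice is illustrative):

seq = relay.transform.Sequential([
    relay.transform.SimplifyInference(),   # decompose nn.batch_norm into mul/add
    relay.transform.FoldConstant(),
    relay.transform.FoldScaleAxis(),       # fold the BN scale into the preceding conv weights
    relay.transform.FoldConstant(),
])
with relay.build_config(opt_level=3):
    mod = seq(mod)

# Annotate and partition afterwards, so the external subgraphs never see batch_norm.
mod = MyAnnotator("dnnl", "nn.conv2d")(mod)    # hypothetical annotator
mod = relay.transform.PartitionGraph()(mod)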

@comaniac
Contributor

Ideally whether to keep batch norm or not should be configurable. Actually for my use case, I want batch norm to be decomposed and its multiplication folded into previous conv, so that I don't have to worry about supporting batch norm on my end. But I expect others have a special op for handling fused Conv + BN + Relu.

UPDATE: I realized this is already achieved by allowing each backend to have its own annotate_compiler registration.

Yes, you are right. This PR provides an interface (annotation) for a specialized codegen to do subgraph pattern matching for fusion or similar optimizations. However, we also realize that this may be painful for developers, and it might be better to provide more convenient pattern matching for widely used cases (e.g., Conv+BN+ReLU, as you mentioned). Our follow-up plan is to focus more on the graph partitioning algorithm, and this is one of the cases we want to cover.

@masahi
Member

masahi commented Dec 27, 2019

For others who might be interested in this PR, here is a before/after of partitioning in test_multi_node_compiler().

Before

fn (%x: Tensor[(10, 10), float32], %w0: Tensor[(10, 10), float32], %w1: Tensor[(10, 10), float32], %w2: Tensor[(10, 10), float32], %w3: Tensor[(10, 10), float32], %w4: Tensor[(10, 10), float32], %w5: Tensor[(10, 10), float32], %w6: Tensor[(10, 10), float32], %w7: Tensor[(10, 10), float32]) -> Tensor[(30, 10), float32] {
  %0 = annotation.compiler_begin(%x, meta[relay.attrs.CompilerAttrs][0]) /* ty=Tensor[(10, 10), float32] */;
  %1 = annotation.compiler_begin(%w0, meta[relay.attrs.CompilerAttrs][1]) /* ty=Tensor[(10, 10), float32] */;
  %2 = add(%0, %1) /* ty=Tensor[(10, 10), float32] */;
  %3 = annotation.compiler_begin(%w1, meta[relay.attrs.CompilerAttrs][2]) /* ty=Tensor[(10, 10), float32] */;
  %4 = subtract(%2, %3) /* ty=Tensor[(10, 10), float32] */;
  %5 = annotation.compiler_begin(%w2, meta[relay.attrs.CompilerAttrs][3]) /* ty=Tensor[(10, 10), float32] */;
  %6 = multiply(%4, %5) /* ty=Tensor[(10, 10), float32] */;
  %7 = annotation.compiler_end(%6, meta[relay.attrs.CompilerAttrs][4]) /* ty=Tensor[(10, 10), float32] */;
  %8 = annotation.compiler_begin(%x, meta[relay.attrs.CompilerAttrs][5]) /* ty=Tensor[(10, 10), float32] */;
  %9 = annotation.compiler_begin(%w3, meta[relay.attrs.CompilerAttrs][6]) /* ty=Tensor[(10, 10), float32] */;
  %10 = add(%8, %9) /* ty=Tensor[(10, 10), float32] */;
  %11 = annotation.compiler_begin(%w4, meta[relay.attrs.CompilerAttrs][7]) /* ty=Tensor[(10, 10), float32] */;
  %12 = subtract(%10, %11) /* ty=Tensor[(10, 10), float32] */;
  %13 = annotation.compiler_begin(%w5, meta[relay.attrs.CompilerAttrs][8]) /* ty=Tensor[(10, 10), float32] */;
  %14 = multiply(%12, %13) /* ty=Tensor[(10, 10), float32] */;
  %15 = annotation.compiler_end(%14, meta[relay.attrs.CompilerAttrs][9]) /* ty=Tensor[(10, 10), float32] */;
  %16 = add(%x, %w6) /* ty=Tensor[(10, 10), float32] */;
  %17 = subtract(%16, %w7) /* ty=Tensor[(10, 10), float32] */;
  %18 = (%7, %15, %17);
  concatenate(%18) /* ty=Tensor[(30, 10), float32] */
}

After

def @main(%x: Tensor[(10, 10), float32], %w0: Tensor[(10, 10), float32], %w1: Tensor[(10, 10), float32], %w2: Tensor[(10, 10), float32], %w3: Tensor[(10, 10), float32], %w4: Tensor[(10, 10), float32], %w5: Tensor[(10, 10), float32], %w6: Tensor[(10, 10), float32], %w7: Tensor[(10, 10), float32]) -> Tensor[(30, 10), float32] {
  %2 = fn (%ccompiler_input0: Tensor[(10, 10), float32], %ccompiler_input1: Tensor[(10, 10), float32], %ccompiler_input2: Tensor[(10, 10), float32], %ccompiler_input3: Tensor[(10, 10), float32], Compiler="ccompiler", ExternalSymbol="ccompiler_0", Primitive=1) -> Tensor[(10, 10), float32] {
    %0 = add(%ccompiler_input0, %ccompiler_input1) /* ty=Tensor[(10, 10), float32] */;
    %1 = subtract(%0, %ccompiler_input2) /* ty=Tensor[(10, 10), float32] */;
    multiply(%1, %ccompiler_input3) /* ty=Tensor[(10, 10), float32] */
  };
  %3 = %2(%x, %w0, %w1, %w2) /* ty=Tensor[(10, 10), float32] */;
  %6 = fn (%ccompiler_input4: Tensor[(10, 10), float32], %ccompiler_input5: Tensor[(10, 10), float32], %ccompiler_input6: Tensor[(10, 10), float32], %ccompiler_input7: Tensor[(10, 10), float32], Compiler="ccompiler", ExternalSymbol="ccompiler_1", Primitive=1) -> Tensor[(10, 10), float32] {
    %4 = add(%ccompiler_input4, %ccompiler_input5) /* ty=Tensor[(10, 10), float32] */;
    %5 = subtract(%4, %ccompiler_input6) /* ty=Tensor[(10, 10), float32] */;
    multiply(%5, %ccompiler_input7) /* ty=Tensor[(10, 10), float32] */
  };
  %7 = %6(%x, %w3, %w4, %w5) /* ty=Tensor[(10, 10), float32] */;
  %8 = add(%x, %w6) /* ty=Tensor[(10, 10), float32] */;
  %9 = subtract(%8, %w7) /* ty=Tensor[(10, 10), float32] */;
  %10 = (%3, %7, %9);
  concatenate(%10) /* ty=Tensor[(30, 10), float32] */
}

@masahi
Member

masahi commented Jan 5, 2020

can we merge this? @tqchen @jroesch @icemelon9

@tqchen
Member

tqchen commented Jan 5, 2020

It would be great to think a bit more about the API. Personally, I don't like the fact that we are adding new build functions (build_extern_compiler). We should advocate the pipeline API more and avoid the general interface.

Given that this is not our final form, I would suggest we only keep the pipeline API, e.g.

mod = relay.transform.AnnotateExternCompiler("dnnl")(mod)

Similarly, the term register_annotate_compiler seems to be quite arbitrary. Is it mainly built for dnnl? Should it have a dnnl namespace instead? Should we just make use of set_attr for now?

@tqchen
Member

tqchen commented Jan 5, 2020

So, to summarize my concerns: extern compilation support is definitely an important feature, and how we design the related API will have a big impact on our users.

While I think the related PRs have the technical solution to deliver "a solution" to our problem, it would be really nice if we could bring a few API design options (e.g., the choice of build_extern_compiler vs. AnnotateCompiler, the choice of register_annotate_compiler, their signatures) up for RFC discussion, so that we can make sure we design the easiest-to-use APIs for our developers.

How about opening a new discuss thread that contains those choices and trying to get people's opinions on them?

@zhiics
Member Author

zhiics commented Jan 5, 2020

@zhiics
Member Author

zhiics commented Jan 13, 2020

As discussed in https://discuss.tvm.ai/t/rfc-naming-and-api-signature-for-external-compilation/5292, let's remove the annotation template for now. Instead, we only focus on partitioning and expect developers to annotate the program first.

Some example annotators that leverage the pass manager are added to help developers write their own annotator.

Follow-up PRs will be sent to help with annotation, which may need more discussion and thought on how to combine it with OpStrategy.

Contributor

@comaniac comaniac left a comment

LGTM

@zhiics
Member Author

zhiics commented Jan 14, 2020

@tqchen We removed the annotation part, any other outstanding concerns?

@zhiics zhiics merged commit 3f2abfb into apache:master Jan 14, 2020
@zhiics
Member Author

zhiics commented Jan 14, 2020

Thanks @tqchen @masahi @comaniac

@zhiics zhiics deleted the ext_annotation_partition branch February 17, 2020 03:20
alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 26, 2020
…che#4570)

* [relay] Relay annotation and partitioning for codegen

* Add fusion unit test

* fix comments

* Update include/tvm/relay/attrs/annotation.h

Co-Authored-By: 雾雨魔理沙 <lolisa@marisa.moe>

* rebase

* remove annotation helper

* rebase again

Co-authored-by: Cody Yu <comaniac0422@gmail.com>
Co-authored-by: 雾雨魔理沙 <lolisa@marisa.moe>
alexwong pushed a commit to alexwong/tvm that referenced this pull request Feb 28, 2020
zhiics added a commit to neo-ai/tvm that referenced this pull request Mar 2, 2020