
[WIP][TVM] Bring Your Own Codegen to TVM #4258

Closed
wants to merge 34 commits into from

Conversation

zhiics
Member

zhiics commented Nov 5, 2019

This is a WIP that enables different backends and/or hardware vendors to bring their own codegen tools to TVM. This is a collaboration between @comaniac and me. @jroesch also provided lots of suggestions on the initial design. The RFC can be found here: https://discuss.tvm.ai/t/bring-your-own-codegen-to-tvm/4501/27

Some high-level design and APIs involve the following parts:

  • Graph coloring/annotation
    Provides HW vendors an infrastructure to customize where they want to execute an op.
    Two ways to annotate a graph are supported:

    • Custom pass: users can write a Relay pass to decide how they want to partition the graph using subgraph_begin and subgraph_end annotations. For example, a more sophisticated algorithm could be implemented to annotate groups of operators.
    • A high-level API that helps users/vendors enable a convenient integration:
      @reg.register_extern_op("nn.conv2d")
      def conv2d(attrs, args, comp):
          return get_extern_op(comp, "conv2d")(attrs, args)
      Each codegen only needs to provide the supported operators, and it can invoke a separate build pipeline, e.g. build_extern, to trigger partitioning. On completion of this pipeline, the operators that will be offloaded are wrapped with subgraph_begin and subgraph_end annotations. The annotated program is then sent to the normal build pipeline for code and artifact generation.
  • Graph partitioning
    A Relay pass that partitions a program into segments that can be executed on various hardware platforms based on the annotations. The current implementation does not yet fuse consecutive subgraphs belonging to the same backend; this will be handled by follow-up PRs.

  • Code generation
    Generates code for each segment of a partitioned Relay program. Each codegen tool is wrapped into a runtime module so that we can leverage the current TVM infra to do serialization and runtime invocation; a rough sketch of this step is shown below.
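The following minimal C++ sketch illustrates that flow. EmitCSource and CSourceModuleCreate are hypothetical stand-ins for whatever source emitter and module wrapper a backend supplies, not the exact APIs of this PR:

#include <tvm/relay/expr.h>
#include <tvm/runtime/module.h>
#include <string>

// Hypothetical emitter: walks the partitioned Relay function (e.g. with an
// ExprVisitor) and returns backend-specific C source.
std::string EmitCSource(const tvm::relay::Function& func);

// Hypothetical wrapper: packages C source as a runtime module so the usual
// TVM serialization/invocation infra applies.
tvm::runtime::Module CSourceModuleCreate(const std::string& code,
                                         const std::string& fmt);

tvm::runtime::Module MyExternCodegen(const tvm::relay::Function& func) {
  std::string c_source = EmitCSource(func);    // 1. generate code per segment
  return CSourceModuleCreate(c_source, "cc");  // 2. wrap as runtime::Module
}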

FYI, we currently use GCC as an external codegen tool for easy prototyping and verification. It should be removed later when we land the PR.

cc @tqchen @yzhliu @wweic @broune @soiferj @junrushao1994 @icemelon9

@tqchen
Member

tqchen commented Nov 5, 2019

Thanks for the PR.

It would be great if you could propose a separate PR for the runtime/contrib support, given that the runtime itself can be quite interesting and we want to have a clear guide for it.

The main problem I see in the current PR is that the serialization is not implemented.

You can run things through relay.execute because the runtime is created on the fly. However, you cannot save the module, because SaveToBinary and the load method are not implemented for DNNLModule.

Moreover, if we are only generating C code, I think a better way would be to reuse DSOModule, e.g. to generate wrapper functions in C that directly adopt the PackedFunc calling convention in the DSOModule; then the code can be compiled together with the original source and still expose these functions. Having a specific shell-code Module (DNNL and GCC) adds duplication that we do not need.

A better example I have in mind could be something like a subgraph sequence in a serialized format (which can be used in SaveToBinary), where a PackedFunc interprets the graph and runs things. This likely applies to settings like TensorRT and TF, although you can always generate a sequence of C code into low-level libs. The subgraph or special device-related blob serialization would be more relevant to accelerators.
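To make that idea concrete, here is a minimal sketch of such a module: it stores a serialized op sequence, interprets it in a PackedFunc, and implements SaveToBinary/LoadFromBinary. MySubgraphModule, its op-name format, and the registration key are all hypothetical, not this PR's API:

#include <dmlc/io.h>
#include <dmlc/logging.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/packed_func.h>
#include <string>
#include <utility>
#include <vector>

namespace tvm {
namespace runtime {

class MySubgraphModule : public ModuleNode {
 public:
  explicit MySubgraphModule(std::vector<std::string> ops) : ops_(std::move(ops)) {}

  const char* type_key() const final { return "my_subgraph"; }

  PackedFunc GetFunction(const std::string& name,
                         const ObjectPtr<Object>& sptr_to_self) final {
    // A single PackedFunc walks the serialized op sequence and dispatches
    // each step to the vendor library (dispatch elided).
    return PackedFunc([this](TVMArgs args, TVMRetValue* rv) {
      for (const std::string& op : ops_) {
        LOG(INFO) << "dispatching " << op;  // invoke the vendor kernel here
      }
    });
  }

  void SaveToBinary(dmlc::Stream* stream) final {
    stream->Write(ops_);  // the serialized "subgraph"
  }

  // Needs to be registered as a global loader (e.g. under a key like
  // "module.loadbinary_my_subgraph"; the exact key is runtime-dependent).
  static Module LoadFromBinary(void* strm) {
    auto* stream = static_cast<dmlc::Stream*>(strm);
    std::vector<std::string> ops;
    CHECK(stream->Read(&ops));
    return Module(make_object<MySubgraphModule>(std::move(ops)));
  }

 private:
  std::vector<std::string> ops_;  // e.g. steps of dnnl calls
};

}  // namespace runtime
}  // namespace tvm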

tqchen self-assigned this Nov 5, 2019
@zhiics
Member Author

zhiics commented Nov 5, 2019

@tqchen Thanks for the comment :) We actually tried something similar to the DSOModule you mentioned here. @comaniac, can you share a bit more about it? Anyway, let me take a look at it and see if we missed something.

@comaniac
Contributor

comaniac commented Nov 5, 2019

I agree that our base module has many similarities to the DSOModule. Maybe we can consider directly basing it on the DSOModule and keeping the functions overridable in case users want to use other forms of serialization.

@tqchen
Member

tqchen commented Nov 5, 2019

To be specific, if we want to make use of the shared library, what I would do is generate the redirection code (like the current CodeGenC):

extern "C" int my_function (TVMArgs* args, TVMTypeCode* tcode, int len) {
    void* handle = args[0].v_handle
    // call into the real my_function
}

Then we can compile the file and link it together with the other parts.

For modules that need their own serialization (e.g. those with a customized graph format), we don't have to subclass from the DSO module; we can directly subclass from Module and, in the PackedFunc, walk through the data structures to do the function calls. I think we need an example of this kind, because it is closer to what people need when connecting to a customized NN runtime.

@zhiics
Member Author

zhiics commented Nov 5, 2019

@tqchen Yes, we are doing something similar. We generate the C APIs directly and compile them into a .so file so that the runtime module can load it. We didn't generate a wrapper like the one you provided above. Instead, we generate the real function call for a subgraph, i.e.

// Generated entry point for one subgraph; takes raw pointers and dimensions.
void foo(float* a, int N, float* b, int M, float* c) {
   bar(...);     // calls into the generated kernels
   foobar(...);
}

This foo API is generated through the Relay ExprVisitor, and it is invoked using GetPackedFunction. The NDArray-to-float* conversion is currently done in GetPackedFunction instead of in foo, roughly as sketched below.
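For illustration, a minimal sketch of that conversion layer, assuming the generated foo above (the wrapper name MakeFooWrapper is ours, not the PR's):

#include <tvm/runtime/packed_func.h>

using namespace tvm::runtime;

// Generated subgraph entry point (as in the snippet above).
void foo(float* a, int N, float* b, int M, float* c);

// The wrapper unpacks DLTensor arguments into raw float pointers before
// calling the generated entry point, so foo itself stays runtime-agnostic.
PackedFunc MakeFooWrapper() {
  return PackedFunc([](TVMArgs args, TVMRetValue* rv) {
    DLTensor* a = args[0];
    DLTensor* b = args[1];
    DLTensor* c = args[2];
    foo(static_cast<float*>(a->data), static_cast<int>(a->shape[0]),
        static_cast<float*>(b->data), static_cast<int>(b->shape[0]),
        static_cast<float*>(c->data));
  });
}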

include/tvm/runtime/vm.h
# Available externs:
# gcc
# dnnl
set(USE_EXTERN none)
Contributor
Given the number of external compilers and runtimes we may have, I think it's better for each one to have its own field in config.cmake. For example, USE_GCC or USE_DNNL. With this, we can be more flexible with our options for finding the compiler / runtime. For example, USE_GCC=ON vs USE_GCC=<path to gcc>

Contributor
I think they should be prefixed with USE_EXTERNAL_ to make it clear this is part of the external integration, if we go down that path.

include/tvm/relay/attrs/annotation.h
"""Check if the external codegen should be used.
FIXME: Turn off due to not support of multiple outputs.
"""
return False
Contributor
Is there a purpose to returning false? Could this function just be removed?

python/tvm/relay/op/op.py
src/relay/pass/extern_op.cc
@tqchen
Member

tqchen commented Nov 6, 2019

OK, please try to send another PR with a mini customized runtime that loads in something like a graph or a sequence of dnnl calls, and implements save/load binary. This will help resolve the confusion that @soiferj has on this PR. It would be great if the runtime PR is compiler-independent, something like the graph runtime test, where we manually construct ("compile") the necessary data structures.

For the C DLL library, please generate the shell code that directly interfaces with the DSO module, so we don't have to define another DLL loader.

@zhiics
Member Author

zhiics commented Nov 6, 2019

Sure, we'll give it a try and send the runtime part first.

*
* \return The pass.
*/
TVM_DLL Pass PartitionGraph();
Member
We should move away from graph terminology; we should start to emphasize that we have more than a data-flow graph. This terminology has led people to ignore scoping, effects, etc.

namespace relay {
namespace contrib {

class ExternCodegenBase {
Member
Should the external interface sit in contrib?

Member

@jroesch left a comment

Left a bunch of comments, thanks for taking the prototype I wrote to the finish line :)

@zhiics
Member Author

zhiics commented Nov 11, 2019

@jroesch @soiferj Thanks for the comments. We will come back to fix them once the runtime part is done.

Contributor

@u99127 left a comment

I've only had time to review the tutorial tonight, but before anything else: this is a very interesting PR and will need more reviews and iterations with some prototyping. I've had a quick read through, and there are some obvious changes that I think should be fixed up, plus some questions about the integration.

Ramana

tutorials/dev/custom_relay_backend.py
Comment on lines +258 to +264
# if(_gcc_idx GREATER -1)
# file(GLOB GCC_RELAY_CONTRIB_SRC src/relay/backend/contrib/gcc/codegen.cc)
# list(APPEND COMPILER_SRCS ${GCC_RELAY_CONTRIB_SRC})
# file(GLOB GCC_CONTRIB_SRC src/runtime/contrib/gcc/*.cc)
# list(APPEND RUNTIME_SRCS ${GCC_CONTRIB_SRC})
# message(STATUS "Use extern library: GCC")
# endif()
Contributor
Can more than one such contrib codegen path exist in the source base? I presume that as long as the parameter to FIND_USE_EXTERN is unique, that's OK?

Contributor
Yes, we can have more codegen paths in the source base.
The current prototype allows you to specify a list in USE_EXTERN to enable more than one external backend, but we are still discussing the best interface for users.

# Finally, we include the implemented codegen to the cmake config so that
# it will be built along with the TVM. In cmake/modules/contrib/Extern.cmake:
#
# list(FIND USE_EXTERN "gcc" _gcc_idx)
Contributor

What is the meaning of FIND_USE_EXTERN here? The name isn't obvious, nor can I find any comment around it.

Contributor

This is a CMake workaround to check whether "gcc" is in the USE_EXTERN list. In later versions of CMake we could simply use an in keyword like Python's (e.g. IN_LIST), but that is not allowed in the old CMake versions we support. Again, since we haven't confirmed the best way of enabling external backends, this is just a prototype.

tutorials/dev/custom_relay_backend.py
######################################################################
# Define The Supported Operators
# ------------------------------
# The first step is to define which operators are supported by your backend.
Contributor

@u99127 commented Nov 12, 2019
Can we support any dialect operators as well? Should we make that explicit?

Contributor
A custom op will not have a Relay mapping, so it is not recognizable here.
It seems to me that if an external backend can support some ops that Relay/TVM cannot, we should at least support them first in Relay/TVM. But we can discuss this more in the RFC.

Contributor
Well, Relay can additionally define dialects; e.g., the qnn ops form a dialect on top of Relay. Can I match qnn.conv2d in these rules, for instance?

@comaniac
Contributor

@u99127 thanks for the comments.

As we are making another PR, #4280, for the runtime and will refine this PR accordingly after #4280 has been merged, I would suggest reviewing #4280 first for the implementation details of the runtime support.

@@ -589,6 +597,25 @@ std::string AsText(const NodeRef& node,
bool show_meta_data = true,
runtime::TypedPackedFunc<std::string(Expr)> annotate = nullptr);

/*! \brief namespace of the attributes that are attached to a function. */
namespace attr {
/*! \brief Mark the function as a primitive function. */
Contributor
What do we mean by a "primitive" function?

@u99127
Contributor

u99127 commented Nov 13, 2019

@comaniac - I'll have a look at #4280, thanks.

@comaniac
Contributor

Hi @tqchen,

Based on the runtime PR, we are now back to the approach of building an external runtime module and integrating it into the DSO module. Specifically, when users invoke build and their external codegen attempts to generate a C source module, we should build an external runtime module at that point. The design options are:

  1. Build an external.so file using system("g++ ...") as we've done before, load it back, and import it into the DSO module. The drawback is that the external.so file remains on disk.

  2. Invoke Clang frontend APIs to first compile the generated external code to LLVM IR, and then use llvm::parseIR, like the TVM LLVM backend does, to get an executable module. The uncertain part of this option is that we have no idea how to accept compile flags from users.

Please let us know which option you prefer, or whether you have a better solution.

Thanks.

@zhiics
Member Author

zhiics commented Nov 25, 2019

The problem we have right now is letting the DSOModule (in the graph runtime) find a symbol from the imported CSourceModule and execute it. It looks like we need to generate code for the CSourceModule first, which was done in the example through export_library in Python. But if we want in-memory execution, it looks like we have to be able to invoke the CSourceModule functions directly. Are we missing something?

@masahi
Member

masahi commented Nov 26, 2019

Hi @zhiics @comaniac, I'm trying this PR. I've verified that both the GCC example and the DNNL example in the slides work. Can you fix the build after the rebase?

@zhiics
Member Author

zhiics commented Nov 26, 2019

@masahi Thanks for your interest. We changed the runtime a bit. We tried to generate a DSO module directly and load it there, but now we make the external module either a CSourceModule or a simple JSONModule. We had a bit of confusion about how to integrate the CSourceModule into the DSO module. We will clean up the PR and address all comments once this is figured out.

@masahi
Member

masahi commented Nov 26, 2019

@zhiics OK, until then I can play around with my working build (before the rebase). You also need to fix build errors due to the recent "unified object protocol" initiative by tqchen.

@masahi
Member

masahi commented Nov 26, 2019

@zhiics Sorry, my earlier comment on the build error was due to my corrupted environment. The build is working now.

std::vector<std::pair<std::string, int>> out_;
};

class GccCodegen : public ExternCodegenBase {
Member
GccCodegen does not necessarily make sense as a name, as the code backend is not GCC-specific (it works for any C compiler).

Member Author

Yes, you are right. The name doesn't make sense. How about CSourceCodegen?

#include "dnnl.hpp"

namespace tvm {
namespace runtime {
Member
The header should be part of the internal headers (move it to src/), as we don't want to expose them to users of TVM.

Member Author

Yeah, I intentionally moved it from src to include. The reason is that the wrapper can then directly include it, and export_library will take care of finding the path under include/tvm here:

https://github.com/apache/incubator-tvm/blob/279a8ebae6d507f02d904397672dc44982719645/python/tvm/_ffi/libinfo.py#L183

Otherwise, we may have to expect users to pass -I$PATH_TO_TVM/src/runtime/contrib.

@@ -639,6 +668,35 @@ class GraphRuntimeCodegenModule : public runtime::ModuleNode {
return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
*rv = this->output_.lowered_funcs;
});
} else if (name == "get_external_funcs") {
Member
I don't really like the current monolithic approach to handling the external compilation. Perhaps we can think a bit deeper.

@tqchen
Member

tqchen commented Dec 4, 2019

After looking more closely at the PR again, there are a few high-level things.

We have done an iteration that removes the special handling code in the runtime and just makes use of the runtime DSO module mechanism, which is great.

There is, however, still quite a lot of special codepath for code generation (GraphRuntimeCodegen), which is not a good thing, because we have multiple runtime variations (e.g. the VM).

Codegen Logic

Ideally, we want something like IRModule -> Codegen -> RuntimeModule, or a collection of them, where the IRModule could contain functions with an explicit compiler annotation so that a specific compiler is invoked. I can imagine us handling this in the compile_engine, so that the caller does not have to worry about extern vs. non-extern.

This is something that we might be able to separate out as another PR.

Graph Partition and Annotation

The graph partitioning and annotation should be a pass that takes IRModule -> IRModule, which then makes use of the data.

@zhiics
Copy link
Member Author

zhiics commented Dec 4, 2019

Codegen Logic

Ideally, we want something like IRModule -> Codegen -> RuntimeModule, or a collection of them, where the IRModule could contain functions with an explicit compiler annotation so that a specific compiler is invoked. I can imagine us handling this in the compile_engine, so that the caller does not have to worry about extern vs. non-extern.

This is something that we might be able to separate out as another PR.

Thanks for pointing this out. This is also something we were trying to achieve, and the external codegen (xxCodegen.cc) looks exactly like that. I agree that putting the codegen logic in GraphRuntimeCodegen is not clean. But it seems that compile_engine does not really need to do much (or even anything) for external functions. I was thinking that we can probably just have a packed function, CompileExternalFuncs (it could live in compile_engine), and pass all collected external functions to it to generate runtime modules; a rough sketch is below. We only need to collect these functions from GraphRuntimeCodegen and VMCompiler when traversing the AST. Does this sound good to you?
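For illustration, a sketch of what such a dispatch could look like. The registry key "relay.ext." + compiler and the helper name are assumptions for this sketch, not a settled interface:

#include <dmlc/logging.h>
#include <tvm/relay/expr.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/registry.h>
#include <string>

// Route a function carrying an external compiler annotation to the codegen
// that the vendor registered as a global packed function.
tvm::runtime::Module CompileExternalFunc(const tvm::relay::Function& func,
                                         const std::string& compiler) {
  const auto* pf = tvm::runtime::Registry::Get("relay.ext." + compiler);
  CHECK(pf != nullptr) << "No external codegen registered for " << compiler;
  // The external codegen consumes the Relay function and returns a
  // runtime::Module that can be imported alongside the DSO module.
  return (*pf)(func);
}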

Graph Partition and Annotation

The graph partitioning and annotation should be a pass that takes IRModule -> IRModule, which then makes use of the data.

Yes, they are IRModule -> IRModule.

auto conv2d_src_md = memory::desc({conv2d_src_tz}, dt::f32, tag::any);
auto conv2d_bias_md = memory::desc({conv2d_bias_tz}, dt::f32, tag::any);
auto conv2d_weights_md = memory::desc({conv2d_weights_tz}, dt::f32, tag::any);
auto conv2d_dst_md = memory::desc({conv2d_dst_tz}, dt::f32, tag::nchw);
Member
Why hard-code the dst format to nchw?

auto conv2d_dst_memory = memory(conv2d_prim_desc.dst_desc(), eng);

auto conv = convolution_forward(conv2d_prim_desc);
conv.execute(s, {{DNNL_ARG_SRC, conv2d_src_memory},
Member
tag::any is used to create the primitive, so we need to query the optimal format and perform a reorder from nchw to that optimal format (see the sketch below). But since the dst format is hard-coded to nchw, the optimal format here is most likely nchw anyway. The implementation here is a little strange to me.
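For reference, a small sketch of the query-and-reorder pattern being suggested, using the DNNL v1.x API (the helper name MaybeReorder is ours):

#include <dnnl.hpp>

using namespace dnnl;

// If the primitive's preferred descriptor differs from the user's memory
// (e.g. nchw), allocate memory in the preferred layout and reorder into it.
memory MaybeReorder(const memory::desc& prim_md, memory user_mem,
                    const engine& eng, stream& s) {
  if (prim_md == user_mem.get_desc()) return user_mem;
  memory prim_mem(prim_md, eng);
  reorder(user_mem, prim_mem).execute(s, user_mem, prim_mem);
  return prim_mem;
}

The same pattern applies in reverse after execution: if the dst was produced in an opaque optimal layout, reorder it back into the user's nchw buffer.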

memory::dims dst_tz = {p_B_, p_O_};

auto data_md = memory::desc{{data_tz}, dt::f32, tag::nc};
auto weight_md = memory::desc({{weight_tz}, dt::f32, tag::nc});
Member
Better to be tag::io or tag::oi.

@zhiics
Member Author

zhiics commented Jan 18, 2020

Let's close this since most of the work has been merged. The annotation template will be considered separately.

zhiics closed this Jan 18, 2020
zhiics deleted the partitioning branch May 13, 2020 00:54