[WIP][TVM] Bring Your Own Codegen to TVM #4258
Conversation
Thanks for the PR. It would be great if you could propose a separate PR for the runtime part. The main problem I see in the current PR is that serialization is not implemented. You can run things through relay.execute because the runtime is created on the fly. However, you cannot save the module, because SaveBinary and the load method are not implemented for DNNLModule. Moreover, if we are only generating C code, I think a better way would be to reuse DSOModule, e.g. generate wrapper functions in C that directly adopt the PackedFunc calling convention of the DSOModule; then the code can be compiled together with the original source and still expose these functions. Having specific shell-code modules (DNNL and GCC) adds duplication that we do not need. A better example I have in mind would be a subgraph sequence in a serialized format (which can be used in SaveBinary), with a PackedFunc that interprets the graph and runs it. This likely applies to settings like TensorRT and TF, although you can always generate a sequence of C code into low-level libs. The subgraph or special device-related blob serialization would be more relevant to accelerators.
I agree that our base module has many similarities to the DSOModule. Maybe we can consider basing it directly on the DSOModule and keeping the functions overridable, in case users want to use other forms of serialization.
To be specific, if we want to make use of the shared library, what I would do is generate the redirection code (like the current CodeGenC):

```cpp
extern "C" int my_function(TVMArgs* args, TVMTypeCode* tcode, int len) {
  void* handle = args[0].v_handle;
  // call into the real my_function
}
```

Then we can compile the file and link it together with the other parts. For modules that need their own serialization (e.g. a customized graph format), we don't have to subclass from the DSO module; we can subclass directly from Module and have the PackedFunc walk through the data structures to do the function calls. I think we need an example of this kind, because it is closer to what people need when connecting to a customized NN runtime.
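To make the redirection idea above concrete, here is a self-contained sketch. Note the assumptions: `TVMValueStub` is a minimal stand-in for TVM's real `TVMValue` union (which lives in `tvm/runtime/c_runtime_api.h`), and `my_function_impl` / `my_function` are illustrative names, not symbols from this PR.

```cpp
#include <cassert>

// Minimal stand-in for TVM's TVMValue, used only to keep this sketch
// self-contained; the real type is defined in c_runtime_api.h.
union TVMValueStub {
  long long v_int64;
  double v_float64;
  void* v_handle;
};

// The "real" external function we want to expose.
static int my_function_impl(float* data, int len) {
  for (int i = 0; i < len; ++i) data[i] += 1.0f;
  return 0;
}

// Generated redirection code: a C-ABI symbol that adopts a
// PackedFunc-style calling convention, so a DSO module can look it up
// by name and invoke it uniformly.
extern "C" int my_function(TVMValueStub* args, int* type_codes, int num_args) {
  (void)type_codes;  // a full implementation would validate these
  assert(num_args == 2);
  float* data = static_cast<float*>(args[0].v_handle);
  int len = static_cast<int>(args[1].v_int64);
  // Call into the real my_function.
  return my_function_impl(data, len);
}
```

The point of the wrapper is that the host only ever sees the uniform `(args, type_codes, num_args)` signature, so any external function can be linked and dispatched the same way.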
@tqchen Yes, we are doing something similar. We generate the C APIs directly and compile them into a .so file so that the runtime module can load it. We didn't generate a wrapper like the one you show above. Instead, we generate the real function call for a subgraph, i.e.

```cpp
void foo(float* a, int N, float* b, int M, float* c) {
  bar(...);
  foobar(...);
}
```

This foo API is generated through the Relay
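A runnable sketch of what such generated subgraph code might look like. The bodies of `bar` and `foobar` below are made up (an elementwise add and a scale); the real generated code depends entirely on the subgraph's operators.

```cpp
#include <cstddef>

// Hypothetical generated kernels for the two ops in the subgraph.
static void bar(const float* a, const float* b, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];  // elementwise add
}
static void foobar(const float* a, float* out, size_t n) {
  for (size_t i = 0; i < n; ++i) out[i] = a[i] * 2.0f;  // elementwise scale
}

// Generated entry point for the whole subgraph: plain C ABI so it can be
// compiled into a .so and loaded by the runtime module.
extern "C" void foo(float* a, int N, float* b, int M, float* c) {
  (void)M;  // N == M assumed in this illustrative subgraph
  float* tmp = new float[N];  // intermediate buffer between the two ops
  bar(a, b, tmp, static_cast<size_t>(N));
  foobar(tmp, c, static_cast<size_t>(N));
  delete[] tmp;
}
```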
# Available externs:
# gcc
# dnnl
set(USE_EXTERN none)
Given the number of external compilers and runtimes we may have, I think it's better for each one to have its own field in config.cmake, e.g. USE_GCC or USE_DNNL. With this, we can be more flexible with our options for finding the compiler/runtime, e.g. USE_GCC=ON vs USE_GCC=<path to gcc>.
I think they should be prefixed as USE_EXTERNAL_ to make it clear this is part of the external integration if we go down that path.
"""Check if the external codegen should be used.
FIXME: Turn off due to no support for multiple outputs.
"""
return False
Is there a purpose to returning false? Could this function just be removed?
OK, please try to send another PR with a mini customized runtime that loads in something like a graph or steps of dnnl calls, and implements save/load binary. This will help resolve the confusion that @soiferj has on this PR. It would be great if the runtime PR is compiler-independent, something like the graph runtime test, where we manually construct ("compile") the necessary data structures. For the C dll library, please generate shell code that directly interfaces with the DSO module, so we don't have to define another dll loader.
Sure, let's give it a try and send the runtime part first.
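The save/load-binary request above can be illustrated with a standalone roundtrip: serialize a list of "steps" (op name plus arguments, standing in for a serialized subgraph of dnnl calls) and read it back. `Step`, `SaveSteps`, and `LoadSteps` are hypothetical names; TVM's real module interface works against a dmlc stream rather than `std::ostream`.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// A "step" in a hypothetical serialized subgraph: op name plus args.
struct Step {
  std::string op;
  std::vector<int64_t> args;
};

// SaveBinary-style serialization: length-prefixed fields into a stream.
void SaveSteps(std::ostream& os, const std::vector<Step>& steps) {
  uint64_t n = steps.size();
  os.write(reinterpret_cast<const char*>(&n), sizeof(n));
  for (const Step& s : steps) {
    uint64_t len = s.op.size();
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    os.write(s.op.data(), static_cast<std::streamsize>(len));
    uint64_t na = s.args.size();
    os.write(reinterpret_cast<const char*>(&na), sizeof(na));
    if (na) os.write(reinterpret_cast<const char*>(s.args.data()),
                     static_cast<std::streamsize>(na * sizeof(int64_t)));
  }
}

// LoadBinary counterpart: reconstruct the steps from the stream.
std::vector<Step> LoadSteps(std::istream& is) {
  uint64_t n = 0;
  is.read(reinterpret_cast<char*>(&n), sizeof(n));
  std::vector<Step> steps(n);
  for (Step& s : steps) {
    uint64_t len = 0;
    is.read(reinterpret_cast<char*>(&len), sizeof(len));
    s.op.resize(len);
    if (len) is.read(&s.op[0], static_cast<std::streamsize>(len));
    uint64_t na = 0;
    is.read(reinterpret_cast<char*>(&na), sizeof(na));
    s.args.resize(na);
    if (na) is.read(reinterpret_cast<char*>(s.args.data()),
                    static_cast<std::streamsize>(na * sizeof(int64_t)));
  }
  return steps;
}
```

At runtime the module's PackedFunc would walk the loaded steps and dispatch each op, which is exactly the "interpret the graph and run things" shape tqchen describes.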
 *
 * \return The pass.
 */
TVM_DLL Pass PartitionGraph();
We should move away from graph terminology; we should start to emphasize that we have more than a data-flow graph. That terminology has led people to avoid scoping, effects, etc.
namespace relay {
namespace contrib {

class ExternCodegenBase {
Should the external interface sit in contrib?
Left a bunch of comments, thanks for taking the prototype I wrote to the finish line :)
I've only had time to review the tutorial tonight, but before anything else: this is a very interesting PR and will need more reviews and iterations with some prototyping. I've had a quick read through, and there are some obvious changes that I think should be fixed up, plus some questions about the integration.
Ramana
# if(_gcc_idx GREATER -1)
#   file(GLOB GCC_RELAY_CONTRIB_SRC src/relay/backend/contrib/gcc/codegen.cc)
#   list(APPEND COMPILER_SRCS ${GCC_RELAY_CONTRIB_SRC})
#   file(GLOB GCC_CONTRIB_SRC src/runtime/contrib/gcc/*.cc)
#   list(APPEND RUNTIME_SRCS ${GCC_CONTRIB_SRC})
#   message(STATUS "Use extern library: GCC")
# endif()
Can more than one such contrib codegen path exist in the source base? I presume that as long as the parameter to FIND_USE_EXTERN is unique, that's OK?
Yes, we can have more codegen paths in the source base. The current prototype allows you to specify a list in FIND_USE_EXTERN to enable more than one external backend, but we are still discussing the best interface for users.
# Finally, we include the implemented codegen in the cmake config so that
# it will be built along with TVM. In cmake/modules/contrib/Extern.cmake:
#
# list(FIND USE_EXTERN "gcc" _gcc_idx)
What is the meaning of FIND_USE_EXTERN here? The name isn't obvious, and I can't find any comment around it.
This is a CMake workaround to check whether "gcc" is in the USE_EXTERN list. In later versions of CMake we can simply use the IN_LIST keyword (like Python's in), but it is not allowed in old CMake. Again, since we haven't confirmed the best way of enabling external backends, this is just a prototype.
######################################################################
# Define The Supported Operators
# ------------------------------
# The first step is to define which operators are supported by your backend.
Can we support any dialect operators as well ? Should we make that explicit ?
A custom op will not have a Relay mapping, so it is not recognizable here. It seems to me that if an external backend can support some ops that Relay/TVM cannot, we should at least support them in Relay/TVM first. But we can discuss this more in the RFC.
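To illustrate the "define which operators are supported" step, here is a trivial whitelist sketch. The set contents and the `IsOpSupported` helper are illustrative only; the PR's actual registration mechanism lives in the per-backend codegen, not in a global set like this.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical whitelist for an external backend: the annotation pass
// consults this to decide which operators to offload. Op names follow
// Relay naming conventions.
static const std::set<std::string> kSupportedOps = {
    "nn.conv2d", "nn.dense", "nn.relu", "add", "subtract", "multiply"};

bool IsOpSupported(const std::string& op_name) {
  return kSupportedOps.count(op_name) != 0;
}
```

Note this also shows the dialect question raised below: an op like `qnn.conv2d` only matches if the backend explicitly lists the dialect name.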
Well, Relay can additionally define dialects; e.g. the qnn ops are a dialect on top of Relay. Can I match qnn.conv2d in these rules, for instance?
@@ -589,6 +597,25 @@ std::string AsText(const NodeRef& node,
                    bool show_meta_data = true,
                    runtime::TypedPackedFunc<std::string(Expr)> annotate = nullptr);

/*! \brief namespace of the attributes that are attached to a function. */
namespace attr {
/*! \brief Mark the function as a primitive function. */
What do we mean by a "primitive" function ?
Hi @tqchen, based on the runtime PR, we are now back to the approach of building the external runtime module and integrating it into the DSO module. Specifically, when users invoke
Please let us know which option you prefer, or whether you have a better solution. Thanks.
The problem we have right now is letting the DSOModule (in the graph runtime) find a symbol from the imported CSourceModule and execute it. It looks like we need to generate code for the CSourceModule first, which was done in the example through
@masahi Thanks for your interest. We changed the runtime a bit. We tried to generate a DSO module directly and load it there, but now we make the external module either a CSourceModule or a simple JSONModule. We had a bit of confusion about how to integrate the CSourceModule into the DSO module. We will clean up the PR and address all comments once this is figured out.
@zhiics ok, until then I can play around with my working build (before the rebase). You also need to fix build errors due to the recent "Unified object protocol" initiative by tqchen.
@zhiics sorry, my earlier comment about build errors was due to my corrupted environment. The build is working now.
std::vector<std::pair<std::string, int>> out_;
};

class GccCodegen : public ExternCodegenBase {
GccCodegen does not necessarily make sense as a name, since the codegen is not gcc-specific (it works for any C compiler).
Yes, you are right. The name doesn't make sense. How about CSourceCodegen?
#include "dnnl.hpp"

namespace tvm {
namespace runtime {
The header should be an internal header (move it to src/), as we don't want to expose it to users of TVM.
Yeah, I intentionally moved it from src to the headers. The reason is that the wrapper can then directly include it, and export_library will take care of finding the path under include/tvm here. Otherwise, we would expect users to pass -I$PATH_TO_TVM/src/runtime/contrib
@@ -639,6 +668,35 @@ class GraphRuntimeCodegenModule : public runtime::ModuleNode {
    return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {
      *rv = this->output_.lowered_funcs;
    });
  } else if (name == "get_external_funcs") {
I don't really like the current monolithic approach to handling the external compilations. Perhaps we can think a bit deeper.
After looking more closely at the PR again, there are a few high-level things. We have done an iteration that removes the special handling code in the runtime and just makes use of the runtime DSO module mechanism, which is great. There is, however, still quite a lot of special codepath for code generation (GraphRuntimeCodegen), which is not a good thing, because we have multiple runtime variants (e.g. the VM).

Codegen Logic

Ideally, we want something like IRModule -> Codegen -> RuntimeModule, or a collection of them, where the IRModule could contain functions with an explicit compiler annotation so that a specific compiler is invoked. I can imagine us handling this in the compile_engine, so that the caller does not have to worry about extern vs non-extern. This is something we might be able to separate out as another PR.

Graph Partition and Annotation

The graph partition and annotation should be a pass that takes IRModule -> IRModule, which then makes use of the data.
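The IRModule -> Codegen -> RuntimeModule flow sketched above can be made concrete with toy types. Everything here is hypothetical (`ToyFunction`, `ToyRuntimeModule`, `CompileModule` are not TVM classes); the point is only the dispatch shape: group functions by their compiler attribute and invoke the matching codegen, so the caller never branches on extern vs non-extern.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy stand-ins for IR functions and runtime modules.
struct ToyFunction {
  std::string name;
  std::string compiler;  // "" means the default TVM codegen
};
struct ToyRuntimeModule {
  std::string produced_by;
  std::vector<std::string> funcs;
};

using Codegen = std::function<ToyRuntimeModule(const std::vector<ToyFunction>&)>;

// Group functions by compiler annotation and invoke each codegen once,
// producing a collection of runtime modules.
std::vector<ToyRuntimeModule> CompileModule(
    const std::vector<ToyFunction>& funcs,
    const std::map<std::string, Codegen>& codegens) {
  std::map<std::string, std::vector<ToyFunction>> groups;
  for (const auto& f : funcs) groups[f.compiler].push_back(f);
  std::vector<ToyRuntimeModule> mods;
  for (const auto& kv : groups) mods.push_back(codegens.at(kv.first)(kv.second));
  return mods;
}
```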
Thanks for pointing this out. This is also something we were trying to achieve, and the external codegen (xxCodegen.cc) looks exactly like that. I agree that putting the codegen logic in GraphRuntimeCodegen is not clean. But it seems that compile_engine does not really need to do much (or even anything) for external functions. I was thinking that we can probably just have a packed function,
Yes, they are IRModule -> IRModule.
auto conv2d_src_md = memory::desc({conv2d_src_tz}, dt::f32, tag::any);
auto conv2d_bias_md = memory::desc({conv2d_bias_tz}, dt::f32, tag::any);
auto conv2d_weights_md = memory::desc({conv2d_weights_tz}, dt::f32, tag::any);
auto conv2d_dst_md = memory::desc({conv2d_dst_tz}, dt::f32, tag::nchw);
Why hard code the dst format to nchw?
auto conv2d_dst_memory = memory(conv2d_prim_desc.dst_desc(), eng);

auto conv = convolution_forward(conv2d_prim_desc);
conv.execute(s, {{DNNL_ARG_SRC, conv2d_src_memory},
tag::any is used to create the primitive, so one needs to query the optimal format and perform a reorder from nchw to that optimal format. But since the dst format is hard-coded to nchw, most likely the optimal format here ends up being nchw anyway. The implementation here is a little strange to me.
memory::dims dst_tz = {p_B_, p_O_};

auto data_md = memory::desc{{data_tz}, dt::f32, tag::nc};
auto weight_md = memory::desc({{weight_tz}, dt::f32, tag::nc});
Better to be tag::io or tag::oi.
Let's close this since most of the work has been merged. The annotation template will be considered separately.
This is a WIP that enables different backends and/or hardware vendors to bring their own codegen tools to TVM. This is a collaboration between @comaniac and me. @jroesch also provided lots of suggestions on the initial design. The RFC can be found here: https://discuss.tvm.ai/t/bring-your-own-codegen-to-tvm/4501/27
Some high-level design and APIs involve the following parts:
Graph coloring/annotation
Provides HW vendors an infra to customize where they want to execute an op. Two possible ways are allowed to annotate a graph:
- directly inserting subgraph_begin and subgraph_end annotations;
- a customized annotation pass (for example, a more sophisticated algorithm could be implemented to annotate groups of operators).
A separate build pipeline, build_extern, is used to invoke partitioning. On completion of this pipeline, the operators that will be offloaded are wrapped with subgraph_start and subgraph_end annotations. The annotated program is then sent to the normal build pipeline for code and artifact generation.
Graph partitioning
It is a Relay pass that partitions a program into segments that can be executed on various hardware platforms based on the annotations. The current implementation does not yet fuse consecutive subgraphs belonging to the same backend; that will be handled by follow-up PRs.
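The annotation/partitioning idea can be sketched over a linear op sequence: wrap each maximal run of backend-supported ops in subgraph_begin/subgraph_end markers. Real Relay programs are graphs rather than lists, and `Annotate` is a hypothetical helper; this only illustrates where the markers end up.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Walk a linear op sequence and wrap each maximal run of supported ops
// in subgraph markers, mimicking the annotation step described above.
std::vector<std::string> Annotate(const std::vector<std::string>& ops,
                                  const std::set<std::string>& supported) {
  std::vector<std::string> out;
  bool in_subgraph = false;
  for (const auto& op : ops) {
    bool ok = supported.count(op) != 0;
    if (ok && !in_subgraph) { out.push_back("subgraph_begin"); in_subgraph = true; }
    if (!ok && in_subgraph) { out.push_back("subgraph_end"); in_subgraph = false; }
    out.push_back(op);
  }
  if (in_subgraph) out.push_back("subgraph_end");  // close a trailing run
  return out;
}
```

A partitioner would then lift each marked region into its own function tagged with the external compiler's name.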
Code generation
Generates code for each segment of a partitioned Relay program. Each codegen tool is wrapped in a runtime module so that we can leverage the current TVM infra for serialization and runtime invocation.
FYI, we currently use GCC as an external codegen tool for easy prototyping and verification. It should be removed later when we land this.
cc @tqchen @yzhliu @wweic @broune @soiferj @junrushao1994 @icemelon9