
[BYOC][TensorRT] TensorRT BYOC integration #6395

Merged
merged 9 commits into apache:main on Oct 19, 2020

Conversation

trevor-m
Contributor

@trevor-m trevor-m commented Sep 3, 2020

This PR adds support for partitioning, compiling, and running the TensorRT BYOC target.

Building

There are two new cmake flags:

  • USE_TENSORRT=ON/OFF: enables TensorRT code generation - this does not require the TensorRT libraries
  • USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/"path/to/TensorRT": enables the TensorRT runtime - this requires the TensorRT libraries. A system-wide install of TensorRT from a deb package or JetPack can be detected by "ON", but a .tar.gz installation requires you to provide the path to the extracted TensorRT archive.

Usage

The compilation target should be "cuda" to ensure that input and output args to the TensorRT functions are placed on the GPU.

# Compilation
import tvm
from tvm import relay
from tvm.contrib import graph_runtime
from tvm.relay.op.contrib import tensorrt

mod = tensorrt.partition_for_tensorrt(mod, params)
with relay.build_config(opt_level=3):
  graph, lib, params = relay.build(mod, target="cuda", params=params)

# Running inference is unchanged
mod = graph_runtime.create(graph, lib, ctx=tvm.gpu(0))
mod.run(...)

High level components

Partitioning

The annotation rules for TensorRT change depending on the version of TensorRT that is being targeted as well as the "batching mode". This can be configured with the trt_version and use_implicit_batch args of partition_for_tensorrt.

If TVM was built against the TensorRT library, the linked version is used for partitioning instead.
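
For illustration, a minimal sketch of configuring the partitioning as described above (the keyword names trt_version and use_implicit_batch are taken from this description and are an assumption about the final signature):

from tvm.relay.op.contrib import tensorrt

# mod/params are the Relay module and parameters from the usage example above.
# Target TensorRT 6.0.1 annotation rules and explicit-batch mode.
mod = tensorrt.partition_for_tensorrt(
    mod, params, trt_version=(6, 0, 1), use_implicit_batch=False
)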

Codegen

This implementation uses the JSONRuntime JSONSerializer base class for codegen to serialize the Relay expression to a JSON format.

Runtime

Runtime is handled by the runtime module class in tensorrt_runtime.cc. At runtime, the TensorRTBuilder class (tensorrt_builder.cc) is first used to convert the JSON graph to a TensorRT INetworkDefinition using the TensorRT APIs, relying on the op converter classes in tensorrt_ops.cc. Then the TensorRT engine is built; this process can take up to a few minutes because TensorRT performs its optimizations at this point. The engine is cached for subsequent inference calls.

The runtime can be compiled against many TensorRT versions thanks to version guards; it will work with TensorRT 5, 6, and 7. However, the compiled model must have been partitioned for a TensorRT version <= the version used at runtime. Otherwise, the compiled model may expect ops to be available which require a newer version of TensorRT.

Areas I'm looking for feedback and ideas

  1. TensorRT has parameters such as max_workspace_size and use_implicit_batch which I want the user to be able to supply in partition_for_tensorrt. These parameters need to be passed along to the codegen and stored in the serialized graph until runtime. use_implicit_batch also influences the partitioning rules. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?

  2. I've implemented a transformation called prune_tensorrt_subgraphs() in python/tvm/relay/op/contrib/tensorrt.py. This is run after partitioning and allows me to decide whether to keep a subgraph or return it back to the typical TVM compilation path. This is needed because some subgraphs could be invalid - such as when the inputs have different batch sizes or for optimization purposes if the subgraph has no multiply-accumulates (a rough sketch of such a check is shown after this list). I have also implemented a general version of this in C++, but it uses the global registry to allow each codegen target to define its own is_invalid_subgraph callback. In the future we can switch to the generic version if we find a better way to register the callbacks.

  3. The targeted tensorrt version needs to be accessed during annotation. I've put it in a global variable for now.
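
As a rough illustration of the validity check mentioned in point 2, a subgraph with no multiply-accumulate ops could be handed back to TVM. The sketch below is only an assumption of how such a check might look (the op set and helper names are made up for illustration), not the actual prune_tensorrt_subgraphs() pass:

import tvm
from tvm.relay.expr_functor import ExprVisitor

MAC_OPS = {"nn.conv2d", "nn.dense", "nn.batch_matmul"}  # assumed set of MAC-heavy ops

class MacCounter(ExprVisitor):
    """Counts calls to multiply-accumulate ops in a partitioned function body."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def visit_call(self, call):
        if isinstance(call.op, tvm.ir.Op) and call.op.name in MAC_OPS:
            self.count += 1
        super().visit_call(call)

def is_worth_offloading(func):
    # Reject subgraphs with no MACs; offloading them to TensorRT buys nothing.
    counter = MacCounter()
    counter.visit(func.body)
    return counter.count > 0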

@zhiics
Member

zhiics commented Sep 3, 2020

@zhiics
Member

zhiics commented Sep 3, 2020

Usually we need a separate PR to install the library/environment in the CI first (TRT in this case) so that we can have better e2e tests.

@masahi
Member

masahi commented Sep 3, 2020

1. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?

How about using the Config mechanism? I learned about this from the ethos integration (thanks @mbaret) and it cleaned up my code as well. See the definition of ConfigNode below and its usage (grep for GetConfig).

https://github.com/apache/incubator-tvm/blob/30cd2302e4078b3a8787e30d70fd79e5b729ec82/src/relay/backend/contrib/ethosn/codegen_ethosn.h#L219

@trevor-m
Contributor Author

trevor-m commented Sep 3, 2020

1. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?

How about using the Config mechanism? I learned about this from the ethos integration (thanks @mbaret) and it cleaned up my code as well. See the definition of ConfigNode below and its usage (grep for GetConfig).

https://github.com/apache/incubator-tvm/blob/30cd2302e4078b3a8787e30d70fd79e5b729ec82/src/relay/backend/contrib/ethosn/codegen_ethosn.h#L219

Thanks @masahi! Let me look into this.

@comaniac
Contributor

comaniac commented Sep 3, 2020

For the other two points:

  1. Is it possible to move the pass before partitioning but after merge compiler region (like PruneTensorRTCompilerRegion)? After the merge compiler region pass you should get the Relay graph with almost the same semantics as partitioning. If you could have a pass checking each compiler region for your constraints, you can probably just remove the region you don't want, so that you should get only valid partitioned functions.

  2. Can the TensorRT version be obtained via an API call in C++? Something like tensorrt::get_version()? If so you can register a global symbol and pass the version to Python so that it can be used by the annotator.

def conv2d(...):
    # Query the registered packed function if it exists; allow_missing=True
    # returns None instead of raising when it is not registered.
    get_version = tvm.get_global_func("relay.tensorrt.version", True)
    if not get_version:
        return False
    ver = get_version()  # call the packed function to get the version string
    if ver == '1.0':
        return True
    return False

If you need to manually set up the TensorRT version, then it could be like this: let the user specify it in config.cmake and we pass the value to a macro in C++ so that you could simply return the value. The drawback of this solution is that it requires rebuilding TVM to annotate different TensorRT versions, and I'm not sure if that makes sense to you.
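
To illustrate the global-symbol idea, a minimal sketch (registering relay.tensorrt.version from Python is purely for illustration; in a real build the symbol would be registered from C++ when the TensorRT libraries are available):

import tvm

# Hypothetical registration; stands in for the C++ side.
@tvm.register_func("relay.tensorrt.version")
def _trt_version():
    return "6.0.1"

# The annotator can then query it as in the snippet above.
f = tvm.get_global_func("relay.tensorrt.version", True)
print(f() if f else "TensorRT version symbol not registered")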

@zhiics
Member

zhiics commented Sep 3, 2020

@trevor-m @masahi for the pass config, we may not be able to obtain it at runtime though

Contributor

@leandron leandron left a comment

Thanks for the patch. I reviewed mostly the Python sources and have a few comments. One thing that needs fixing is the use of print statements, which should be replaced with logging.

@ZihengJiang ZihengJiang added the status: need update need update based on feedbacks label Sep 9, 2020
@trevor-m
Contributor Author

Thanks everyone for the great feedback! I've been busy lately, but I plan to start addressing the comments this week.

@trevor-m
Contributor Author

trevor-m commented Sep 14, 2020

Thanks @comaniac!

  1. Is it possible to move the pass before partitioning but after merge compiler region (like PruneTensorRTCompilerRegion)? After the merge compiler region pass you should get the Relay graph with almost the same semantics as partitioning. If you could have a pass checking each compiler region for your constraints, you can probably just remove the region you don't want, so that you should get only valid partitioned functions.

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTensorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?

Btw just for reference, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

  2. Can the TensorRT version be obtained via an API call in C++? Something like tensorrt::get_version()? If so you can register a global symbol and pass the version to Python so that it can be used by the annotator. If you need to manually set up the TensorRT version, then it could be like this: let the user specify it in config.cmake and we pass the value to a macro in C++ so that you could simply return the value. The drawback of this solution is that it requires rebuilding TVM to annotate different TensorRT versions, and I'm not sure if that makes sense to you.

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.

Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

@comaniac
Contributor

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTensorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?

Btw just for reference, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also, in this case your post-processing pass depends on the partition pass and may break when the partition pass changes. If this requirement is important, I'd even prefer to add a post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.

Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

Looks like GetConfig is not exposed to the Python side.

@trevor-m
Contributor Author

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTensorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?
Btw just for reference, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also, in this case your post-processing pass depends on the partition pass and may break when the partition pass changes. If this requirement is important, I'd even prefer to add a post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

Thanks! That makes sense. My implementation seems tightly coupled to how PartitionGraph works.

I like the idea of adding the callback to PartitionGraph. After it puts together the function, it can check if there is a validation function registered and call it to see if it should keep the subgraph or not. Both MXNet and TF have a mechanism like this as a final check on the subgraph in their partitioning algorithms.
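
A rough sketch of what such a hook could look like from Python (the callback name and the idea that PartitionGraph would look it up in the global registry are hypothetical; no such hook exists in this PR):

import tvm

@tvm.register_func("relay.ext.tensorrt.is_valid_subgraph")
def _is_valid_subgraph(func):
    # Hypothetical hook: PartitionGraph would call this for each candidate
    # partitioned function and drop the partition when it returns False.
    return True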

I agree that solving this problem is probably best done in a separate PR.

@comaniac
Contributor

Zhi just pointed out to me offline how to access pass context configs in Python. Here is an example:

import tvm
with tvm.transform.PassContext(config={"relay.fallback_device_type": 5}):
    pass_ctx = tvm.transform.PassContext.current()
    print(pass_ctx.config["relay.fallback_device_type"])

@trevor-m
Contributor Author

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.
Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

Looks like GetConfig is not exposed to the Python side.

I see, in that case I think my current implementation using a global variable is fine since it is all confined within the one file.

Zhi just pointed out to me offline how to access pass context configs in Python. Here is an example:

import tvm
with tvm.transform.PassContext(config={"relay.fallback_device_type": 5}):
    pass_ctx = tvm.transform.PassContext.current()
    print(pass_ctx.config["relay.fallback_device_type"])

Nice! Let me try that.
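
A minimal sketch of how annotation code might read such an option back from the pass context (the config key relay.ext.tensorrt.options and its fields are assumptions for illustration; the key would have to be registered on the C++ side via TVM_REGISTER_PASS_CONFIG_OPTION before it could be set):

import tvm

TRT_KEY = "relay.ext.tensorrt.options"  # assumed key name

def get_trt_options():
    pass_ctx = tvm.transform.PassContext.current()
    if TRT_KEY in pass_ctx.config:
        return pass_ctx.config[TRT_KEY]
    # Fall back to defaults when no option was supplied.
    return {"use_implicit_batch": True, "max_workspace_size": 1 << 30}

print(get_trt_options())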

@zhiics
Member

zhiics commented Sep 14, 2020

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTensorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?
Btw just for reference, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also, in this case your post-processing pass depends on the partition pass and may break when the partition pass changes. If this requirement is important, I'd even prefer to add a post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

Yeah, I think it's okay to have a refinement pass for TRT ATM since making such a decision in the current partitioning is not easy. In the long run, we should make the partitioning pass more intelligent by taking in some configurations and partitioning over the regions accordingly. Or we can consider some of the configs when merging the regions. That would need more investigation.

Contributor

@lhutton1 lhutton1 left a comment

Just a couple of minor suggestions and comments - looks good overall, without regard to comments/concerns above.

One overall suggestion from a user's perspective: it might be useful to write a beginner's guide stating how to install and build with TensorRT support. Feel free to ignore though.

@trevor-m
Contributor Author

Just a couple of minor suggestions and comments - looks good overall, without regard to comments/concerns above.

One overall suggestion from a user's perspective: it might be useful to write a beginner's guide stating how to install and build with TensorRT support. Feel free to ignore though.

Thanks @lhutton1 for the review! Is docs/deploy the typical place for a guide like that?

@lhutton1
Contributor

Np @trevor-m, it seems that way, at least that's where I added the Arm Compute Library doc

@trevor-m
Contributor Author

There appears to be an inconsistency in the CI between tests/lint/cpplint.sh and tests/lint/clang_format.sh for this macro definition:

#define TRT_VERSION_GE(major, minor, patch)                                                    \
  ((NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
   (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

tests/lint/cpplint.sh will complain about the 3-space indent on the 3rd line:

src/runtime/contrib/tensorrt/tensorrt_utils.h:35:  Weird number of spaces at line-start.  Are you using a 2-space indent?  [whitespace/indent] [3]

If I change it to 2 spaces, the cpplint passes but the clang-format checker fails:

---------clang-format log----------
diff --git a/src/runtime/contrib/tensorrt/tensorrt_utils.h b/src/runtime/contrib/tensorrt/tensorrt_utils.h
index 6d664e47d..746726fc1 100644
--- a/src/runtime/contrib/tensorrt/tensorrt_utils.h
+++ b/src/runtime/contrib/tensorrt/tensorrt_utils.h
@@ -32,7 +32,7 @@

 #define TRT_VERSION_GE(major, minor, patch)                                                    \
   ((NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
-  (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))
+   (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

 namespace tvm {
 namespace runtime {

clang-format lint error found. Consider running clang-format-10 on these files to fix them.
script returned exit code 1

Is there a reason we have both of these checks? They appear to check the same thing but demand different formatting. Is there a way to override one of them?

@comaniac
Contributor

Looks like a conflict between cpplint and clang-format-10. The clang-format-10 result seems more reasonable, so we may need to fix cpplint.

cc @areusch

@comaniac
Contributor

@t-vi looks like this is a common issue in cpplint as it only uses regular expressions to lint the code. While it's not straightforward to fix cpplint, you may work around this issue by rewriting the code like this:

#define TRT_VERSION_GE(major, minor, patch) (                                                 \
  (NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
  (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

So that both cpplint and clang-format are happy.

@areusch
Contributor

areusch commented Sep 16, 2020

looks like we are on cpplint 1.4.5. I tried running it at 1.5.4, the latest release, but no change. There is an option in clang-format (AlignAfterOpenBracket) to control bracket alignment style, but the google style-guide setting it's at right now is, I think, the most reasonable. Further, I don't think cpplint can differentiate between #define continuations and regular code, so we can't configure that there.

you could try manually fixing the line to make cpplint happy, then add on surrounding lines:
// clang-format off
// clang-format on

this seems like an edge case in cpplint, so i'd vote for that.

python/tvm/relay/op/contrib/tensorrt.py
return self.var_map[var]
return super().visit_var(var)

class SubgraphRemover(ExprMutator):
Member

Can we just remove the attributes of the function, e.g. inline, and then run the inline pass?

Contributor Author

@trevor-m trevor-m Sep 16, 2020

I originally tried that approach. However, when the tensorrt subgraphs are inlined, TVM will try to optimize the code in the tensorrt subgraphs (for example it will change conv2d to contrib_conv2d_winograd_without_weight_transform) which we don't want.

Contributor

This is the issue we discussed in this PR about how to deal with post-partitioning judgements. We could later on figure out an approach to generalize this requirement.

src/runtime/contrib/tensorrt/tensorrt_runtime.cc
* already built TRT engines and load into trt_engine_cache_ so they don't
* have to be built at first inference.
*/
bool GetCachedEnginesFromDisk() {
Member

I am not sure if we need these two serialization methods. Can we just rely on LoadFrom/SaveToBinary?

Contributor Author

Building the TensorRT engine, which is done on the first inference, can be very slow. On edge devices it can even take up to an hour. NVIDIA provides an API to serialize/load the built TRT engine after it is built to avoid repeating this slow process.

This serialization method is separate from LoadFrom/SaveToBinary and is there to expose TRT's engine serialization/loading API to the user so they won't have to rebuild the engine every time they load the model.

There is some more info here: https://neo-ai-dlr.readthedocs.io/en/latest/tensorrt.html#caching-tensorrt-engines

Contributor

Could you override the default SaveToBinary in the json runtime and optionally save the engine if one exists (and/or based on a config option)? When LoadFromBinary is called, since you have defined your own serialization method you can check for the existence of the engine and load it back. Essentially you have two different serialization/deserialization methods which you can alternate between in LoadFrom/SaveToBinary

Contributor Author

@trevor-m trevor-m Sep 18, 2020

AFAIK SaveToBinary is only ever invoked during compilation.

The engine is only built during runtime because it is specific to the target GPU and platform, so CacheEnginesToDisk needs to be performed by the runtime also.

See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work

To optimize your model for inference, TensorRT takes your network definition, performs optimizations including platform-specific optimizations, and generates the inference engine. This process is referred to as the build phase. The build phase can take considerable time, especially when running on embedded platforms. Therefore, a typical application will build an engine once, and then serialize it as a plan file for later use.

Note: The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU.

Contributor

This is an interesting discussion. I realized that this is more like a serialization for platform-dependent TensorRT engines. If it's not possible to build and serialize the engine during the compilation (or cross-compilation) even we have built the TVM with TensorRT runtime, then this is probably inevitable; otherwise we may build the engine and serialize the bit-stream along with other artifacts in SaveToBinary.

If the serialization here is inevitable, which I believe in it because users may not have TensorRT during compilation, then the next question is whether we can update the ".so" file with the serialized engine here instead of creating a separate file. In other words, the .so file may or may not contain a serialized engine, but if it has, we don't need to build it again.

Contributor Author

Thanks @comaniac that is correct. The engine is platform-dependent so it is not possible to create it during compilation; it must be done at runtime.

I think it is an interesting idea to update the .so with the built engine. I think the TVM runtime doesn't contain the necessary components to be able to serialize to .so. It could also introduce some weird behavior (you run a model on one NVIDIA device, it stores the built engine in the .so, then you take the model and try to run it on a different NVIDIA device and it wouldn't work).

This extra serialization is not required to use TRT, which is why it is only exposed via an optional environment variable. It is useful for edge devices, however, where building the TRT engine can take up to an hour.
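
For illustration, a minimal sketch of how a user might opt into the on-disk engine cache (the environment variable name TVM_TENSORRT_CACHE_DIR is an assumption based on the linked docs; check the deployment docs for the actual name):

import os

# Assumed opt-in: point the TensorRT runtime at a directory for cached engines.
os.environ["TVM_TENSORRT_CACHE_DIR"] = "/path/to/engine_cache"

# Run inference as usual. The first inference builds the TRT engine and
# serializes it into this directory; subsequent runs load the cached engine
# instead of rebuilding it.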


@zhiics
Member

zhiics commented Sep 18, 2020

cc @wpan11nv this might be something you or some of your Nvidia folks are interested in.

Contributor

@comaniac comaniac left a comment

Finished reviewing the tutorial and the Python code.

CMakeLists.txt
@@ -76,6 +76,8 @@ tvm_option(USE_COREML "Build with coreml support" OFF)
tvm_option(USE_TARGET_ONNX "Build with ONNX Codegen support" OFF)
tvm_option(USE_ARM_COMPUTE_LIB "Build with Arm Compute Library" OFF)
tvm_option(USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME "Build with Arm Compute Library graph runtime" OFF)
tvm_option(USE_TENSORRT "Build with TensorRT" OFF)
Contributor

@comaniac comaniac Sep 21, 2020

The message is a bit confusing. USE_TENSORRT means enabling the TensorRT codegen for graph partitioning. It doesn't require TensorRT to be available in the system environment. IIUC, maybe it's better to say "Build with TensorRT codegen", although I just found that "Build with Arm Compute Library" has the same issue.

@lhutton1 could you also share your thoughts about this?

Contributor Author

Thanks for the review Cody!

You're right, the names aren't really that clear here. Originally, I had them as USE_TENSORRT_CODEGEN for codegen only and USE_TENSORRT for both codegen and runtime. I changed them to match the ACL definitions.

Contributor

Agreed this is confusing. I think changing to ..._CODEGEN would be a better description of what the option actually does.

@comaniac
Contributor

Given that we are not enabling TRT codegen on CI now due to the lack of TensorRT, I suggest we bypass this issue first to get the PR merged. Meanwhile, it would be better to have a troubleshooting section in the TRT codegen tutorial.

@zhiics
Member

zhiics commented Oct 12, 2020

I agree that we can merge it first. But before that, @trevor-m could you rebase against master and run the tests again locally to see if all of them pass? I am not sure if everything is okay after the diagnostic error reporting was merged.

@trevor-m
Contributor Author

I agree that we can merge it first. But before that, @trevor-m could you rebase against master and run the tests again locally to see if all of them pass? I am not sure if everything is okay after the diagnostic error reporting was merged.

Thanks Zhi, I ran the test_tensorrt.py locally for both codegen only and with runtime and all of the tests passed.

@zhiics
Member

zhiics commented Oct 13, 2020

I think we can enable the test and merge after #6679 is landed since it's pretty close already? Sorry for the back and forth.

@trevor-m
Contributor Author

I think we can enable the test and merge after #6679 is landed since it's pretty close already?

Agreed, makes sense. I'll re-enable the test once #6679 is merged.

@tqchen
Member

tqchen commented Oct 13, 2020

Notably, we will need to wait until the docker image is updated, not just for the PR to be merged. I believe @jroesch might be working on an image update; we can let him chime in once it lands. Hopefully it won't be blocked for too long. We can also land now and re-enable later.

Trevor Morris added 8 commits October 19, 2020 17:48
Support input nodes with multiple data entries

Fix failing tests

Support layout transform, add engine caching

Add comment

Add PruneSubgraph pass

Use prune_subgraph pass, make params member of trt runtime class

Hide deprecation warnings coming from TRT headers

Remove general prune subgraph

Save/load use_implicit_batch and workspace size

Clean up

Fix cpp lint

Addressing review comments

Refactor tests

Use relay.bind instead of VarReplacer. Improve some annotation functions

Add TRT docs

Use DLOG, formatting

Use logging.info instead of print

also  refactor integ tests

also  refactor integ tests

Formatting

Formatting

Format python

fix python format

Fix pylint

Fix sphinx precheck

Add tensorrt.rst to toctree

Allow codegen to be tested when TRT runtime is not available. Enable TRT codegen in CI

linty

Address more comments

Formatting

Formatting
@trevor-m
Contributor Author

CI has passed with USE_TENSORRT_CODEGEN ON since the new CI container is used.

@zhiics zhiics merged commit af8636a into apache:main Oct 19, 2020
@zhiics
Member

zhiics commented Oct 19, 2020

Thanks @trevor-m @comaniac @lhutton1 @leandron

@zhiics zhiics added status: accepted and removed status: need review status: need update need update based on feedbacks labels Oct 19, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Oct 29, 2020

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 2, 2020

trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020

trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
@ehion

ehion commented Oct 17, 2023

I am curious how to run AutoTVM with TensorRT. When I run

mod = partition_for_tensorrt(mod, params)
*********
tasks = autotvm.task.extract_from_program(
        mod['main'], target=target, params=params
        # , ops=(relay.op.get("nn.conv2d"),)
    )

I get this error:

ValueError: Cannot find global var "tvmgen_default_tensorrt_main_0" in the Module
candidates are: ["main"]

I can see an example here: tvm/tests/python/unittest/test_meta_schedule_byoc_tensorrt.py
but I can't find any other docs about AutoTVM with TensorRT support.
I hope a doc can be provided about how to run TensorRT with AutoTVM, thanks!
