[BYOC][TensorRT] TensorRT BYOC integration #6395

trevor-m · 2020-09-03T19:55:00Z

This PR adds support for partitioning, compiling, and running the TensorRT BYOC target.

Building

There are two new cmake flags:

USE_TENSORRT=ON/OFF: enables TensorRT code generation - this does not require TensorRT libraries
USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/"path/to/TensorRT": enables TensorRT runtime - this requires TensorRT libraries. A system wide install of TensorRT from a deb package or JetPack can be detected by "ON", but a .tar.gz installation requires you to provide path to the extracted TensorRT archive.

Usage

The compilation target should be "cuda" to ensure that input and output args to the TensorRT functions are placed on the GPU.

# Compilation
from tvm.relay.op.contrib import tensorrt
mod = tensorrt.partition_for_tensorrt(mod, params)
with relay.build_config(opt_level=3):
  graph, lib, params = relay.build(mod, target="cuda", params=params)
# Running inference is unchanged
mod = graph_runtime.create(graph, lib, ctx=tvm.gpu(0))
mod.run(...)

High level components

Partitioning

The annotation rules for TensorRT change depending on the version of TensorRT that is being targeted as well as the "batching mode". This can be configured with the trt_version and use_implicit_batch args of partition_for_tensorrt.

If TVM was built against the TensorRT library, the linked version is used for partitioning instead.

Codegen

This implementation using the JSONRuntime JSONSerializer base class for codegen to serialize the relay expression to a json format.

Runtime

Runtime is handled by the runtime module class in tensorrt_runtime.cc. During runtime, it first uses the TensorRTBuilder class (tensorrt_builder.cc) is used to convert the json graph to a TensorRT INetworkDefinition using TensorRT APIs. It uses the op converter classes in tensorrt_ops.cc to do this. Then, the TensorRT engine is built, this process can take up to a few minutes because TensorRT will perform its optimizations at this point. The engine is cached for further inference calls.

The runtime can be compiled against many TensorRT versions thanks to if guards. It will work for TensorRT 5, 6, and 7. However, the compiled model must have been partitioned for a TensorRT version <= the version used at runtime. Otherwise, the compiled model may expect ops to be available which require a newer version of TensorRT.

Areas I'm looking for feedback and ideas

TensorRT has parameters such as max_workspace_size and use_implicit_batch which I want the user to be able to supply in partition_for_tensorrt. These parameters need to be passed along to the codegen and stored in the serialized graph until runtime. use_implicit_batch also influences the partitioning rules. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?
I've implemented a transformation called prune_tensorrt_subgraphs() in python/tvm/relay/op/contrib/tensorrt.py. This is run after partitioning and allows me to decide whether to keep a subgraph or return it back to the typical TVM compilation path. This is needed because some subgraphs could be invalid - such as when the inputs have different batch sizes or for optimization purposes if the subgraph has no multiply-accumulates. I have also implemented a general version of this in C++, but it uses the global registry to allow each codegen target to define its own is_invalid_subgraph callback. In the future we can switch to the generic version if we find a better way to register the callbacks.
The targeted tensorrt version needs to be accessed during annotation. I've put it in a global variable for now.

zhiics · 2020-09-03T20:05:35Z

cc @comaniac @mbaret @lhutton1 @masahi @leandron @mbrookhart

zhiics · 2020-09-03T20:25:49Z

Usually we need a separate PR to install the library/environment in the CI first (TRT in this case) so that we can have better e2e tests.

masahi · 2020-09-03T21:15:48Z

1. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?

How about using Config mechanism? I learned about this from ethos integration (thanks @mbaret) and it cleaned up my code as well. See the definition of ConfigNode below and its usage (grep for GetConfig).

https://github.com/apache/incubator-tvm/blob/30cd2302e4078b3a8787e30d70fd79e5b729ec82/src/relay/backend/contrib/ethosn/codegen_ethosn.h#L219

trevor-m · 2020-09-03T21:20:33Z

1. Currently, I'm using environment variables to pass these from python to the codegen in C++. I wonder if there is a better way to do this?
How about using Config mechanism? I learned about this from ethos integration (thanks @mbaret) and it cleaned up my code as well. See the definition of ConfigNode below and its usage (grep for GetConfig).

https://github.com/apache/incubator-tvm/blob/30cd2302e4078b3a8787e30d70fd79e5b729ec82/src/relay/backend/contrib/ethosn/codegen_ethosn.h#L219

Thanks @masahi! Let me look into this.

comaniac · 2020-09-03T21:38:23Z

For the rest 2 points.

Is that possible to move the pass before partitioning but after merge compiler region (like PruneTesnorRTCompilerRegion)? After the merge compiler region pass you should get the Relay graph with almost the same semantic as partitioning. If you could have a pass checking each compiler region for your constraints, you can probably just remove the region you don't want, so that you should get only valid partitioned functions.
Can the TensorRT version be obtained via an API call in C++? Something like tensorrt::get_version()? If so you can register a global symbol and pass the version to Python so that it can be used by the annotator.

def conv2d(...):
    if not tvm.get_global_func("relay.tensorrt.version", True):
        return False
    ver = tvm.get_global_func("relay.tensorrt.version")
    if ver == '1.0':
        return True
    return False

If you need manually set up the TensorRT version, then it could be like this: Let user specify it in config.cmake and we pass the value to a macro in C++ so that you could simply return the value. The drawback of this solution is that it needs to rebuild TVM to annotate different TensorRT versions, and I'm not sure if that makes sense to you.

zhiics · 2020-09-03T21:43:21Z

@trevor-m @masahi for the pass config, we may not be able to obtain at the runtime though

leandron

Thanks for the patch. I reviewed mostly the Python sources, and have a few comments. One thing that needs fixing it the use of print statements, to be replaced with logging.

python/tvm/relay/op/contrib/tensorrt.py

trevor-m · 2020-09-14T19:13:22Z

Thanks everyone for the great feedback! I've been busy lately, but I plan to start addressing the comments this week.

trevor-m · 2020-09-14T20:33:31Z

Thanks @comaniac!

Is that possible to move the pass before partitioning but after merge compiler region (like PruneTesnorRTCompilerRegion)? After the merge compiler region pass you should get the Relay graph with almost the same semantic as partitioning. If you could have a pass checking each compiler region for your constraints, you can probably just remove the region you don't want, so that you should get only valid partitioned functions.

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTesnorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?

Btw just for referece, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

Can the TensorRT version be obtained via an API call in C++? Something like tensorrt::get_version()? If so you can register a global symbol and pass the version to Python so that it can be used by the annotator. If you need manually set up the TensorRT version, then it could be like this: Let user specify it in config.cmake and we pass the value to a macro in C++ so that you could simply return the value. The drawback of this solution is that it needs to rebuild TVM to annotate different TensorRT versions, and I'm not sure if that makes sense to you.

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.

Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

comaniac · 2020-09-14T21:00:06Z

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTesnorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?

Btw just for referece, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also in this case, your post-processing pass depends on the partition pass and may fail along with the change of the partition pass. If this requirement is important, I'd even prefer to add post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.

Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

Looks like GetConfig does not expose to the Python side.

trevor-m · 2020-09-14T21:11:06Z

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTesnorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?
Btw just for referece, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also in this case, your post-processing pass depends on the partition pass and may fail along with the change of the partition pass. If this requirement is important, I'd even prefer to add post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

Thanks! That makes sense. My implementation seems tightly coupled to how PartitionGraph works.

I like the idea of adding the callback to PartitionGraph. After it puts together the function, it can check if there is a validation function registered and call it to see if it should keep the subgraph or not. Both MXNet and TF have a mechanism like this as a final check on the subgraph in their partitioning algorithms.

I agree that solving this problem is probably best done in a separate PR.

comaniac · 2020-09-14T21:12:48Z

Zhi just pointed to me offline about how to access pass context configs in Python. Here is an example:

import tvm
with tvm.transform.PassContext(config={"relay.fallback_device_type": 5}):
    pass_ctx = tvm.transform.PassContext.current()
    print(pass_ctx.config["relay.fallback_device_type"])

trevor-m · 2020-09-14T21:16:53Z

I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this affects the partitioning rules mostly. I don't think having to rebuild TVM for each target version will be a good solution.
Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding as @masahi suggested? I don't see any other python code accessing the PassContext though...

Looks like GetConfig does not expose to the Python side.

I see, in that case I think my current implementation using a global variable is fine since it is all confined within the one file.

Zhi just pointed to me offline about how to access pass context configs in Python. Here is an example:
import tvm
with tvm.transform.PassContext(config={"relay.fallback_device_type": 5}):
    pass_ctx = tvm.transform.PassContext.current()
    print(pass_ctx.config["relay.fallback_device_type"])

Nice! Let me try that.

zhiics · 2020-09-14T21:39:33Z

Hmm, this seems like it would make the job of the PruneTensorRTSubgraph pass much more difficult. PartitionGraph already takes care of collecting the inputs and outputs of a subgraph and additional processing such as making sure there are no duplicate outputs. If PruneTesnorRTCompilerRegion was before PartitionGraph, it would have to duplicate a lot of that work. The idea of the pruning pass is that we should present each backend with the final subgraph exactly as it would be when it is passed to the codegen and the backend should decide if it is valid or not. Are you concerned about the overhead of partitioning a subgraph which would be later discarded?
Btw just for referece, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4

My main concern was that it would be tedious to have a partition_graph -> revert_some_partitions flow. Also in this case, your post-processing pass depends on the partition pass and may fail along with the change of the partition pass. If this requirement is important, I'd even prefer to add post-processing feature to the partition pass that allows you to provide a packed function to check if a partitioned function is valid.

On the other hand, in order to not block this PR for too long, we can maybe follow the current flow first, and discuss a plan of refactoring the partition pass to better support this requirement.

@zhiics do you have any suggestion?

Yeah, I think its okay to have a refinement pass for TRT ATM since doing such a decision in the current partitioning is not easy. In the long run, we should make the partitioning pass more intelligent by taking in some configurations and partitioning over the region accordingly. Or we can consider some of the configs when merging the regions. That would need more investigation.

lhutton1

Just a couple of minor suggestions and comments - looks good overall, without regard to comments/concerns above.

One overall suggestion from a users perspective, it might be useful to write a beginners guide stating how to install and build with TensorRT support. Feel free to ignore though.

cmake/modules/contrib/TensorRT.cmake

python/tvm/relay/op/contrib/tensorrt.py

src/runtime/contrib/tensorrt/tensorrt_builder.cc

src/runtime/contrib/tensorrt/tensorrt_builder.h

src/runtime/contrib/tensorrt/tensorrt_runtime.cc

tests/python/contrib/test_tensorrt.py

trevor-m · 2020-09-15T16:55:22Z

Just a couple of minor suggestions and comments - looks good overall, without regard to comments/concerns above.

One overall suggestion from a users perspective, it might be useful to write a beginners guide stating how to install and build with TensorRT support. Feel free to ignore though.

Thanks @lhutton1 for the review! Is docs/deploy the typical place for a guide like that?

lhutton1 · 2020-09-15T17:08:43Z

Np @trevor-m, it seems that way, at least that's where I added the Arm Compute Library doc

trevor-m · 2020-09-15T22:47:04Z

There appears to be an inconsistency in the CI between tests/lint/cpplint.sh and tests/lint/clang_format.sh for this macro definition:

#define TRT_VERSION_GE(major, minor, patch)                                                    \
  ((NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
   (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

tests/lint/cpplint.sh will complain about the 3 spaces indent on the 3rd line:

src/runtime/contrib/tensorrt/tensorrt_utils.h:35:  Weird number of spaces at line-start.  Are you using a 2-space indent?  [whitespace/indent] [3]

If I change it to 2 spaces, the cpplint passes but the clang-format checker fails:

---------clang-format log----------
diff --git a/src/runtime/contrib/tensorrt/tensorrt_utils.h b/src/runtime/contrib/tensorrt/tensorrt_utils.h
index 6d664e47d..746726fc1 100644
--- a/src/runtime/contrib/tensorrt/tensorrt_utils.h
+++ b/src/runtime/contrib/tensorrt/tensorrt_utils.h
@@ -32,7 +32,7 @@

 #define TRT_VERSION_GE(major, minor, patch)                                                    \
   ((NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
-  (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))
+   (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

 namespace tvm {
 namespace runtime {

clang-format lint error found. Consider running clang-format-10 on these files to fix them.
script returned exit code 1

Is there a reason we have both of these checks? They appear to do the same thing but want different things. Is there a way to override one of them?

comaniac · 2020-09-15T23:06:54Z

Looks like a conflict between cpplint and clang-format-10. clang-format-10 result seems more reasonable, so we may need to fix the cpplint.

cc @areusch

comaniac · 2020-09-15T23:28:01Z

@t-vi looks like this is a common issue in cpplint as it only uses regular expression to lint the code. While it's not straightforward to fix cpplint, you may work around this issue by rewriting the code like

#define TRT_VERSION_GE(major, minor, patch) (                                                 \
  (NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
  (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

So that both cpplint and clang-format are happy.

areusch · 2020-09-16T16:16:50Z

looks like we are on cpplint 1.4.5. I tried running it at 1.5.4, latest release, but no change. There is an option in clang-format (AlignAfterOpenBracket) to control bracket alignment style, but the google style-guide setting it's at right now I think is the most reasonable. Further, I don't think cpplint can differentiate between #define continuations and regular code, so we can't configure that there.

you could try manually fixing the line to make cpplint happy, then add on surrounding lines:
// clang-format off
// clang-format on

this seems like an edge case in cpplint, so i'd vote for that.

python/tvm/relay/op/contrib/tensorrt.py

zhiics · 2020-09-16T16:50:05Z

python/tvm/relay/op/contrib/tensorrt.py

+                return self.var_map[var]
+            return super().visit_var(var)
+
+    class SubgraphRemover(ExprMutator):


Can we just remove the attributes of the function, e.g. inline, and then run the inline pass?

I originally tried that approach. However, when the tensorrt subgraphs are inlined, TVM will try to optimize the code in the tensorrt subgraphs (for example it will change conv2d to contrib_conv2d_winograd_without_weight_transform) which we don't want.

This is the issue we discussed in this PR about how to deal with post-partitioning judgements. We could later on figure out an approach to generalize this requirement.

python/tvm/relay/op/contrib/tensorrt.py

src/runtime/contrib/tensorrt/tensorrt_runtime.cc

zhiics · 2020-09-16T18:02:20Z

src/runtime/contrib/tensorrt/tensorrt_runtime.cc

+   * already built TRT engines and load into trt_engine_cache_ so they don't
+   * have to be built at first inference.
+   */
+  bool GetCachedEnginesFromDisk() {


I am not sure if we need these two serialization methods. Can we just rely on LoadFrom/SaveToBinary?

Building the TensorRT engine which is done on the first inference can be very slow. On edge devices it can even take up to an hour. NVIDIA provides an API to serialize/load the built TRT engine after it is built to avoid repeating this slow process.

This serialization method is separate from the LoadFrom/SaveToBinary and is there to expose TRT's engine serialization/loading API to the user so they won't have to rebuild the engine every time they load the model.

There is some more info here: https://neo-ai-dlr.readthedocs.io/en/latest/tensorrt.html#caching-tensorrt-engines

Could you override the default SaveToBinary in the json runtime and optionally save the engine if one exists (and/or based on a config option)? When LoadFromBinary is called, since you have defined your own serialization method you can check for the existence of the engine and load it back. Essentially you have two different serialization/deserialization methods which you can alternate between in LoadFrom/SaveToBinary

AFAIK SaveToBinary is only ever invoked during compilation.

The engine is only built during runtime because it is specific to the target GPU and platform, so CacheEnginesToDisk needs to be performed by the runtime also.

See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work

To optimize your model for inference, TensorRT takes your network definition, performs optimizations including platform-specific optimizations, and generates the inference engine. This process is referred to as the build phase. The build phase can take considerable time, especially when running on embedded platforms. Therefore, a typical application will build an engine once, and then serialize it as a plan file for later use.

Note: The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU.

This is an interesting discussion. I realized that this is more like a serialization for platform-dependent TensorRT engines. If it's not possible to build and serialize the engine during the compilation (or cross-compilation) even we have built the TVM with TensorRT runtime, then this is probably inevitable; otherwise we may build the engine and serialize the bit-stream along with other artifacts in SaveToBinary.

If the serialization here is inevitable, which I believe in it because users may not have TensorRT during compilation, then the next question is whether we can update the ".so" file with the serialized engine here instead of creating a separate file. In other words, the .so file may or may not contain a serialized engine, but if it has, we don't need to build it again.

Thanks @comaniac that is correct. The engine is platform-dependent so it is not possible to create it during the compilation, it must be done at runtime.

It think it is an interesting idea to update the .so with the built engine. I think the TVM runtime doesn't contain the necessary components to be able to serialize to .so. It could also introduce some weird behavior (you run a model on one NVIDIA device, it stores the built engine in the .so, then you take the model and try to run it on a different NVIDIA device and it wouldn't work).

This extra serialization is not required to use TRT which is why it is only exposed via an optional environment variable. It is useful for edge devices however where building the TRT engine can take up to an hour.

docs/deploy/tensorrt.rst

lhutton1 · 2020-09-18T09:48:27Z

src/runtime/contrib/tensorrt/tensorrt_runtime.cc

+   * already built TRT engines and load into trt_engine_cache_ so they don't
+   * have to be built at first inference.
+   */
+  bool GetCachedEnginesFromDisk() {


Could you override the default SaveToBinary in the json runtime and optionally save the engine if one exists (and/or based on a config option)? When LoadFromBinary is called, since you have defined your own serialization method you can check for the existence of the engine and load it back. Essentially you have two different serialization/deserialization methods which you can alternate between in LoadFrom/SaveToBinary

zhiics · 2020-09-18T21:27:02Z

cc @wpan11nv this might be something you or some of your Nvidia forks interested.

comaniac

Finish reviewing the tutorial and the Python code.

comaniac · 2020-09-21T22:24:26Z

CMakeLists.txt

@@ -76,6 +76,8 @@ tvm_option(USE_COREML "Build with coreml support" OFF)
 tvm_option(USE_TARGET_ONNX "Build with ONNX Codegen support" OFF)
 tvm_option(USE_ARM_COMPUTE_LIB "Build with Arm Compute Library" OFF)
 tvm_option(USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME "Build with Arm Compute Library graph runtime" OFF)
+tvm_option(USE_TENSORRT "Build with TensorRT" OFF)


The message is a bit confusing. USE_TENSORRT means enabling the TensorRT codegen for graph partitininog. It doesn't require TensorRT to be available in the system environment. IIUC, maybe it's better to say "Build with TensorRT codegen", although I just found that "Build with Arm Compute Library" has the same issue.

@lhutton1 could you also share your thoughts about this?

Thanks for the review Cody!

You're right, the names aren't really that clear here. Originally, I had them as USE_TENSORRT_CODEGEN for codegen only and USE_TENSORRT for both codegen and runtime. I changed them to match the ACL definitions.

Agreed this is confusing. I think changing to ..._CODEGEN would be a better description of what the option actually does.

python/tvm/relay/op/contrib/tensorrt.py

docs/deploy/tensorrt.rst

comaniac · 2020-10-12T23:00:48Z

Given that we are not enabling TRT codegen on CI now due to the lack of TensorRT, I suggest we bypass this issue first to get the PR merged. Meanwhile, it would be better to have a troubleshooting in the TRT codegen tutorial.

zhiics · 2020-10-12T23:13:19Z

I agree that we can merge it first. But before that, @trevor-m could you rebase against the master and run the tests again locally to see if all of them pass? I am not sure if everything is oaky after the diagnostic error reporting was merged.

trevor-m · 2020-10-13T18:14:10Z

I agree that we can merge it first. But before that, @trevor-m could you rebase against the master and run the tests again locally to see if all of them pass? I am not sure if everything is oaky after the diagnostic error reporting was merged.

Thanks Zhi, I ran the test_tensorrt.py locally for both codegen only and with runtime and all of the tests passed.

zhiics · 2020-10-13T18:18:28Z

I think we can enable the test and merge after #6679 is landed since its pretty close already? Sorry for the back and forth.

trevor-m · 2020-10-13T18:19:20Z

I think we can enable the test and merge after #6679 is landed since its pretty close already?

Agreed, makes sense. I'll renable test once #6679 is merged.

tqchen · 2020-10-13T21:34:07Z

Notably we will need to wait until the docker image is updated instead just the PR merge. I believe @jroesch might be working on an image update that we can let him chime in once it lands. hopefully not blocking it for too long. We can also land and re-enable later

Support input nodes with multiple data entries Fix failing tests Support layout transform, add engine caching Add comment Add PruneSubgraph pass Use prune_subgraph pass, make params member of trt runtime class Hide deprecation warnings coming from TRT headers Remove general prune subgraph Save/load use_implicit_batch and workspace size Clean up Fix cpp lint Addressing review comments Refactor tests Use relay.bind instead of VarReplacer. Improve some annotation functions Add TRT docs Use DLOG, formatting Use logging.info instead of print also refactor integ tests also refactor integ tests Formatting Formatting Format python fix python format Fix pylint Fix sphinx precheck Add tensorrt.rst to toctree Allow codegen to be tested when TRT runtime is not available. Enable TRT codegen in CI linty Address more comments Formatting Formatting

…IME->USE_TENSORRT_RUNTIME

trevor-m · 2020-10-19T23:00:20Z

CI has passed with USE_TENSORRT_CODEGEN ON since the new CI container is used.

zhiics · 2020-10-19T23:03:40Z

Thanks @trevor-m @comaniac @lhutton1 @leandron

* TensorRT integration using JSONRuntime Support input nodes with multiple data entries Fix failing tests Support layout transform, add engine caching Add comment Add PruneSubgraph pass Use prune_subgraph pass, make params member of trt runtime class Hide deprecation warnings coming from TRT headers Remove general prune subgraph Save/load use_implicit_batch and workspace size Clean up Fix cpp lint Addressing review comments Refactor tests Use relay.bind instead of VarReplacer. Improve some annotation functions Add TRT docs Use DLOG, formatting Use logging.info instead of print also refactor integ tests also refactor integ tests Formatting Formatting Format python fix python format Fix pylint Fix sphinx precheck Add tensorrt.rst to toctree Allow codegen to be tested when TRT runtime is not available. Enable TRT codegen in CI linty Address more comments Formatting Formatting * Documentation changes * Address comments * Rename USE_TENSORRT->USE_TENSORRT_CODEGEN and USE_TENSORRT_GRAPH_RUNTIME->USE_TENSORRT_RUNTIME * Fix comment typo * Test CI without TRT codegen enabled * formatting * Enable USE_TENSORRT_CODEGEN in CI * Change file_util.h -> file_utils.h

ehion · 2023-10-17T15:08:16Z

i am curious how to run autotvm with tensorrt when i run

mod = partition_for_tensorrt(mod, params)
*********
tasks = autotvm.task.extract_from_program(
        mod['main'], target=target, params=params
        # , ops=(relay.op.get("nn.conv2d"),)
    )

error log :
ValueError: Cannot find global var "tvmgen_default_tensorrt_main_0" in the Module
candidates are: ["main"]

i can see a example here:tvm/tests/python/unittest/test_meta_schedule_byoc_tensorrt.py
but can't find other doc about autotvm with tensorrt support
hope to provide a doc about how to run tensorrt in autotvm,thx!

trevor-m force-pushed the trt-json-runtime branch from 19cb494 to 9024bf2 Compare September 3, 2020 19:59

trevor-m force-pushed the trt-json-runtime branch from 9024bf2 to a44e02d Compare September 3, 2020 20:05

zhiics self-assigned this Sep 3, 2020

leandron reviewed Sep 4, 2020

View reviewed changes

ZihengJiang added the status: need update need update based on feedbacks label Sep 9, 2020

lhutton1 reviewed Sep 15, 2020

View reviewed changes

tqchen added the status: need review label Sep 15, 2020

zhiics reviewed Sep 16, 2020

View reviewed changes

lhutton1 reviewed Sep 18, 2020

View reviewed changes

comaniac requested changes Sep 21, 2020

View reviewed changes

trevor-m mentioned this pull request Oct 13, 2020

[CI] Install xgboost>=1.1.0 in CI container #6679

Merged

Trevor Morris added 8 commits October 19, 2020 17:48

Documentation changes

298b3f2

Address comments

8760c65

Rename USE_TENSORRT->USE_TENSORRT_CODEGEN and USE_TENSORRT_GRAPH_RUNT…

13186c7

…IME->USE_TENSORRT_RUNTIME

Fix comment typo

5603ac9

Test CI without TRT codegen enabled

1e8763f

formatting

4db9a64

Enable USE_TENSORRT_CODEGEN in CI

240f665

trevor-m force-pushed the trt-json-runtime branch from 5c8a1cd to 240f665 Compare October 19, 2020 17:50

Change file_util.h -> file_utils.h

0d9bf62

zhiics merged commit af8636a into apache:main Oct 19, 2020

zhiics added status: accepted and removed status: need review status: need update need update based on feedbacks labels Oct 19, 2020

trevor-m mentioned this pull request Dec 2, 2020

Sync from apache/tvm 12/2/2020, Replace TRT integration with JSONRuntime implementation neo-ai/tvm#181

Merged

junrushao mentioned this pull request Nov 1, 2021

Apache TVM v0.8 Release Note Candidate #9416

Closed

[BYOC][TensorRT] TensorRT BYOC integration #6395

[BYOC][TensorRT] TensorRT BYOC integration #6395

Conversation

trevor-m commented Sep 3, 2020 • edited Loading

Building

Usage

High level components

Partitioning

Codegen

Runtime

Areas I'm looking for feedback and ideas

zhiics commented Sep 3, 2020

zhiics commented Sep 3, 2020

masahi commented Sep 3, 2020 • edited Loading

trevor-m commented Sep 3, 2020

comaniac commented Sep 3, 2020 • edited Loading

zhiics commented Sep 3, 2020

leandron left a comment

Choose a reason for hiding this comment

trevor-m commented Sep 14, 2020

trevor-m commented Sep 14, 2020 • edited Loading

comaniac commented Sep 14, 2020

trevor-m commented Sep 14, 2020

comaniac commented Sep 14, 2020

trevor-m commented Sep 14, 2020

zhiics commented Sep 14, 2020

lhutton1 left a comment

Choose a reason for hiding this comment

trevor-m commented Sep 15, 2020

lhutton1 commented Sep 15, 2020

trevor-m commented Sep 15, 2020

comaniac commented Sep 15, 2020

comaniac commented Sep 15, 2020

areusch commented Sep 16, 2020

Choose a reason for hiding this comment

trevor-m Sep 16, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

trevor-m Sep 18, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

zhiics commented Sep 18, 2020

comaniac left a comment

Choose a reason for hiding this comment

comaniac Sep 21, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

comaniac commented Oct 12, 2020

zhiics commented Oct 12, 2020

trevor-m commented Oct 13, 2020

zhiics commented Oct 13, 2020 • edited Loading

trevor-m commented Oct 13, 2020

tqchen commented Oct 13, 2020

trevor-m commented Oct 19, 2020

zhiics commented Oct 19, 2020

ehion commented Oct 17, 2023 • edited Loading

trevor-m commented Sep 3, 2020 •

edited

Loading

masahi commented Sep 3, 2020 •

edited

Loading

comaniac commented Sep 3, 2020 •

edited

Loading

trevor-m commented Sep 14, 2020 •

edited

Loading

trevor-m Sep 16, 2020 •

edited

Loading

trevor-m Sep 18, 2020 •

edited

Loading

comaniac Sep 21, 2020 •

edited

Loading

zhiics commented Oct 13, 2020 •

edited

Loading

ehion commented Oct 17, 2023 •

edited

Loading