[BYOC][TensorRT] TensorRT BYOC integration #6395
Conversation
Usually we need a separate PR to install the library/environment in the CI first (TRT in this case) so that we can have better e2e tests.
How about using the Config mechanism? I learned about this from the Ethos integration (thanks @mbaret) and it cleaned up my code as well. See the definition of ConfigNode below and its usage (grep for …).
Thanks @masahi! Let me look into this.
For the other two points: the annotation function could query the TensorRT version through a registered global function, e.g.

def conv2d(...):
    # Reject if TVM was not built with the TensorRT runtime.
    if not tvm.get_global_func("relay.tensorrt.version", True):
        return False
    # Call the packed function to obtain the linked TensorRT version string.
    ver = tvm.get_global_func("relay.tensorrt.version")()
    if ver == '1.0':
        return True
    return False

If you need to manually set up the TensorRT version, then it could be like this: let the user specify it in …
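For reference, this lookup works because the TensorRT runtime registers a packed function under the name "relay.tensorrt.version". A minimal Python-side stand-in of that registration (hypothetical; in the actual integration the function would be registered from the C++ runtime, and the returned version string is illustrative only):

import tvm

# Hypothetical stand-in for the packed function the TensorRT runtime would register in C++.
@tvm.register_func("relay.tensorrt.version")
def _trt_version():
    return "6.0.1"  # illustrative version string

print(tvm.get_global_func("relay.tensorrt.version")())  # -> "6.0.1"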
Thanks for the patch. I reviewed mostly the Python sources and have a few comments. One thing that needs fixing is the use of print statements, which should be replaced with logging.
Thanks everyone for the great feedback! I've been busy lately, but I plan to start addressing the comments this week.
Thanks @comaniac!
Hmm, this seems like it would make the job of the … Btw, just for reference, here is the general implementation of PruneSubgraph that I originally implemented: trevor-m@06015a4
I have already created an API to retrieve the TRT version if TVM is compiled with the TRT runtime enabled. However, one of our use cases is to use TVM on a CPU-only instance to cross-compile models. For that use case, we want to be able to target compilation for different TRT versions - this mostly affects the partitioning rules. I don't think having to rebuild TVM for each target version will be a good solution. Is it possible for my annotation functions to access the pass context and therefore a TRT config that I will be adding, as @masahi suggested? I don't see any other Python code accessing the PassContext, though...
My main concern was that it would be tedious to have a … On the other hand, in order not to block this PR for too long, we can maybe follow the current flow first and discuss a plan to refactor the partition pass to better support this requirement. @zhiics do you have any suggestions?
Looks like …
Thanks! That makes sense. My implementation seems tightly coupled to how PartitionGraph works. I like the idea of adding the callback to PartitionGraph. After it puts together the function, it can check if there is a validation function registered and call it to see if it should keep the subgraph or not. Both MXNet and TF have a mechanism like this as a final check on the subgraph in their partitioning algorithms. I agree that solving this problem is probably best done in a separate PR.
Zhi just pointed out to me offline how to access pass context configs in Python. Here is an example:

import tvm

with tvm.transform.PassContext(config={"relay.fallback_device_type": 5}):
    pass_ctx = tvm.transform.PassContext.current()
    print(pass_ctx.config["relay.fallback_device_type"])
I see, in that case I think my current implementation using a global variable is fine since it is all confined within the one file.
Nice! Let me try that.
Yeah, I think it's okay to have a refinement pass for TRT ATM, since making such a decision in the current partitioning is not easy. In the long run, we should make the partitioning pass more intelligent by taking in some configurations and partitioning the regions accordingly. Or we could consider some of the configs when merging the regions. That would need more investigation.
Just a couple of minor suggestions and comments - looks good overall, setting aside the comments/concerns above.
One overall suggestion from a user's perspective: it might be useful to write a beginner's guide describing how to install and build with TensorRT support. Feel free to ignore, though.
Thanks @lhutton1 for the review! Is …
Np @trevor-m, it seems that way, at least that's where I added the Arm Compute Library doc.
There appears to be an inconsistency in the CI between cpplint and clang-format.
If I change it to 2 spaces, cpplint passes but the clang-format checker fails.
Is there a reason we have both of these checks? They appear to do the same thing but want different things. Is there a way to override one of them?
Looks like a conflict between cpplint and clang-format-10. The clang-format-10 result seems more reasonable, so we may need to fix cpplint. cc @areusch
@t-vi looks like this is a common issue in cpplint, as it only uses regular expressions to lint the code. While it's not straightforward to fix cpplint, you may work around the issue by rewriting the code like:

#define TRT_VERSION_GE(major, minor, patch)                                                    \
  ((NV_TENSORRT_MAJOR > major) || (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR > minor) || \
   (NV_TENSORRT_MAJOR == major && NV_TENSORRT_MINOR == minor && NV_TENSORRT_PATCH >= patch))

so that both cpplint and clang-format are happy.
Looks like we are on cpplint 1.4.5. I tried running it at 1.5.4, the latest release, but no change. There is an option in clang-format (AlignAfterOpenBracket) to control bracket alignment style, but I think the google style-guide setting it's at right now is the most reasonable. Further, I don't think cpplint can differentiate between #define continuations and regular code, so we can't configure that there. You could try manually fixing the line to make cpplint happy, then add … on the surrounding lines; this seems like an edge case in cpplint, so I'd vote for that.
return self.var_map[var]
return super().visit_var(var)

class SubgraphRemover(ExprMutator):
Can we just remove the attributes of the function, e.g. inline, and then run the inline pass?
I originally tried that approach. However, when the TensorRT subgraphs are inlined, TVM will try to optimize the code in the TensorRT subgraphs (for example, it will change conv2d to contrib_conv2d_winograd_without_weight_transform), which we don't want.
This is the issue we discussed in this PR about how to deal with post-partitioning judgements. We could later on figure out an approach to generalize this requirement.
 * already built TRT engines and load into trt_engine_cache_ so they don't
 * have to be built at first inference.
 */
bool GetCachedEnginesFromDisk() {
I am not sure if we need these two serialization methods. Can we just rely on LoadFrom/SaveToBinary?
Building the TensorRT engine, which is done on the first inference, can be very slow - on edge devices it can even take up to an hour. NVIDIA provides an API to serialize/load the built TRT engine to avoid repeating this slow process.
This serialization method is separate from LoadFrom/SaveToBinary and is there to expose TRT's engine serialization/loading API to the user so they won't have to rebuild the engine every time they load the model.
There is some more info here: https://neo-ai-dlr.readthedocs.io/en/latest/tensorrt.html#caching-tensorrt-engines
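To make the caching flow concrete, a hedged sketch of how a user would opt in to the on-disk engine cache; the environment-variable name is taken from the linked DLR documentation and should be confirmed against the final TVM docs:

import os

# Assumed variable name (from the linked DLR docs): directory for serialized TensorRT engines.
os.environ["TVM_TENSORRT_CACHE_DIR"] = "/path/to/engine_cache"

# ...then load and run the compiled module as usual. On the first run, the built engine
# is serialized into this directory; later runs load it back and skip the slow build.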
Could you override the default SaveToBinary in the json runtime and optionally save the engine if one exists (and/or based on a config option)? When LoadFromBinary is called, since you have defined your own serialization method you can check for the existence of the engine and load it back. Essentially you have two different serialization/deserialization methods which you can alternate between in LoadFrom/SaveToBinary
AFAIK SaveToBinary is only ever invoked during compilation. The engine is only built during runtime because it is specific to the target GPU and platform, so CacheEnginesToDisk needs to be performed by the runtime as well.
See https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work:
"To optimize your model for inference, TensorRT takes your network definition, performs optimizations including platform-specific optimizations, and generates the inference engine. This process is referred to as the build phase. The build phase can take considerable time, especially when running on embedded platforms. Therefore, a typical application will build an engine once, and then serialize it as a plan file for later use.
Note: The generated plan files are not portable across platforms or TensorRT versions. Plans are specific to the exact GPU model they were built on (in addition to the platforms and the TensorRT version) and must be re-targeted to the specific GPU in case you want to run them on a different GPU."
This is an interesting discussion. I realized that this is more like a serialization mechanism for platform-dependent TensorRT engines. If it's not possible to build and serialize the engine during compilation (or cross-compilation) even when TVM is built with the TensorRT runtime, then this is probably inevitable; otherwise we may build the engine and serialize the bit-stream along with the other artifacts in SaveToBinary.
If the serialization here is inevitable, which I believe it is because users may not have TensorRT during compilation, then the next question is whether we can update the ".so" file with the serialized engine here instead of creating a separate file. In other words, the .so file may or may not contain a serialized engine, but if it does, we don't need to build it again.
Thanks @comaniac, that is correct. The engine is platform-dependent, so it is not possible to create it during compilation; it must be done at runtime.
I think it is an interesting idea to update the .so with the built engine. However, I don't think the TVM runtime contains the necessary components to serialize to .so. It could also introduce some weird behavior: you run a model on one NVIDIA device, it stores the built engine in the .so, then you take the model and try to run it on a different NVIDIA device and it wouldn't work.
This extra serialization is not required to use TRT, which is why it is only exposed via an optional environment variable. It is useful for edge devices, however, where building the TRT engine can take up to an hour.
cc @wpan11nv - this might be something you or some of your NVIDIA folks are interested in.
Finished reviewing the tutorial and the Python code.
CMakeLists.txt (Outdated)
@@ -76,6 +76,8 @@ tvm_option(USE_COREML "Build with coreml support" OFF)
tvm_option(USE_TARGET_ONNX "Build with ONNX Codegen support" OFF)
tvm_option(USE_ARM_COMPUTE_LIB "Build with Arm Compute Library" OFF)
tvm_option(USE_ARM_COMPUTE_LIB_GRAPH_RUNTIME "Build with Arm Compute Library graph runtime" OFF)
tvm_option(USE_TENSORRT "Build with TensorRT" OFF)
The message is a bit confusing. USE_TENSORRT means enabling the TensorRT codegen for graph partitioning; it doesn't require TensorRT to be available in the system environment. IIUC, maybe it's better to say "Build with TensorRT codegen", although I just found that "Build with Arm Compute Library" has the same issue.
@lhutton1 could you also share your thoughts about this?
Thanks for the review, Cody!
You're right, the names aren't really that clear here. Originally, I had them as USE_TENSORRT_CODEGEN for codegen only and USE_TENSORRT for both codegen and runtime. I changed them to match the ACL definitions.
Agreed, this is confusing. I think changing to ..._CODEGEN would be a better description of what the option actually does.
Given that we are not enabling TRT codegen in the CI now due to the lack of TensorRT, I suggest we bypass this issue first to get the PR merged. Meanwhile, it would be better to have a troubleshooting section in the TRT codegen tutorial.
I agree that we can merge it first. But before that, @trevor-m could you rebase against master and run the tests again locally to see if all of them pass? I am not sure if everything is okay after the diagnostic error reporting was merged.
Thanks Zhi, I ran the …
I think we can enable the test and merge after #6679 lands, since it's pretty close already? Sorry for the back and forth.
Note that we will need to wait until the Docker image is updated, not just for the PR to merge. I believe @jroesch might be working on an image update, so we can let him chime in once it lands. Hopefully this won't block things for too long. We can also land now and re-enable later.
CI has passed with USE_TENSORRT_CODEGEN ON since the new CI container is used.
* TensorRT integration using JSONRuntime Support input nodes with multiple data entries Fix failing tests Support layout transform, add engine caching Add comment Add PruneSubgraph pass Use prune_subgraph pass, make params member of trt runtime class Hide deprecation warnings coming from TRT headers Remove general prune subgraph Save/load use_implicit_batch and workspace size Clean up Fix cpp lint Addressing review comments Refactor tests Use relay.bind instead of VarReplacer. Improve some annotation functions Add TRT docs Use DLOG, formatting Use logging.info instead of print also refactor integ tests also refactor integ tests Formatting Formatting Format python fix python format Fix pylint Fix sphinx precheck Add tensorrt.rst to toctree Allow codegen to be tested when TRT runtime is not available. Enable TRT codegen in CI linty Address more comments Formatting Formatting
* Documentation changes
* Address comments
* Rename USE_TENSORRT->USE_TENSORRT_CODEGEN and USE_TENSORRT_GRAPH_RUNTIME->USE_TENSORRT_RUNTIME
* Fix comment typo
* Test CI without TRT codegen enabled
* formatting
* Enable USE_TENSORRT_CODEGEN in CI
* Change file_util.h -> file_utils.h
I am curious how to run AutoTVM with TensorRT. When I run … I get this error log: …
I can see an example here: tvm/tests/python/unittest/test_meta_schedule_byoc_tensorrt.py
This PR adds support for partitioning, compiling, and running the TensorRT BYOC target.
Building
There are two new cmake flags:
* USE_TENSORRT=ON/OFF: enables TensorRT code generation - this does not require the TensorRT libraries.
* USE_TENSORRT_GRAPH_RUNTIME=ON/OFF/"path/to/TensorRT": enables the TensorRT runtime - this requires the TensorRT libraries. A system-wide install of TensorRT from a deb package or JetPack can be detected by "ON", but a .tar.gz installation requires you to provide the path to the extracted TensorRT archive.
Usage
The compilation target should be "cuda" to ensure that input and output args to the TensorRT functions are placed on the GPU.
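To make the flow concrete, here is a minimal sketch of the intended usage, assuming partition_for_tensorrt returns the partitioned module along with a config dict that is passed through the PassContext; the "relay.ext.tensorrt.options" key follows the eventual TVM docs and may differ from this PR's exact API:

import tvm
from tvm import relay
from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

def compile_with_tensorrt(mod, params):
    """Partition supported ops to TensorRT and build the rest with TVM, targeting CUDA."""
    mod, config = partition_for_tensorrt(mod, params)
    # Target "cuda" so inputs/outputs of the TensorRT functions are placed on the GPU.
    with tvm.transform.PassContext(opt_level=3,
                                   config={"relay.ext.tensorrt.options": config}):
        return relay.build(mod, target="cuda", params=params)

The returned library can then be exported with lib.export_library("compiled_trt_model.so") and loaded on the target machine (see the runtime sketch below).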
High level components
Partitioning
The annotation rules for TensorRT change depending on the version of TensorRT that is being targeted as well as the "batching mode". This can be configured with the trt_version and use_implicit_batch args of partition_for_tensorrt. If TVM was built against the TensorRT library, the linked version is used for partitioning instead.
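For cross-compilation on a machine without TensorRT, a hedged sketch of supplying those knobs; the argument names follow the description above, while the exact format of the version value is an assumption:

from tvm.relay.op.contrib.tensorrt import partition_for_tensorrt

def partition_for_trt6(mod, params):
    """Hypothetical: partition for TensorRT 6 with implicit batch mode."""
    return partition_for_tensorrt(
        mod,
        params,
        trt_version=(6, 0, 1),     # assumed format for the targeted TensorRT version
        use_implicit_batch=True,   # implicit vs. explicit batch mode changes the annotation rules
    )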
Codegen
This implementation uses the JSONRuntime JSONSerializer base class for codegen to serialize the Relay expression to a JSON format.
Runtime
Runtime is handled by the runtime module class in tensorrt_runtime.cc. At runtime, the TensorRTBuilder class (tensorrt_builder.cc) is first used to convert the JSON graph to a TensorRT INetworkDefinition using the TensorRT APIs, relying on the op converter classes in tensorrt_ops.cc. Then the TensorRT engine is built; this process can take up to a few minutes because TensorRT performs its optimizations at this point. The engine is cached for further inference calls.
The runtime can be compiled against many TensorRT versions thanks to if guards. It will work for TensorRT 5, 6, and 7. However, the compiled model must have been partitioned for a TensorRT version <= the version used at runtime. Otherwise, the compiled model may expect ops to be available which require a newer version of TensorRT.
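For context, a hedged sketch of the runtime side on a CUDA machine built with the TensorRT runtime; the module path and input name are hypothetical, and graph_runtime was the executor API at the time of this PR (later renamed graph_executor):

import numpy as np
import tvm
from tvm.contrib import graph_runtime  # renamed to graph_executor in later TVM releases

dev = tvm.gpu(0)  # tvm.cuda(0) in later TVM releases
lib = tvm.runtime.load_module("compiled_trt_model.so")  # hypothetical exported library
module = graph_runtime.GraphModule(lib["default"](dev))

# "data" and the shape below are placeholders for the model's real input(s).
module.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
module.run()  # the first run builds (or loads) the TensorRT engine and can be slow
out = module.get_output(0).asnumpy()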
Areas I'm looking for feedback and ideas
* TensorRT has parameters such as max_workspace_size and use_implicit_batch which I want the user to be able to supply in partition_for_tensorrt. These parameters need to be passed along to the codegen and stored in the serialized graph until runtime. use_implicit_batch also influences the partitioning rules. Currently, I'm using environment variables to pass these from Python to the codegen in C++. I wonder if there is a better way to do this? (A config-based alternative is sketched after this list.)
* I've implemented a transformation called prune_tensorrt_subgraphs() in python/tvm/relay/op/contrib/tensorrt.py. This is run after partitioning and allows me to decide whether to keep a subgraph or return it back to the typical TVM compilation path. This is needed because some subgraphs could be invalid - such as when the inputs have different batch sizes - or for optimization purposes if the subgraph has no multiply-accumulates. I have also implemented a general version of this in C++, but it uses the global registry to allow each codegen target to define its own is_invalid_subgraph callback. In the future we can switch to the generic version if we find a better way to register the callbacks.
* The targeted TensorRT version needs to be accessed during annotation. I've put it in a global variable for now.
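Following the Config/PassContext suggestion earlier in this thread, a hedged sketch of how these options could travel through a pass-context config instead of environment variables; the config key and option names are illustrative, and the example assumes TVM was built with the TensorRT codegen so the key is actually registered:

import tvm

# Illustrative option bundle; the real integration may use a registered ConfigNode instead.
trt_options = {"max_workspace_size": 1 << 30, "use_implicit_batch": True}

with tvm.transform.PassContext(opt_level=3,
                               config={"relay.ext.tensorrt.options": trt_options}):
    # Annotation or codegen code can read the options back from the current pass context.
    ctx = tvm.transform.PassContext.current()
    print(ctx.config["relay.ext.tensorrt.options"])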