New Plugin EfficientNMSX #3920

Open · wants to merge 2 commits into base: release/10.0

Conversation

@levipereira commented Jun 1, 2024

@samurdhikaru @johnnynunez
Continuing the discussion initiated in PR #3859
I removed the YoloNMS plugin and created a new plugin named EfficientNMSX (the 'X' representing the index output) within the structure of the EfficientNMSPlugin. The changes involved creating a new plugin that reuses the current index-generation logic and simply adds a new output layer that returns the detection indices.
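For illustration, a minimal sketch (assuming onnx-graphsurgeon; tensor names such as det_indices are illustrative, not taken from this PR's code) of how the node can be attached to an ONNX graph. The attribute set mirrors the EfficientNMS_TRT attributes that appear in the node dump later in this thread, plus the fifth output for the kept-box indices:

import numpy as np
import onnx
import onnx_graphsurgeon as gs

graph = gs.import_onnx(onnx.load("model.onnx"))
boxes, scores = graph.outputs  # assumes the graph currently ends in [boxes, scores]

outputs = [
    gs.Variable("num_dets", dtype=np.int32, shape=[-1, 1]),
    gs.Variable("det_boxes", dtype=np.float32, shape=[-1, 100, 4]),
    gs.Variable("det_scores", dtype=np.float32, shape=[-1, 100]),
    gs.Variable("det_classes", dtype=np.int32, shape=[-1, 100]),
    # The extra output EfficientNMSX adds: the input index of each kept box.
    gs.Variable("det_indices", dtype=np.int32, shape=[-1, 100]),
]
nmsx = gs.Node(op="EfficientNMSX_TRT", name="nmsx",
               inputs=[boxes, scores], outputs=outputs,
               attrs={"background_class": -1, "box_coding": 1,
                      "iou_threshold": 0.45, "score_threshold": 0.25,
                      "max_output_boxes": 100, "score_activation": 0,
                      "plugin_version": "1"})
graph.nodes.append(nmsx)
graph.outputs = outputs
onnx.save(gs.export_onnx(graph.cleanup().toposort()), "model-nmsx.onnx")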

Since the changes were minimal, I did not implement the IPluginV3 interface, as that would have required a complete overhaul of the entire plugin structure.

I conducted all tests on both the EfficientNMS_TRT and EfficientNMSX_TRT plugins, and both functioned correctly.

I would appreciate your suggestion regarding the IPluginV3 implementation. Should we update the entire EfficientNMSPlugin now, or should we continue with the current approach and make the switch to IPluginV3 all at once during a future upgrade?

These changes were also implemented on release/8.6 (for testing purposes only):
https://github.com/levipereira/TensorRT/tree/release/8.6

Signed-off-by: Levi Pereira <levi.pereira@gmail.com>
@levipereira (Author)

@samurdhikaru
https://github.com/levipereira/deepstream-yolo-e2e/
I would like to share a real example of implementing End2End using the EfficientNMSX plugin for instance segmentation models. I have successfully implemented the plugin in TensorRT 8.5, 8.6, and 10.0 in my repository and am currently using it with Triton Server and DeepStream.
As we discussed earlier, enabling the community to build End2End models, instead of relying on external post-processing, significantly reduces overall latency. I believe this pull request will greatly benefit the community.
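To make the latency point concrete, here is a minimal host-side sketch (NumPy; tensor names and the 32-coefficient layout are assumptions based on typical YOLOv8-seg exports, not code from this PR) of what the index output enables for instance segmentation. Each kept detection's mask coefficients are gathered directly by index, so no score filtering or box matching remains in post-processing:

import numpy as np

def gather_masks(num_dets, det_indices, mask_coeffs, mask_protos):
    # num_dets:    (batch, 1)               valid detections per image
    # det_indices: (batch, max_out)         anchor index of each kept box
    # mask_coeffs: (batch, num_anchors, 32) per-anchor mask coefficients
    # mask_protos: (batch, 32, mh, mw)      prototype masks from the seg head
    batch, _, mh, mw = mask_protos.shape
    masks = []
    for b in range(batch):
        n = int(num_dets[b, 0])
        coeffs = mask_coeffs[b, det_indices[b, :n]]       # (n, 32), gathered by index
        logits = coeffs @ mask_protos[b].reshape(32, -1)  # (n, mh*mw)
        masks.append((1.0 / (1.0 + np.exp(-logits))).reshape(n, mh, mw))
    return masks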

@demuxin

demuxin commented Jun 27, 2024

Hi @levipereira, thank you for your work. Can you provide a demo of using the output indices?

As far as I know, TensorRT has an issue with data-dependent shapes (DDS). You can look at this link.

Can the EfficientNMSX plugin resolve this issue?

@levipereira (Author)

levipereira commented Jun 27, 2024

@andrew-93

Hi @levipereira, I have some problems working with your plugin.
First, I will describe the stages, step by step:

  1. I downloaded TensorRT from your repository and installed it as shown below:
git clone https://github.com/levipereira/TensorRT.git --branch release/8.5 --recurse-submodules
mkdir -p TensorRT/build && cd TensorRT/build
cmake ..
cd samples/trtexec
make -j$(nproc)
cd /tmp/TensorRT/build
cp libnvcaffeparser.so* /usr/lib/x86_64-linux-gnu/
mkdir -p /usr/src/tensorrt/bin
cp trtexec /usr/src/tensorrt/bin/
  2. Download weights:
cd /my_folder/BEST_WEIGHTS/
wget https://github.com/levipereira/yolo_e2e/releases/download/v1.0/yolov8x-seg-trt.onnx
  3. I try to run trtexec with this ONNX:
    /usr/src/tensorrt/bin/trtexec --onnx="/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx"

Then I get an error:

[08/01/2024-12:00:57] [I] === Model Options ===
[08/01/2024-12:00:57] [I] Format: ONNX
[08/01/2024-12:00:57] [I] Model: /my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx
[08/01/2024-12:00:57] [I] Output:
[08/01/2024-12:00:57] [I] === Build Options ===
[08/01/2024-12:00:57] [I] Max batch: explicit batch
[08/01/2024-12:00:57] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/01/2024-12:00:57] [I] minTiming: 1
[08/01/2024-12:00:57] [I] avgTiming: 8
[08/01/2024-12:00:57] [I] Precision: FP32
[08/01/2024-12:00:57] [I] LayerPrecisions: 
[08/01/2024-12:00:57] [I] Layer Device Types: 
[08/01/2024-12:00:57] [I] Calibration: 
[08/01/2024-12:00:57] [I] Refit: Disabled
[08/01/2024-12:00:57] [I] Version Compatible: Disabled
[08/01/2024-12:00:57] [I] ONNX Native InstanceNorm: Disabled
[08/01/2024-12:00:57] [I] TensorRT runtime: full
[08/01/2024-12:00:57] [I] Lean DLL Path: 
[08/01/2024-12:00:57] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/01/2024-12:00:57] [I] Exclude Lean Runtime: Disabled
[08/01/2024-12:00:57] [I] Sparsity: Disabled
[08/01/2024-12:00:57] [I] Safe mode: Disabled
[08/01/2024-12:00:57] [I] Build DLA standalone loadable: Disabled
[08/01/2024-12:00:57] [I] Allow GPU fallback for DLA: Disabled
[08/01/2024-12:00:57] [I] DirectIO mode: Disabled
[08/01/2024-12:00:57] [I] Restricted mode: Disabled
[08/01/2024-12:00:57] [I] Skip inference: Disabled
[08/01/2024-12:00:57] [I] Save engine: 
[08/01/2024-12:00:57] [I] Load engine: 
[08/01/2024-12:00:57] [I] Profiling verbosity: 0
[08/01/2024-12:00:57] [I] Tactic sources: Using default tactic sources
[08/01/2024-12:00:57] [I] timingCacheMode: local
[08/01/2024-12:00:57] [I] timingCacheFile: 
[08/01/2024-12:00:57] [I] Heuristic: Disabled
[08/01/2024-12:00:57] [I] Preview Features: Use default preview flags.
[08/01/2024-12:00:57] [I] MaxAuxStreams: -1
[08/01/2024-12:00:57] [I] BuilderOptimizationLevel: -1
[08/01/2024-12:00:57] [I] Input(s)s format: fp32:CHW
[08/01/2024-12:00:57] [I] Output(s)s format: fp32:CHW
[08/01/2024-12:00:57] [I] Input build shapes: model
[08/01/2024-12:00:57] [I] Input calibration shapes: model
[08/01/2024-12:00:57] [I] === System Options ===
[08/01/2024-12:00:57] [I] Device: 0
[08/01/2024-12:00:57] [I] DLACore: 
[08/01/2024-12:00:57] [I] Plugins:
[08/01/2024-12:00:57] [I] setPluginsToSerialize:
[08/01/2024-12:00:57] [I] dynamicPlugins:
[08/01/2024-12:00:57] [I] ignoreParsedPluginLibs: 0
[08/01/2024-12:00:57] [I] 
[08/01/2024-12:00:57] [I] === Inference Options ===
[08/01/2024-12:00:57] [I] Batch: Explicit
[08/01/2024-12:00:57] [I] Input inference shapes: model
[08/01/2024-12:00:57] [I] Iterations: 10
[08/01/2024-12:00:57] [I] Duration: 3s (+ 200ms warm up)
[08/01/2024-12:00:57] [I] Sleep time: 0ms
[08/01/2024-12:00:57] [I] Idle time: 0ms
[08/01/2024-12:00:57] [I] Inference Streams: 1
[08/01/2024-12:00:57] [I] ExposeDMA: Disabled
[08/01/2024-12:00:57] [I] Data transfers: Enabled
[08/01/2024-12:00:57] [I] Spin-wait: Disabled
[08/01/2024-12:00:57] [I] Multithreading: Disabled
[08/01/2024-12:00:57] [I] CUDA Graph: Disabled
[08/01/2024-12:00:57] [I] Separate profiling: Disabled
[08/01/2024-12:00:57] [I] Time Deserialize: Disabled
[08/01/2024-12:00:57] [I] Time Refit: Disabled
[08/01/2024-12:00:57] [I] NVTX verbosity: 0
[08/01/2024-12:00:57] [I] Persistent Cache Ratio: 0
[08/01/2024-12:00:57] [I] Inputs:
[08/01/2024-12:00:57] [I] === Reporting Options ===
[08/01/2024-12:00:57] [I] Verbose: Disabled
[08/01/2024-12:00:57] [I] Averages: 10 inferences
[08/01/2024-12:00:57] [I] Percentiles: 90,95,99
[08/01/2024-12:00:57] [I] Dump refittable layers:Disabled
[08/01/2024-12:00:57] [I] Dump output: Disabled
[08/01/2024-12:00:57] [I] Profile: Disabled
[08/01/2024-12:00:57] [I] Export timing to JSON file: 
[08/01/2024-12:00:57] [I] Export output to JSON file: 
[08/01/2024-12:00:57] [I] Export profile to JSON file: 
[08/01/2024-12:00:57] [I] 
[08/01/2024-12:00:57] [I] === Device Information ===
[08/01/2024-12:00:57] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[08/01/2024-12:00:57] [I] Compute Capability: 7.5
[08/01/2024-12:00:57] [I] SMs: 68
[08/01/2024-12:00:57] [I] Device Global Memory: 11008 MiB
[08/01/2024-12:00:57] [I] Shared Memory per SM: 64 KiB
[08/01/2024-12:00:57] [I] Memory Bus Width: 352 bits (ECC disabled)
[08/01/2024-12:00:57] [I] Application Compute Clock Rate: 1.545 GHz
[08/01/2024-12:00:57] [I] Application Memory Clock Rate: 7 GHz
[08/01/2024-12:00:57] [I] 
[08/01/2024-12:00:57] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/01/2024-12:00:57] [I] 
[08/01/2024-12:00:57] [I] TensorRT version: 8.6.1
[08/01/2024-12:00:57] [I] Loading standard plugins
[08/01/2024-12:00:58] [I] [TRT] [MemUsageChange] Init CUDA: CPU +308, GPU +0, now: CPU 321, GPU 474 (MiB)
[08/01/2024-12:01:00] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +263, GPU +76, now: CPU 638, GPU 550 (MiB)
[08/01/2024-12:01:00] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[08/01/2024-12:01:00] [I] Start parsing network model.
[08/01/2024-12:01:00] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-12:01:00] [I] [TRT] Input filename:   /my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx
[08/01/2024-12:01:00] [I] [TRT] ONNX IR version:  0.0.7
[08/01/2024-12:01:00] [I] [TRT] Opset version:    14
[08/01/2024-12:01:00] [I] [TRT] Producer name:    pytorch
[08/01/2024-12:01:00] [I] [TRT] Producer version: 1.14.0
[08/01/2024-12:01:00] [I] [TRT] Domain:           
[08/01/2024-12:01:00] [I] [TRT] Model version:    0
[08/01/2024-12:01:00] [I] [TRT] Doc string:       
[08/01/2024-12:01:00] [I] [TRT] ----------------------------------------------------------------
[08/01/2024-12:01:00] [W] [TRT] onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/01/2024-12:01:01] [I] [TRT] No importer registered for op: EfficientNMSX_TRT. Attempting to import as plugin.
[08/01/2024-12:01:01] [I] [TRT] Searching for plugin: EfficientNMSX_TRT, plugin_version: 1, plugin_namespace: 
[08/01/2024-12:01:01] [E] [TRT] ModelImporter.cpp:726: While parsing node number 430 [EfficientNMSX_TRT -> "num_dets"]:
[08/01/2024-12:01:01] [E] [TRT] ModelImporter.cpp:727: --- Begin node ---
[08/01/2024-12:01:01] [E] [TRT] ModelImporter.cpp:728: input: "/end2end/Unsqueeze_output_0"
input: "/end2end/Slice_4_output_0"
output: "num_dets"
output: "det_boxes"
output: "det_scores"
output: "det_classes"
output: "/end2end/EfficientNMSX_TRT_output_4"
name: "/end2end/EfficientNMSX_TRT"
op_type: "EfficientNMSX_TRT"
attribute {
  name: "background_class"
  ints: -1
  type: INTS
}
attribute {
  name: "box_coding"
  ints: 1
  type: INTS
}
attribute {
  name: "iou_threshold"
  f: 0.45
  type: FLOAT
}
attribute {
  name: "max_output_boxes"
  i: 100
  type: INT
}
attribute {
  name: "plugin_version"
  s: "1"
  type: STRING
}
attribute {
  name: "score_activation"
  i: 0
  type: INT
}
attribute {
  name: "score_threshold"
  f: 0.25
  type: FLOAT
}
domain: "TRT"

[08/01/2024-12:01:01] [E] [TRT] ModelImporter.cpp:729: --- End node ---
[08/01/2024-12:01:01] [E] [TRT] ModelImporter.cpp:732: ERROR: builtin_op_importers.cpp:5428 In function importFallbackPluginImporter:
[8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"
[08/01/2024-12:01:01] [E] Failed to parse onnx file
[08/01/2024-12:01:01] [I] Finished parsing network model. Parse time: 0.901706
[08/01/2024-12:01:01] [E] Parsing model failed
[08/01/2024-12:01:01] [E] Failed to create engine from model or file.
[08/01/2024-12:01:01] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v8601] # /usr/src/tensorrt/bin/trtexec --onnx=/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx

How can I fix this problem?

@andrew-93

Moreover, this problem occurs with both the release/8.5 and release/8.6 branches.

@levipereira (Author)

Moreover, this problem occurs with both the release/8.5 and release/8.6 branches.

It’s possible that the library isn’t being updated correctly in /usr/lib/x86_64-linux-gnu/. I’ve already compiled the library for x86 environments, so you can skip the compilation steps and use the pre-compiled libraries available in my repository.

You can easily update and use the models by executing the following script:

GitHub Repository: deepstream-yolo-e2e - TensorRT Plugin - patch_libnvinfer.sh

@andrew-93

@levipereira Is it possible to somehow build "trtexec" binaries for x86_64 and aarch64 (Jetson) for TensorRT 8.5? I really need these files to convert from ONNX to engine, which I will run using my C++ code.

@levipereira (Author)

levipereira commented Aug 2, 2024

You only need the libnvinfer_plugin.so library. I have it for x86_64 for TRT 8.5, but for aarch64 I have only compiled it for TRT 8.6.
https://github.com/levipereira/deepstream-yolo-e2e/blob/master/TensorRTPlugin/patch_libnvinfer.sh#L18
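If you prefer not to overwrite the system library, a minimal sketch (assuming the standard TensorRT Python API; the path is a placeholder) of loading the patched libnvinfer_plugin.so explicitly before parsing or deserializing; trtexec also offers a --plugins option for the same purpose:

import ctypes
import tensorrt as trt

# Load the patched plugin library globally so its plugin creators are
# registered with TensorRT's plugin registry.
ctypes.CDLL("/path/to/patched/libnvinfer_plugin.so.8", mode=ctypes.RTLD_GLOBAL)
trt.init_libnvinfer_plugins(trt.Logger(trt.Logger.INFO), "")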

@andrew-93

andrew-93 commented Aug 2, 2024

Now I get a new error:

root@13a029675ac6:/tmp# /usr/src/tensorrt/bin/trtexec --onnx="/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx" --saveEngine="/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.engine" --shapes="images:1x3x512x1024" --fp16 --explicitBatch --workspace=4096 --useCudaGraph
&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # /usr/src/tensorrt/bin/trtexec --onnx=/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx --saveEngine=/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.engine --shapes=images:1x3x512x1024 --fp16 --explicitBatch --workspace=4096 --useCudaGraph
[08/02/2024-16:18:27] [W] --explicitBatch flag has been deprecated and has no effect!
[08/02/2024-16:18:27] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[08/02/2024-16:18:27] [W] --workspace flag has been deprecated by --memPoolSize flag.
[08/02/2024-16:18:27] [I] === Model Options ===
[08/02/2024-16:18:27] [I] Format: ONNX
[08/02/2024-16:18:27] [I] Model: /my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx
[08/02/2024-16:18:27] [I] Output:
[08/02/2024-16:18:27] [I] === Build Options ===
[08/02/2024-16:18:27] [I] Max batch: explicit batch
[08/02/2024-16:18:27] [I] Memory Pools: workspace: 4096 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/02/2024-16:18:27] [I] minTiming: 1
[08/02/2024-16:18:27] [I] avgTiming: 8
[08/02/2024-16:18:27] [I] Precision: FP32+FP16
[08/02/2024-16:18:27] [I] LayerPrecisions: 
[08/02/2024-16:18:27] [I] Calibration: 
[08/02/2024-16:18:27] [I] Refit: Disabled
[08/02/2024-16:18:27] [I] Sparsity: Disabled
[08/02/2024-16:18:27] [I] Safe mode: Disabled
[08/02/2024-16:18:27] [I] DirectIO mode: Disabled
[08/02/2024-16:18:27] [I] Restricted mode: Disabled
[08/02/2024-16:18:27] [I] Build only: Disabled
[08/02/2024-16:18:27] [I] Save engine: /my_folder/BEST_WEIGHTS/yolov8x-seg-trt.engine
[08/02/2024-16:18:27] [I] Load engine: 
[08/02/2024-16:18:27] [I] Profiling verbosity: 0
[08/02/2024-16:18:27] [I] Tactic sources: Using default tactic sources
[08/02/2024-16:18:27] [I] timingCacheMode: local
[08/02/2024-16:18:27] [I] timingCacheFile: 
[08/02/2024-16:18:27] [I] Heuristic: Disabled
[08/02/2024-16:18:27] [I] Preview Features: Use default preview flags.
[08/02/2024-16:18:27] [I] Input(s)s format: fp32:CHW
[08/02/2024-16:18:27] [I] Output(s)s format: fp32:CHW
[08/02/2024-16:18:27] [I] Input build shape: images=1x3x512x1024+1x3x512x1024+1x3x512x1024
[08/02/2024-16:18:27] [I] Input calibration shapes: model
[08/02/2024-16:18:27] [I] === System Options ===
[08/02/2024-16:18:27] [I] Device: 0
[08/02/2024-16:18:27] [I] DLACore: 
[08/02/2024-16:18:27] [I] Plugins:
[08/02/2024-16:18:27] [I] === Inference Options ===
[08/02/2024-16:18:27] [I] Batch: Explicit
[08/02/2024-16:18:27] [I] Input inference shape: images=1x3x512x1024
[08/02/2024-16:18:27] [I] Iterations: 10
[08/02/2024-16:18:27] [I] Duration: 3s (+ 200ms warm up)
[08/02/2024-16:18:27] [I] Sleep time: 0ms
[08/02/2024-16:18:27] [I] Idle time: 0ms
[08/02/2024-16:18:27] [I] Streams: 1
[08/02/2024-16:18:27] [I] ExposeDMA: Disabled
[08/02/2024-16:18:27] [I] Data transfers: Enabled
[08/02/2024-16:18:27] [I] Spin-wait: Disabled
[08/02/2024-16:18:27] [I] Multithreading: Disabled
[08/02/2024-16:18:27] [I] CUDA Graph: Enabled
[08/02/2024-16:18:27] [I] Separate profiling: Disabled
[08/02/2024-16:18:27] [I] Time Deserialize: Disabled
[08/02/2024-16:18:27] [I] Time Refit: Disabled
[08/02/2024-16:18:27] [I] NVTX verbosity: 0
[08/02/2024-16:18:27] [I] Persistent Cache Ratio: 0
[08/02/2024-16:18:27] [I] Inputs:
[08/02/2024-16:18:27] [I] === Reporting Options ===
[08/02/2024-16:18:27] [I] Verbose: Disabled
[08/02/2024-16:18:27] [I] Averages: 10 inferences
[08/02/2024-16:18:27] [I] Percentiles: 90,95,99
[08/02/2024-16:18:27] [I] Dump refittable layers:Disabled
[08/02/2024-16:18:27] [I] Dump output: Disabled
[08/02/2024-16:18:27] [I] Profile: Disabled
[08/02/2024-16:18:27] [I] Export timing to JSON file: 
[08/02/2024-16:18:27] [I] Export output to JSON file: 
[08/02/2024-16:18:27] [I] Export profile to JSON file: 
[08/02/2024-16:18:27] [I] 
[08/02/2024-16:18:27] [I] === Device Information ===
[08/02/2024-16:18:27] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[08/02/2024-16:18:27] [I] Compute Capability: 7.5
[08/02/2024-16:18:27] [I] SMs: 68
[08/02/2024-16:18:27] [I] Compute Clock Rate: 1.545 GHz
[08/02/2024-16:18:27] [I] Device Global Memory: 11008 MiB
[08/02/2024-16:18:27] [I] Shared Memory per SM: 64 KiB
[08/02/2024-16:18:27] [I] Memory Bus Width: 352 bits (ECC disabled)
[08/02/2024-16:18:27] [I] Memory Clock Rate: 7 GHz
[08/02/2024-16:18:27] [I] 
[08/02/2024-16:18:27] [I] TensorRT version: 8.5.2
[08/02/2024-16:18:27] [I] [TRT] [MemUsageChange] Init CUDA: CPU +308, GPU +0, now: CPU 321, GPU 695 (MiB)
[08/02/2024-16:18:29] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +263, GPU +76, now: CPU 638, GPU 771 (MiB)
[08/02/2024-16:18:29] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See `CUDA_MODULE_LOADING` in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[08/02/2024-16:18:29] [I] Start parsing network model
[08/02/2024-16:18:29] [I] [TRT] ----------------------------------------------------------------
[08/02/2024-16:18:29] [I] [TRT] Input filename:   /my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx
[08/02/2024-16:18:29] [I] [TRT] ONNX IR version:  0.0.7
[08/02/2024-16:18:29] [I] [TRT] Opset version:    14
[08/02/2024-16:18:29] [I] [TRT] Producer name:    pytorch
[08/02/2024-16:18:29] [I] [TRT] Producer version: 1.14.0
[08/02/2024-16:18:29] [I] [TRT] Domain:           
[08/02/2024-16:18:29] [I] [TRT] Model version:    0
[08/02/2024-16:18:29] [I] [TRT] Doc string:       
[08/02/2024-16:18:29] [I] [TRT] ----------------------------------------------------------------
[08/02/2024-16:18:30] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/02/2024-16:18:30] [I] [TRT] No importer registered for op: EfficientNMSX_TRT. Attempting to import as plugin.
[08/02/2024-16:18:30] [I] [TRT] Searching for plugin: EfficientNMSX_TRT, plugin_version: 1, plugin_namespace: 
[08/02/2024-16:18:30] [I] [TRT] Successfully created plugin: EfficientNMSX_TRT
[08/02/2024-16:18:30] [I] [TRT] No importer registered for op: ROIAlign_TRT. Attempting to import as plugin.
[08/02/2024-16:18:30] [I] [TRT] Searching for plugin: ROIAlign_TRT, plugin_version: 1, plugin_namespace: 
[08/02/2024-16:18:30] [I] [TRT] Successfully created plugin: ROIAlign_TRT
[08/02/2024-16:18:30] [I] Finish parsing network model

[08/02/2024-16:18:30] [E] Error[4]: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer /model/model.22/Sub: broadcast dimensions must be conformable)
[08/02/2024-16:18:30] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

[08/02/2024-16:18:30] [E] Engine could not be created from network
[08/02/2024-16:18:30] [E] Building engine failed
[08/02/2024-16:18:30] [E] Failed to create engine from model or file.
[08/02/2024-16:18:30] [E] Engine set up failed

@andrew-93

andrew-93 commented Aug 2, 2024

[screenshot]

Could this problem be related to the fact that a dynamic shape is used for "input"?

[screenshot]

@levipereira (Author)

levipereira commented Aug 2, 2024

I just followed these steps using nvcr.io/nvidia/deepstream:6.2-triton, which has TRT v8502, and everything works.

Tip: yolov8x-seg-trt.onnx is not recommended for Jetson; use smaller models.

https://github.com/levipereira/deepstream-yolo-e2e

&&&& RUNNING TensorRT.trtexec [TensorRT v8502] # trtexec --onnx=models/yolov8s-seg-trt.onnx --fp16 --saveEngine=models/yolov8s-seg-trt-fp16-netsize-640-batch-2.engine --timingCacheFile=models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache --warmUp=500 --duration=10 --useCudaGraph --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:2x3x640x640
[08/02/2024-14:59:48] [I] === Model Options ===
[08/02/2024-14:59:48] [I] Format: ONNX
[08/02/2024-14:59:48] [I] Model: models/yolov8s-seg-trt.onnx
[08/02/2024-14:59:48] [I] Output:
[08/02/2024-14:59:48] [I] === Build Options ===
[08/02/2024-14:59:48] [I] Max batch: explicit batch
[08/02/2024-14:59:48] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[08/02/2024-14:59:48] [I] minTiming: 1
[08/02/2024-14:59:48] [I] avgTiming: 8
[08/02/2024-14:59:48] [I] Precision: FP32+FP16
[08/02/2024-14:59:48] [I] LayerPrecisions:
[08/02/2024-14:59:48] [I] Calibration:
[08/02/2024-14:59:48] [I] Refit: Disabled
[08/02/2024-14:59:48] [I] Sparsity: Disabled
[08/02/2024-14:59:48] [I] Safe mode: Disabled
[08/02/2024-14:59:48] [I] DirectIO mode: Disabled
[08/02/2024-14:59:48] [I] Restricted mode: Disabled
[08/02/2024-14:59:48] [I] Build only: Disabled
[08/02/2024-14:59:48] [I] Save engine: models/yolov8s-seg-trt-fp16-netsize-640-batch-2.engine
[08/02/2024-14:59:48] [I] Load engine:
[08/02/2024-14:59:48] [I] Profiling verbosity: 0
[08/02/2024-14:59:48] [I] Tactic sources: Using default tactic sources
[08/02/2024-14:59:48] [I] timingCacheMode: global
[08/02/2024-14:59:48] [I] timingCacheFile: models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache
[08/02/2024-14:59:48] [I] Heuristic: Disabled
[08/02/2024-14:59:48] [I] Preview Features: Use default preview flags.
[08/02/2024-14:59:48] [I] Input(s)s format: fp32:CHW
[08/02/2024-14:59:48] [I] Output(s)s format: fp32:CHW
[08/02/2024-14:59:48] [I] Input build shape: images=1x3x640x640+2x3x640x640+2x3x640x640
[08/02/2024-14:59:48] [I] Input calibration shapes: model
[08/02/2024-14:59:48] [I] === System Options ===
[08/02/2024-14:59:48] [I] Device: 0
[08/02/2024-14:59:48] [I] DLACore:
[08/02/2024-14:59:48] [I] Plugins:
[08/02/2024-14:59:48] [I] === Inference Options ===
[08/02/2024-14:59:48] [I] Batch: Explicit
[08/02/2024-14:59:48] [I] Input inference shape: images=2x3x640x640
[08/02/2024-14:59:48] [I] Iterations: 10
[08/02/2024-14:59:48] [I] Duration: 10s (+ 500ms warm up)
[08/02/2024-14:59:48] [I] Sleep time: 0ms
[08/02/2024-14:59:48] [I] Idle time: 0ms
[08/02/2024-14:59:48] [I] Streams: 1
[08/02/2024-14:59:48] [I] ExposeDMA: Disabled
[08/02/2024-14:59:48] [I] Data transfers: Enabled
[08/02/2024-14:59:48] [I] Spin-wait: Disabled
[08/02/2024-14:59:48] [I] Multithreading: Disabled
[08/02/2024-14:59:48] [I] CUDA Graph: Enabled
[08/02/2024-14:59:48] [I] Separate profiling: Disabled
[08/02/2024-14:59:48] [I] Time Deserialize: Disabled
[08/02/2024-14:59:48] [I] Time Refit: Disabled
[08/02/2024-14:59:48] [I] NVTX verbosity: 0
[08/02/2024-14:59:48] [I] Persistent Cache Ratio: 0
[08/02/2024-14:59:48] [I] Inputs:
[08/02/2024-14:59:48] [I] === Reporting Options ===
[08/02/2024-14:59:48] [I] Verbose: Disabled
[08/02/2024-14:59:48] [I] Averages: 10 inferences
[08/02/2024-14:59:48] [I] Percentiles: 90,95,99
[08/02/2024-14:59:48] [I] Dump refittable layers:Disabled
[08/02/2024-14:59:48] [I] Dump output: Disabled
[08/02/2024-14:59:48] [I] Profile: Disabled
[08/02/2024-14:59:48] [I] Export timing to JSON file:
[08/02/2024-14:59:48] [I] Export output to JSON file:
[08/02/2024-14:59:48] [I] Export profile to JSON file:
[08/02/2024-14:59:48] [I]
[08/02/2024-14:59:48] [I] === Device Information ===
[08/02/2024-14:59:48] [I] Selected Device: NVIDIA GeForce RTX 4090
[08/02/2024-14:59:48] [I] Compute Capability: 8.9
[08/02/2024-14:59:48] [I] SMs: 128
[08/02/2024-14:59:48] [I] Compute Clock Rate: 2.58 GHz
[08/02/2024-14:59:48] [I] Device Global Memory: 24208 MiB
[08/02/2024-14:59:48] [I] Shared Memory per SM: 100 KiB
[08/02/2024-14:59:48] [I] Memory Bus Width: 384 bits (ECC disabled)
[08/02/2024-14:59:48] [I] Memory Clock Rate: 10.501 GHz
[08/02/2024-14:59:48] [I]
[08/02/2024-14:59:48] [I] TensorRT version: 8.5.2
[08/02/2024-14:59:48] [I] [TRT] [MemUsageChange] Init CUDA: CPU +15, GPU +0, now: CPU 30, GPU 390 (MiB)
[08/02/2024-14:59:51] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +550, GPU +118, now: CPU 634, GPU 508 (MiB)
[08/02/2024-14:59:51] [I] Start parsing network model
[08/02/2024-14:59:51] [I] [TRT] ----------------------------------------------------------------
[08/02/2024-14:59:51] [I] [TRT] Input filename:   models/yolov8s-seg-trt.onnx
[08/02/2024-14:59:51] [I] [TRT] ONNX IR version:  0.0.7
[08/02/2024-14:59:51] [I] [TRT] Opset version:    14
[08/02/2024-14:59:51] [I] [TRT] Producer name:    pytorch
[08/02/2024-14:59:51] [I] [TRT] Producer version: 1.14.0
[08/02/2024-14:59:51] [I] [TRT] Domain:
[08/02/2024-14:59:51] [I] [TRT] Model version:    0
[08/02/2024-14:59:51] [I] [TRT] Doc string:
[08/02/2024-14:59:51] [I] [TRT] ----------------------------------------------------------------
[08/02/2024-14:59:51] [W] [TRT] onnx2trt_utils.cpp:377: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[08/02/2024-14:59:51] [I] [TRT] No importer registered for op: EfficientNMSX_TRT. Attempting to import as plugin.
[08/02/2024-14:59:51] [I] [TRT] Searching for plugin: EfficientNMSX_TRT, plugin_version: 1, plugin_namespace:
[08/02/2024-14:59:51] [I] [TRT] Successfully created plugin: EfficientNMSX_TRT
[08/02/2024-14:59:51] [I] [TRT] No importer registered for op: ROIAlign_TRT. Attempting to import as plugin.
[08/02/2024-14:59:51] [I] [TRT] Searching for plugin: ROIAlign_TRT, plugin_version: 1, plugin_namespace:
[08/02/2024-14:59:51] [I] [TRT] Successfully created plugin: ROIAlign_TRT
[08/02/2024-14:59:51] [I] Finish parsing network model
[08/02/2024-14:59:51] [W] Could not read timing cache from: models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache. A new timing cache will be generated and written.
[08/02/2024-14:59:51] [W] [TRT] Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
[08/02/2024-14:59:51] [W] [TRT] Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
[08/02/2024-14:59:51] [W] [TRT] Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
[08/02/2024-14:59:51] [W] [TRT] Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.
[08/02/2024-14:59:53] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +6, GPU +10, now: CPU 696, GPU 518 (MiB)
[08/02/2024-14:59:53] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +2, GPU +10, now: CPU 698, GPU 528 (MiB)
[08/02/2024-14:59:53] [I] [TRT] Global timing cache in use. Profiling results in this builder pass will be stored.
[08/02/2024-15:05:47] [I] [TRT] Total Activation Memory: 26017377792
[08/02/2024-15:05:47] [I] [TRT] Detected 1 inputs and 5 output network tensors.
[08/02/2024-15:05:47] [I] [TRT] Total Host Persistent Memory: 223920
[08/02/2024-15:05:47] [I] [TRT] Total Device Persistent Memory: 1324544
[08/02/2024-15:05:47] [I] [TRT] Total Scratch Memory: 32257024
[08/02/2024-15:05:47] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 32 MiB, GPU 8440 MiB
[08/02/2024-15:05:47] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 158 steps to complete.
[08/02/2024-15:05:47] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 17.8975ms to assign 13 blocks to 158 nodes requiring 354408960 bytes.
[08/02/2024-15:05:47] [I] [TRT] Total Activation Memory: 354408960
[08/02/2024-15:05:47] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 1442, GPU 612 (MiB)
[08/02/2024-15:05:47] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1442, GPU 622 (MiB)
[08/02/2024-15:05:47] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[08/02/2024-15:05:47] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[08/02/2024-15:05:47] [W] [TRT] Check verbose logs for the list of affected weights.
[08/02/2024-15:05:47] [W] [TRT] - 69 weights are affected by this issue: Detected subnormal FP16 values.
[08/02/2024-15:05:47] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +23, GPU +24, now: CPU 23, GPU 24 (MiB)
[08/02/2024-15:05:47] [I] [TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1450, GPU 554 (MiB)
[08/02/2024-15:05:47] [I] Saved 889719 bytes of timing cache to models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache
[08/02/2024-15:05:47] [I] Engine built in 359.25 sec.
[08/02/2024-15:05:48] [I] [TRT] Loaded engine size: 25 MiB
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 837, GPU 480 (MiB)
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 837, GPU 488 (MiB)
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +23, now: CPU 0, GPU 23 (MiB)
[08/02/2024-15:05:48] [I] Engine deserialized in 0.0254138 sec.
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 837, GPU 480 (MiB)
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 837, GPU 488 (MiB)
[08/02/2024-15:05:48] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +340, now: CPU 0, GPU 363 (MiB)
[08/02/2024-15:05:48] [I] Setting persistentCacheLimit to 0 bytes.
[08/02/2024-15:05:48] [I] Using random values for input images
[08/02/2024-15:05:48] [I] Created input binding for images with dimensions 2x3x640x640
[08/02/2024-15:05:48] [I] Using random values for output num_dets
[08/02/2024-15:05:48] [I] Created output binding for num_dets with dimensions 2x1
[08/02/2024-15:05:48] [I] Using random values for output det_boxes
[08/02/2024-15:05:48] [I] Created output binding for det_boxes with dimensions 2x100x4
[08/02/2024-15:05:48] [I] Using random values for output det_scores
[08/02/2024-15:05:48] [I] Created output binding for det_scores with dimensions 2x100
[08/02/2024-15:05:48] [I] Using random values for output det_classes
[08/02/2024-15:05:48] [I] Created output binding for det_classes with dimensions 2x100
[08/02/2024-15:05:48] [I] Using random values for output det_masks
[08/02/2024-15:05:48] [I] Created output binding for det_masks with dimensions 2x100x25600
[08/02/2024-15:05:48] [I] Starting inference
[08/02/2024-15:05:58] [I] Warmup completed 132 queries over 500 ms
[08/02/2024-15:05:58] [I] Timing trace has 2679 queries over 10.0138 s
[08/02/2024-15:05:58] [I]
[08/02/2024-15:05:58] [I] === Trace details ===
[08/02/2024-15:05:58] [I] Trace averages of 10 runs:
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74412 ms - Host latency: 7.40204 ms (enqueue 0.0157013 ms)
<snipped>
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74355 ms - Host latency: 7.40195 ms (enqueue 0.015918 ms)
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74346 ms - Host latency: 7.404 ms (enqueue 0.0166992 ms)
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74424 ms - Host latency: 7.4085 ms (enqueue 0.0208008 ms)
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74346 ms - Host latency: 7.40059 ms (enqueue 0.012207 ms)
[08/02/2024-15:05:58] [I] Average on 10 runs - GPU latency: 2.74375 ms - Host latency: 7.40127 ms (enqueue 0.0125 ms)
[08/02/2024-15:05:58] [I]
[08/02/2024-15:05:58] [I] === Performance summary ===
[08/02/2024-15:05:58] [I] Throughput: 267.531 qps
[08/02/2024-15:05:58] [I] Latency: min = 7.23438 ms, max = 7.45508 ms, mean = 7.40586 ms, median = 7.4043 ms, percentile(90%) = 7.41113 ms, percentile(95%) = 7.41895 ms, percentile(99%) = 7.44189 ms
[08/02/2024-15:05:58] [I] Enqueue Time: min = 0.0107422 ms, max = 0.034668 ms, mean = 0.0173812 ms, median = 0.0166016 ms, percentile(90%) = 0.020752 ms, percentile(95%) = 0.0214844 ms, percentile(99%) = 0.0234375 ms
[08/02/2024-15:05:58] [I] H2D Latency: min = 1.47852 ms, max = 1.53516 ms, mean = 1.48216 ms, median = 1.47998 ms, percentile(90%) = 1.4834 ms, percentile(95%) = 1.48828 ms, percentile(99%) = 1.52246 ms
[08/02/2024-15:05:58] [I] GPU Compute Time: min = 2.72656 ms, max = 2.74951 ms, mean = 2.74348 ms, median = 2.74341 ms, percentile(90%) = 2.74536 ms, percentile(95%) = 2.74561 ms, percentile(99%) = 2.74658 ms
[08/02/2024-15:05:58] [I] D2H Latency: min = 3.0293 ms, max = 3.19824 ms, mean = 3.18017 ms, median = 3.17993 ms, percentile(90%) = 3.18518 ms, percentile(95%) = 3.18707 ms, percentile(99%) = 3.19141 ms
[08/02/2024-15:05:58] [I] Total Host Walltime: 10.0138 s
[08/02/2024-15:05:58] [I] Total GPU Compute Time: 7.34979 s
[08/02/2024-15:05:58] [W] * Throughput may be bound by device-to-host transfers for the outputs rather than GPU Compute and the GPU may be under-utilized.
[08/02/2024-15:05:58] [W]   Add --noDataTransfers flag to disable data transfers.
[08/02/2024-15:05:58] [I] Explanations of the performance metrics are printed in the verbose logs.
[08/02/2024-15:05:58] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8502] # trtexec --onnx=models/yolov8s-seg-trt.onnx --fp16 --saveEngine=models/yolov8s-seg-trt-fp16-netsize-640-batch-2.engine --timingCacheFile=models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache --warmUp=500 --duration=10 --useCudaGraph --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:2x3x640x640

@andrew-93

andrew-93 commented Aug 5, 2024

@levipereira I was able to successfully create an engine for the yolov8x-seg-trt.onnx model, but only at resolution 640x640 (1x3x640x640). The conversion works only for this resolution, which is very strange... Does your plugin have fragments hardcoded for 640x640 only?
Conversion to any other resolution (I tested 1024x512, 640x320, 320x320, 512x512, 1280x640) gives this error:

[08/05/2024-17:30:19] [E] Error[4]: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer /model/model.22/Sub: broadcast dimensions must be conformable)
[08/05/2024-17:30:19] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

@levipereira (Author)

@levipereira I was able to successfully create an engine for the yolov8x-seg-trt.onnx model, but only at resolution 640x640 (1x3x640x640). The conversion works only for this resolution, which is very strange... Does your plugin have fragments hardcoded for 640x640 only? Conversion to any other resolution (I tested 1024x512, 640x320, 320x320, 512x512, 1280x640) gives this error:

[08/05/2024-17:30:19] [E] Error[4]: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer /model/model.22/Sub: broadcast dimensions must be conformable)
[08/05/2024-17:30:19] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

I haven't checked yet, but you can try disabling dynamic shapes and then check again.

@laugh12321

laugh12321 commented Aug 6, 2024

@levipereira I attempted to build TensorRT release 8.6 and TensorRT release 10.0, and also built the EfficientNMSX-related code on the 10.1 and 10.2 branches of NVIDIA TensorRT.

When building and testing TensorRT on the 3060 Ti and 4090, converting an ONNX model that uses the EfficientNMSX plugin with the built trtexec works fine. However, when building and testing TensorRT on the 2080 Ti, I always encounter the following error, even when using the binary built on the 3060 Ti or 4090. (I have expanded the memory of my 2080 Ti from 11GB to 22GB; could this be related to the issue?)

2080 Ti: [screenshot]

3060 Ti: [screenshot]

4090: [screenshot]

Here is the structure of my ONNX model: [screenshot]

@laugh12321

@andrew-93, you might want to try building TensorRT on a different device. I’ve tested it on the 2080 Ti, 3060 Ti, and 4090, and only the 2080 Ti is not working correctly.

@levipereira (Author)

@laugh12321 I would recommend trying different workspace sizes, such as 4GB and then 20GB, to see if the issue persists. This problem might also occur with the Efficient_NMS plugin, so checking that could provide additional insight.
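For reference, a minimal sketch of sweeping the workspace limit from the TensorRT Python API (TRT 8.4+), equivalent to trtexec --memPoolSize=workspace:...:

import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
config = builder.create_builder_config()
# Try 4 GiB first, then e.g. 20 GiB, to see whether the tactic failure persists.
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)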

I have an RTX 2060/2070 and I'll test as soon as possible.

@levipereira (Author)

@levipereira I was able to successfully create an engine for the yolov8x-seg-trt.onnx model, but only at resolution 640x640 (1x3x640x640). The conversion works only for this resolution, which is very strange... Does your plugin have fragments hardcoded for 640x640 only? Conversion to any other resolution (I tested 1024x512, 640x320, 320x320, 512x512, 1280x640) gives this error:

[08/05/2024-17:30:19] [E] Error[4]: [graphShapeAnalyzer.cpp::analyzeShapes::1872] Error Code 4: Miscellaneous (IElementWiseLayer /model/model.22/Sub: broadcast dimensions must be conformable)
[08/05/2024-17:30:19] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

This is related to dynamic shapes in the YOLOv8 export; I reused their base code. As a workaround, exporting the ONNX model with the desired input shape fixed (i.e., with dynamic shapes disabled) will make it work.
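A minimal sketch of that workaround, assuming the standard Ultralytics export API (the EfficientNMSX head itself comes from the patched export script, not from this call):

from ultralytics import YOLO

# dynamic=False bakes the chosen (height, width) into the ONNX graph, so
# the broadcast in /model.22/Sub is resolved at export time instead of
# failing during engine build.
model = YOLO("yolov8x-seg.pt")
model.export(format="onnx", imgsz=(512, 1024), dynamic=False, opset=14)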

@levipereira (Author)

@levipereira I attempted to build TensorRT release 8.6 and TensorRT release 10.0, and also built the EfficientNMSX-related code on the 10.1 and 10.2 branches of NVIDIA TensorRT.

When building and testing TensorRT on the 3060 Ti and 4090, converting an ONNX model that uses the EfficientNMSX plugin with the built trtexec works fine. However, when building and testing TensorRT on the 2080 Ti, I always encounter the following error, even when using the binary built on the 3060 Ti or 4090. (I have expanded the memory of my 2080 Ti from 11GB to 22GB; could this be related to the issue?)


=== Device Information ===
Selected Device: NVIDIA GeForce RTX 2060 SUPER
Compute Capability: 7.5
SMs: 34
Device Global Memory: 8191 MiB
Shared Memory per SM: 64 KiB
Memory Bus Width: 256 bits (ECC disabled)
Application Compute Clock Rate: 1.68 GHz
Application Memory Clock Rate: 7.001 GHz

Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.

TensorRT version: 8.6.1
Loading standard plugins
[TRT] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 27, GPU 1039 (MiB)
[TRT] [MemUsageChange] Init builder kernel library: CPU +897, GPU +174, now: CPU 1000, GPU 1213 (MiB)
.
.
.
[TRT] [MemUsageChange] Init CUDA: CPU +0, GPU +0, now: CPU 1347, GPU 1233 (MiB)
Saved 994757 bytes of timing cache to models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache
Engine built in 360.003 sec.
[TRT] Loaded engine size: 24 MiB
[TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +10, now: CPU 369, GPU 1095 (MiB)
[TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 369, GPU 1103 (MiB)
[TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +24, now: CPU 0, GPU 24 (MiB)
Engine deserialized in 0.04292 sec.
[TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 369, GPU 1095 (MiB)
[TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 369, GPU 1103 (MiB)
[TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +345, now: CPU 0, GPU 369 (MiB)
Setting persistentCacheLimit to 0 bytes.
Using random values for input images
Input binding for images with dimensions 2x3x640x640 is created.
Output binding for num_dets with dimensions 2x1 is created.
Output binding for det_boxes with dimensions 2x100x4 is created.
Output binding for det_scores with dimensions 2x100 is created.
Output binding for det_classes with dimensions 2x100 is created.
Output binding for det_masks with dimensions 2x100x25600 is created.
Starting inference
Warmup completed 33 queries over 500 ms
Timing trace has 654 queries over 10.0459 s
.
.
.
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=models/yolov8s-seg-trt.onnx --fp16 --saveEngine=models/yolov8s-seg-trt-fp16-netsize-640-batch-2.engine --timingCacheFile=models/yolov8s-seg-trt-fp16-netsize-640.engine.timing.cache --warmUp=500 --duration=10 --useCudaGraph --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:2x3x640x640


@laugh12321

I would recommend trying different workspace sizes, such as 4GB and then 20GB, to see if the issue persists. This problem might also occur with the Efficient_NMS plugin, so checking that could provide additional insight.

I have an RTX 2060/2070 and I'll test as soon as possible.

[08/07/2024-07:53:41] [I] === Model Options ===
[08/07/2024-07:53:41] [I] Format: ONNX
[08/07/2024-07:53:41] [I] Model: D:\laugh\Projects\TensorRT-YOLO\demo\obb\models\yolov8s-obb.onnx
[08/07/2024-07:53:41] [I] Output:
[08/07/2024-07:53:41] [I] === Build Options ===
[08/07/2024-07:53:41] [I] Memory Pools: workspace: 2.14748e+10 MiB, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default, tacticSharedMem: default
[08/07/2024-07:53:41] [I] avgTiming: 8
[08/07/2024-07:53:41] [I] Precision: FP32+FP16
[08/07/2024-07:53:41] [I] LayerPrecisions:
[08/07/2024-07:53:41] [I] Layer Device Types:
[08/07/2024-07:53:41] [I] Calibration:
[08/07/2024-07:53:41] [I] Refit: Disabled
[08/07/2024-07:53:41] [I] Strip weights: Disabled
[08/07/2024-07:53:41] [I] Version Compatible: Disabled
[08/07/2024-07:53:41] [I] ONNX Plugin InstanceNorm: Disabled
[08/07/2024-07:53:41] [I] TensorRT runtime: full
[08/07/2024-07:53:41] [I] Lean DLL Path:
[08/07/2024-07:53:41] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[08/07/2024-07:53:41] [I] Exclude Lean Runtime: Disabled
[08/07/2024-07:53:41] [I] Sparsity: Disabled
[08/07/2024-07:53:41] [I] Safe mode: Disabled
[08/07/2024-07:53:41] [I] Build DLA standalone loadable: Disabled
[08/07/2024-07:53:41] [I] Allow GPU fallback for DLA: Disabled
[08/07/2024-07:53:41] [I] DirectIO mode: Disabled
[08/07/2024-07:53:41] [I] Restricted mode: Disabled
[08/07/2024-07:53:41] [I] Skip inference: Disabled
[08/07/2024-07:53:41] [I] Save engine: D:\laugh\Projects\TensorRT-YOLO\demo\obb\models\yolov8s-obb.engine
[08/07/2024-07:53:41] [I] Load engine:
[08/07/2024-07:53:41] [I] Profiling verbosity: 0
[08/07/2024-07:53:41] [I] Tactic sources: Using default tactic sources
[08/07/2024-07:53:41] [I] timingCacheMode: local
[08/07/2024-07:53:41] [I] timingCacheFile:
[08/07/2024-07:53:41] [I] Enable Compilation Cache: Enabled
[08/07/2024-07:53:41] [I] errorOnTimingCacheMiss: Disabled
[08/07/2024-07:53:41] [I] Preview Features: Use default preview flags.
[08/07/2024-07:53:41] [I] MaxAuxStreams: -1
[08/07/2024-07:53:41] [I] BuilderOptimizationLevel: -1
[08/07/2024-07:53:41] [I] Calibration Profile Index: 0
[08/07/2024-07:53:41] [I] Weight Streaming: Disabled
[08/07/2024-07:53:41] [I] Runtime Platform: Same As Build
[08/07/2024-07:53:41] [I] Debug Tensors:
[08/07/2024-07:53:41] [I] Input(s)s format: fp32:CHW
[08/07/2024-07:53:41] [I] Output(s)s format: fp32:CHW
[08/07/2024-07:53:41] [I] Input build shapes: model
[08/07/2024-07:53:41] [I] Input calibration shapes: model
[08/07/2024-07:53:41] [I] === System Options ===
[08/07/2024-07:53:41] [I] Device: 0
[08/07/2024-07:53:41] [I] DLACore:
[08/07/2024-07:53:41] [I] Plugins:
[08/07/2024-07:53:41] [I] setPluginsToSerialize:
[08/07/2024-07:53:41] [I] dynamicPlugins:
[08/07/2024-07:53:41] [I] ignoreParsedPluginLibs: 0
[08/07/2024-07:53:41] [I]
[08/07/2024-07:53:41] [I] === Inference Options ===
[08/07/2024-07:53:41] [I] Batch: Explicit
[08/07/2024-07:53:41] [I] Input inference shapes: model
[08/07/2024-07:53:41] [I] Iterations: 10
[08/07/2024-07:53:41] [I] Duration: 3s (+ 200ms warm up)
[08/07/2024-07:53:41] [I] Sleep time: 0ms
[08/07/2024-07:53:41] [I] Idle time: 0ms
[08/07/2024-07:53:41] [I] Inference Streams: 1
[08/07/2024-07:53:41] [I] ExposeDMA: Disabled
[08/07/2024-07:53:41] [I] Data transfers: Enabled
[08/07/2024-07:53:41] [I] Spin-wait: Disabled
[08/07/2024-07:53:41] [I] Multithreading: Disabled
[08/07/2024-07:53:41] [I] CUDA Graph: Disabled
[08/07/2024-07:53:41] [I] Separate profiling: Disabled
[08/07/2024-07:53:41] [I] Time Deserialize: Disabled
[08/07/2024-07:53:41] [I] Time Refit: Disabled
[08/07/2024-07:53:41] [I] NVTX verbosity: 0
[08/07/2024-07:53:41] [I] Persistent Cache Ratio: 0
[08/07/2024-07:53:41] [I] Optimization Profile Index: 0
[08/07/2024-07:53:41] [I] Weight Streaming Budget: 100.000000%
[08/07/2024-07:53:41] [I] Inputs:
[08/07/2024-07:53:41] [I] Debug Tensor Save Destinations:
[08/07/2024-07:53:41] [I] === Reporting Options ===
[08/07/2024-07:53:41] [I] Verbose: Disabled
[08/07/2024-07:53:41] [I] Averages: 10 inferences
[08/07/2024-07:53:41] [I] Percentiles: 90,95,99
[08/07/2024-07:53:41] [I] Dump refittable layers:Disabled
[08/07/2024-07:53:41] [I] Dump output: Disabled
[08/07/2024-07:53:41] [I] Profile: Disabled
[08/07/2024-07:53:41] [I] Export timing to JSON file:
[08/07/2024-07:53:41] [I] Export output to JSON file:
[08/07/2024-07:53:41] [I] Export profile to JSON file:
[08/07/2024-07:53:41] [I]
[08/07/2024-07:53:41] [I] === Device Information ===
[08/07/2024-07:53:41] [I] Available Devices:
[08/07/2024-07:53:41] [I]   Device 0: "NVIDIA GeForce RTX 2080 Ti" UUID: GPU-c0370922-f0e4-5f2b-5f7a-16ae5ab03013
[08/07/2024-07:53:41] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[08/07/2024-07:53:41] [I] Selected Device ID: 0
[08/07/2024-07:53:41] [I] Selected Device UUID: GPU-c0370922-f0e4-5f2b-5f7a-16ae5ab03013
[08/07/2024-07:53:41] [I] Compute Capability: 7.5
[08/07/2024-07:53:41] [I] SMs: 68
[08/07/2024-07:53:41] [I] Device Global Memory: 22527 MiB
[08/07/2024-07:53:41] [I] Shared Memory per SM: 64 KiB
[08/07/2024-07:53:41] [I] Memory Bus Width: 352 bits (ECC disabled)
[08/07/2024-07:53:41] [I] Application Compute Clock Rate: 1.755 GHz
[08/07/2024-07:53:41] [I] Application Memory Clock Rate: 7 GHz
[08/07/2024-07:53:41] [I]
[08/07/2024-07:53:41] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[08/07/2024-07:53:41] [I]
[08/07/2024-07:53:41] [I] TensorRT version: 10.2.0
[08/07/2024-07:53:41] [I] Loading standard plugins
[08/07/2024-07:53:42] [I] [TRT] [MemUsageChange] Init CUDA: CPU +402, GPU +0, now: CPU 8696, GPU 1275 (MiB)
[08/07/2024-07:53:43] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1300, GPU +180, now: CPU 10309, GPU 1455 (MiB)
[08/07/2024-07:53:43] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[08/07/2024-07:53:43] [I] Start parsing network model.
[08/07/2024-07:53:43] [I] [TRT] ----------------------------------------------------------------
[08/07/2024-07:53:43] [I] [TRT] Input filename:   D:\laugh\Projects\TensorRT-YOLO\demo\obb\models\yolov8s-obb.onnx
[08/07/2024-07:53:43] [I] [TRT] ONNX IR version:  0.0.6
[08/07/2024-07:53:43] [I] [TRT] Opset version:    11
[08/07/2024-07:53:43] [I] [TRT] Producer name:    pytorch
[08/07/2024-07:53:43] [I] [TRT] Producer version: 2.4.0
[08/07/2024-07:53:43] [I] [TRT] Domain:
[08/07/2024-07:53:43] [I] [TRT] Model version:    0
[08/07/2024-07:53:43] [I] [TRT] Doc string:
[08/07/2024-07:53:43] [I] [TRT] ----------------------------------------------------------------
[08/07/2024-07:53:43] [I] [TRT] No checker registered for op: EfficientNMSX_TRT. Attempting to check as plugin.
[08/07/2024-07:53:43] [I] [TRT] No importer registered for op: EfficientNMSX_TRT. Attempting to import as plugin.
[08/07/2024-07:53:43] [I] [TRT] Searching for plugin: EfficientNMSX_TRT, plugin_version: 1, plugin_namespace:
[08/07/2024-07:53:43] [I] [TRT] Successfully created plugin: EfficientNMSX_TRT
[08/07/2024-07:53:43] [I] Finished parsing network model. Parse time: 0.0738637
[08/07/2024-07:53:43] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/07/2024-07:53:43] [I] [TRT] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[08/07/2024-07:53:43] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[08/07/2024-07:57:22] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception Assertion pluginUtils::isSuccess(status) failed.
[08/07/2024-07:57:22] [E] Error[9]: Error Code: 9: Skipping tactic 0x0000000000000000 due to exception Assertion pluginUtils::isSuccess(status) failed.
[08/07/2024-07:57:22] [E] Error[10]: IBuilder::buildSerializedNetwork: Error Code 10: Internal Error (Could not find any implementation for node /model.22/EfficientNMSX_TRT.)
[08/07/2024-07:57:22] [E] Engine could not be created from network
[08/07/2024-07:57:22] [E] Building engine failed
[08/07/2024-07:57:22] [E] Failed to create engine from model or file.
[08/07/2024-07:57:22] [E] Engine set up failed
&&&& FAILED TensorRT.trtexec [TensorRT v100200] # D:\laugh\Downloads\TensorRT-v10.2.0.19_build_by_CUDA-v12.4_cuDNN-v8.9.7.29\v10.2.0.19\bin\trtexec.exe --onnx=D:\laugh\Projects\TensorRT-YOLO\demo\obb\models\yolov8s-obb.onnx --saveEngine=D:\laugh\Projects\TensorRT-YOLO\demo\obb\models\yolov8s-obb.engine --fp16 --memPoolSize=workspace:21474836480

@andrew-93

andrew-93 commented Aug 8, 2024

I solved this problem. Now you can set any (even non-square) input resolution:
https://github.com/andrew-93/ultralytics/commits/NMSX/
Don't forget to set bhwc_for_det_masks: False, since you are using levipereira's inference (not my custom one); see my README.

For example, suppose you need to go pt-->onnx-->engine with input resolution width 1024, height 512:

pt-->onnx
python3 yolov8_onnxtrt.py -w yolov8x-seg.pt -s 1024x512

onnx-->engine
/usr/src/tensorrt/bin/trtexec --onnx=/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.onnx --saveEngine=/my_folder/BEST_WEIGHTS/yolov8x-seg-trt.engine --shapes=images:1x3x512x1024 --fp16 --explicitBatch --workspace=4096 --useCudaGraph
