
neural-bits/vision-ai-course


2024-10-01 16:21:40 | INFO | ONNXExporter | Starting ONNX to TensorRT conversion.
2024-10-01 16:21:41 | INFO | ONNXExporter | Starting container nvcr.io/nvidia/tensorrt:22.10-py3 with GPU support and volume mapping.
2024-10-01 16:21:42 | INFO | ONNXExporter | Docker container started successfully.
2024-10-01 16:21:42 | INFO | ONNXExporter | Running TensorRT conversion command: trtexec --onnx=/workspace/yolov11m.onnx --saveEngine=/workspace/model.plan --fp16 --minShapes=images:1x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:8x3x640x640
2024-10-01 16:32:52 | INFO | ONNXExporter | Conversion successful.
2024-10-01 16:32:53 | INFO | ONNXExporter | &&&& RUNNING TensorRT.trtexec [TensorRT v8500] # trtexec --onnx=/workspace/yolov11m.onnx --saveEngine=/workspace/model.plan --fp16 --minShapes=images:1x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:8x3x640x640
[10/01/2024-13:21:44] [I] === Model Options ===
[10/01/2024-13:21:44] [I] Format: ONNX
[10/01/2024-13:21:44] [I] Model: /workspace/yolov11m.onnx
[10/01/2024-13:21:44] [I] Output:
[10/01/2024-13:21:44] [I] === Build Options ===
[10/01/2024-13:21:44] [I] Max batch: explicit batch
[10/01/2024-13:21:44] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[10/01/2024-13:21:44] [I] minTiming: 1
[10/01/2024-13:21:44] [I] avgTiming: 8
[10/01/2024-13:21:44] [I] Precision: FP32+FP16
[10/01/2024-13:21:44] [I] LayerPrecisions:
[10/01/2024-13:21:44] [I] Calibration:
[10/01/2024-13:21:44] [I] Refit: Disabled
[10/01/2024-13:21:44] [I] Sparsity: Disabled
[10/01/2024-13:21:44] [I] Safe mode: Disabled
[10/01/2024-13:21:44] [I] DirectIO mode: Disabled
[10/01/2024-13:21:44] [I] Restricted mode: Disabled
[10/01/2024-13:21:44] [I] Build only: Disabled
[10/01/2024-13:21:44] [I] Save engine: /workspace/model.plan
[10/01/2024-13:21:44] [I] Load engine:
[10/01/2024-13:21:44] [I] Profiling verbosity: 0
[10/01/2024-13:21:44] [I] Tactic sources: Using default tactic sources
[10/01/2024-13:21:44] [I] timingCacheMode: local
[10/01/2024-13:21:44] [I] timingCacheFile:
[10/01/2024-13:21:44] [I] Heuristic: Disabled
[10/01/2024-13:21:44] [I] Preview Features: Use default preview flags.
[10/01/2024-13:21:44] [I] Input(s)s format: fp32:CHW
[10/01/2024-13:21:44] [I] Output(s)s format: fp32:CHW
[10/01/2024-13:21:44] [I] Input build shape: images=1x3x640x640+4x3x640x640+8x3x640x640
[10/01/2024-13:21:44] [I] Input calibration shapes: model
[10/01/2024-13:21:44] [I] === System Options ===
[10/01/2024-13:21:44] [I] Device: 0
[10/01/2024-13:21:44] [I] DLACore:
[10/01/2024-13:21:44] [I] Plugins:
[10/01/2024-13:21:44] [I] === Inference Options ===
[10/01/2024-13:21:44] [I] Batch: Explicit
[10/01/2024-13:21:44] [I] Input inference shape: images=4x3x640x640
[10/01/2024-13:21:44] [I] Iterations: 10
[10/01/2024-13:21:44] [I] Duration: 3s (+ 200ms warm up)
[10/01/2024-13:21:44] [I] Sleep time: 0ms
[10/01/2024-13:21:44] [I] Idle time: 0ms
[10/01/2024-13:21:44] [I] Streams: 1
[10/01/2024-13:21:44] [I] ExposeDMA: Disabled
[10/01/2024-13:21:44] [I] Data transfers: Enabled
[10/01/2024-13:21:44] [I] Spin-wait: Disabled
[10/01/2024-13:21:44] [I] Multithreading: Disabled
[10/01/2024-13:21:44] [I] CUDA Graph: Disabled
[10/01/2024-13:21:44] [I] Separate profiling: Disabled
[10/01/2024-13:21:44] [I] Time Deserialize: Disabled
[10/01/2024-13:21:44] [I] Time Refit: Disabled
[10/01/2024-13:21:44] [I] NVTX verbosity: 0
[10/01/2024-13:21:44] [I] Persistent Cache Ratio: 0
[10/01/2024-13:21:44] [I] Inputs:
[10/01/2024-13:21:44] [I] === Reporting Options ===
[10/01/2024-13:21:44] [I] Verbose: Disabled
[10/01/2024-13:21:44] [I] Averages: 10 inferences
[10/01/2024-13:21:44] [I] Percentiles: 90,95,99
[10/01/2024-13:21:44] [I] Dump refittable layers:Disabled
[10/01/2024-13:21:44] [I] Dump output: Disabled
[10/01/2024-13:21:44] [I] Profile: Disabled
[10/01/2024-13:21:44] [I] Export timing to JSON file:
[10/01/2024-13:21:44] [I] Export output to JSON file:
[10/01/2024-13:21:44] [I] Export profile to JSON file:
[10/01/2024-13:21:44] [I]
[10/01/2024-13:21:44] [I] === Device Information ===
[10/01/2024-13:21:44] [I] Selected Device: NVIDIA GeForce RTX 2080 Ti
[10/01/2024-13:21:44] [I] Compute Capability: 7.5
[10/01/2024-13:21:44] [I] SMs: 68
[10/01/2024-13:21:44] [I] Compute Clock Rate: 1.545 GHz
[10/01/2024-13:21:44] [I] Device Global Memory: 11011 MiB
[10/01/2024-13:21:44] [I] Shared Memory per SM: 64 KiB
[10/01/2024-13:21:44] [I] Memory Bus Width: 352 bits (ECC disabled)
[10/01/2024-13:21:44] [I] Memory Clock Rate: 7 GHz
[10/01/2024-13:21:44] [I]
[10/01/2024-13:21:44] [I] TensorRT version: 8.5.0
[10/01/2024-13:21:45] [I] [TRT] [MemUsageChange] Init CUDA: CPU +306, GPU +0, now: CPU 319, GPU 312 (MiB)
[10/01/2024-13:21:47] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +260, GPU +74, now: CPU 632, GPU 386 (MiB)
[10/01/2024-13:21:47] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[10/01/2024-13:21:47] [I] Start parsing network model
[10/01/2024-13:21:47] [I] [TRT] ----------------------------------------------------------------
[10/01/2024-13:21:47] [I] [TRT] Input filename: /workspace/yolov11m.onnx
[10/01/2024-13:21:47] [I] [TRT] ONNX IR version: 0.0.10
[10/01/2024-13:21:47] [I] [TRT] Opset version: 19
[10/01/2024-13:21:47] [I] [TRT] Producer name: pytorch
[10/01/2024-13:21:47] [I] [TRT] Producer version: 2.4.1
[10/01/2024-13:21:47] [I] [TRT] Domain:
[10/01/2024-13:21:47] [I] [TRT] Model version: 0
[10/01/2024-13:21:47] [I] [TRT] Doc string:
[10/01/2024-13:21:47] [I] [TRT] ----------------------------------------------------------------
[10/01/2024-13:21:47] [W] [TRT] parsers/onnx/onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[10/01/2024-13:21:47] [I] Finish parsing network model
[10/01/2024-13:21:49] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +463, GPU +192, now: CPU 1192, GPU 586 (MiB)
[10/01/2024-13:21:49] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +115, GPU +52, now: CPU 1307, GPU 638 (MiB)
[10/01/2024-13:21:49] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[10/01/2024-13:25:26] [W] [TRT] Cache result detected as invalid for node: /model.9/m_2/MaxPool, LayerImpl: CaskPooling, tactic: 0x457fb0a334d63ae2
[10/01/2024-13:27:02] [I] [TRT] Detected 1 inputs and 3 output network tensors.
[10/01/2024-13:27:02] [I] [TRT] Total Host Persistent Memory: 288992
[10/01/2024-13:27:02] [I] [TRT] Total Device Persistent Memory: 1981952
[10/01/2024-13:27:02] [I] [TRT] Total Scratch Memory: 0
[10/01/2024-13:27:02] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 40 MiB, GPU 8974 MiB
[10/01/2024-13:27:02] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 70.8212ms to assign 13 blocks to 219 nodes requiring 273613832 bytes.
[10/01/2024-13:27:02] [I] [TRT] Total Activation Memory: 273613832
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1828, GPU 882 (MiB)
[10/01/2024-13:27:02] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[10/01/2024-13:27:02] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[10/01/2024-13:27:02] [W] [TRT] Check verbose logs for the list of affected weights.
[10/01/2024-13:27:02] [W] [TRT] - 103 weights are affected by this issue: Detected subnormal FP16 values.
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +39, GPU +40, now: CPU 39, GPU 40 (MiB)
[10/01/2024-13:27:02] [I] Engine built in 318.021 sec.
[10/01/2024-13:27:02] [I] [TRT] Loaded engine size: 40 MiB
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 1485, GPU 804 (MiB)
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +40, now: CPU 0, GPU 40 (MiB)
[10/01/2024-13:27:02] [I] Engine deserialized in 0.0326257 sec.
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 1486, GPU 806 (MiB)
[10/01/2024-13:27:02] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[10/01/2024-13:27:02] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +263, now: CPU 0, GPU 303 (MiB)
[10/01/2024-13:27:02] [I] Setting persistentCacheLimit to 0 bytes.
[10/01/2024-13:27:02] [I] Using random values for input images
[10/01/2024-13:27:03] [I] Created input binding for images with dimensions 4x3x640x640
[10/01/2024-13:27:03] [I] Using random values for output output0
[10/01/2024-13:27:03] [I] Created output binding for output0 with dimensions 4x84x8400
[10/01/2024-13:27:03] [I] Starting inference
[10/01/2024-13:27:06] [I] Warmup completed 28 queries over 200 ms
[10/01/2024-13:27:06] [I] Timing trace has 410 queries over 3.02161 s
[10/01/2024-13:27:06] [I]
[10/01/2024-13:27:06] [I] === Trace details ===
[10/01/2024-13:27:06] [I] Trace averages of 10 runs:
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.28331 ms - Host latency: 9.78616 ms (enqueue 1.30835 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.31797 ms - Host latency: 9.82058 ms (enqueue 1.2998 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.30121 ms - Host latency: 9.8033 ms (enqueue 1.27233 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.27156 ms - Host latency: 9.76617 ms (enqueue 1.26566 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.34681 ms - Host latency: 9.84652 ms (enqueue 1.35998 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.28901 ms - Host latency: 9.78999 ms (enqueue 1.27892 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.28663 ms - Host latency: 9.78557 ms (enqueue 1.31749 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.29865 ms - Host latency: 9.80212 ms (enqueue 1.27586 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.25928 ms - Host latency: 9.75422 ms (enqueue 1.32923 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.35859 ms - Host latency: 9.85643 ms (enqueue 1.28542 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.3587 ms - Host latency: 9.8573 ms (enqueue 1.32992 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.29194 ms - Host latency: 9.78958 ms (enqueue 1.34174 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.28684 ms - Host latency: 9.78763 ms (enqueue 1.36134 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.2998 ms - Host latency: 9.79867 ms (enqueue 1.28751 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.23649 ms - Host latency: 9.72839 ms (enqueue 1.37578 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.32729 ms - Host latency: 9.82485 ms (enqueue 1.35154 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.33502 ms - Host latency: 9.83679 ms (enqueue 1.30608 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.33231 ms - Host latency: 9.83311 ms (enqueue 1.35973 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.29429 ms - Host latency: 9.79397 ms (enqueue 1.29397 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.3264 ms - Host latency: 9.82222 ms (enqueue 1.13252 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.28718 ms - Host latency: 9.78649 ms (enqueue 1.21685 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.39557 ms - Host latency: 9.90116 ms (enqueue 1.25012 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.36726 ms - Host latency: 9.87059 ms (enqueue 1.35223 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.35522 ms - Host latency: 9.85743 ms (enqueue 1.34802 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.31547 ms - Host latency: 9.81656 ms (enqueue 1.36642 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.33381 ms - Host latency: 9.83654 ms (enqueue 1.42034 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.32703 ms - Host latency: 9.82683 ms (enqueue 1.2645 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.33293 ms - Host latency: 9.83184 ms (enqueue 1.35762 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.35146 ms - Host latency: 9.85215 ms (enqueue 1.33872 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.31404 ms - Host latency: 9.81743 ms (enqueue 1.29753 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.35864 ms - Host latency: 9.85999 ms (enqueue 1.35154 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.32095 ms - Host latency: 9.81885 ms (enqueue 1.29243 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.32578 ms - Host latency: 9.82539 ms (enqueue 1.39197 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.33469 ms - Host latency: 9.83662 ms (enqueue 1.32568 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.3686 ms - Host latency: 9.86856 ms (enqueue 1.26868 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.32788 ms - Host latency: 9.82847 ms (enqueue 1.30271 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.37122 ms - Host latency: 9.87229 ms (enqueue 1.25476 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.34409 ms - Host latency: 9.8446 ms (enqueue 1.34783 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.3197 ms - Host latency: 9.8218 ms (enqueue 1.33259 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.34622 ms - Host latency: 9.84602 ms (enqueue 1.30027 ms)
[10/01/2024-13:27:06] [I] Average on 10 runs - GPU latency: 7.30942 ms - Host latency: 9.8114 ms (enqueue 1.40342 ms)
[10/01/2024-13:27:06] [I]
[10/01/2024-13:27:06] [I] === Performance summary ===
[10/01/2024-13:27:06] [I] Throughput: 135.689 qps
[10/01/2024-13:27:06] [I] Latency: min = 9.5257 ms, max = 10.0364 ms, mean = 9.82231 ms, median = 9.82507 ms, percentile(90%) = 9.88892 ms, percentile(95%) = 9.9043 ms, percentile(99%) = 9.96838 ms
[10/01/2024-13:27:06] [I] Enqueue Time: min = 0.921631 ms, max = 2.23669 ms, mean = 1.31506 ms, median = 1.32135 ms, percentile(90%) = 1.61548 ms, percentile(95%) = 1.71899 ms, percentile(99%) = 1.81921 ms
[10/01/2024-13:27:06] [I] H2D Latency: min = 1.6131 ms, max = 1.67749 ms, mean = 1.63842 ms, median = 1.63818 ms, percentile(90%) = 1.64844 ms, percentile(95%) = 1.65234 ms, percentile(99%) = 1.65768 ms
[10/01/2024-13:27:06] [I] GPU Compute Time: min = 7.02802 ms, max = 7.53619 ms, mean = 7.32218 ms, median = 7.32544 ms, percentile(90%) = 7.38696 ms, percentile(95%) = 7.3988 ms, percentile(99%) = 7.46777 ms
[10/01/2024-13:27:06] [I] D2H Latency: min = 0.85791 ms, max = 0.874146 ms, mean = 0.861711 ms, median = 0.861511 ms, percentile(90%) = 0.862793 ms, percentile(95%) = 0.863159 ms, percentile(99%) = 0.869995 ms
[10/01/2024-13:27:06] [I] Total Host Walltime: 3.02161 s
[10/01/2024-13:27:06] [I] Total GPU Compute Time: 3.00209 s
[10/01/2024-13:27:06] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/01/2024-13:27:06] [I] &&&& PASSED TensorRT.trtexec [TensorRT v8500] # trtexec --onnx=/workspace/yolov11m.onnx --saveEngine=/workspace/model.plan --fp16 --minShapes=images:1x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:8x3x640x640

2024-10-01 16:32:57 | INFO | ONNXExporter | Stopping and removing Docker container.
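The log records the container image, the volume mapping to `/workspace`, and the exact `trtexec` command, but not how `ONNXExporter` drives Docker. The following is a minimal sketch, assuming a Python wrapper that shells out to the Docker CLI with a one-shot `docker run --rm` (the real exporter may instead start a long-lived container and run the command inside it). The names `build_trtexec_command`, `build_docker_command`, `run`, and the host path `/path/to/models` are hypothetical; the image tag and the `trtexec` flags are taken from the log.

```python
import shlex
import subprocess
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def _dims(shape: Shape) -> str:
    # trtexec expects shapes written as e.g. "1x3x640x640"
    return "x".join(str(d) for d in shape)


def build_trtexec_command(
    onnx_path: str,
    engine_path: str,
    input_name: str,
    min_shape: Shape,
    opt_shape: Shape,
    max_shape: Shape,
    fp16: bool = True,
) -> List[str]:
    """Assemble a trtexec call with one dynamic-shape optimization profile (hypothetical helper)."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        raise ValueError("min/opt/max shapes must have the same rank")
    for lo, mid, hi in zip(min_shape, opt_shape, max_shape):
        if not lo <= mid <= hi:
            raise ValueError("each dimension must satisfy min <= opt <= max")
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    cmd += [
        f"--minShapes={input_name}:{_dims(min_shape)}",
        f"--optShapes={input_name}:{_dims(opt_shape)}",
        f"--maxShapes={input_name}:{_dims(max_shape)}",
    ]
    return cmd


def build_docker_command(
    host_dir: str,
    inner_cmd: Sequence[str],
    image: str = "nvcr.io/nvidia/tensorrt:22.10-py3",
    lazy_loading: bool = False,
) -> List[str]:
    """Wrap a command in a throwaway GPU container that mounts host_dir at /workspace."""
    cmd = ["docker", "run", "--rm", "--gpus", "all", "-v", f"{host_dir}:/workspace"]
    if lazy_loading:
        # Optional, assumption: addresses the "CUDA lazy loading is not enabled" warning
        # if the container's CUDA version supports lazy module loading.
        cmd += ["-e", "CUDA_MODULE_LOADING=LAZY"]
    return cmd + [image, *inner_cmd]


def run(cmd: Sequence[str]) -> subprocess.CompletedProcess:
    # Capture trtexec's stdout so it can be logged, as the exporter does above.
    return subprocess.run(list(cmd), capture_output=True, text=True, check=True)


if __name__ == "__main__":
    trt_cmd = build_trtexec_command(
        "/workspace/yolov11m.onnx", "/workspace/model.plan", "images",
        (1, 3, 640, 640), (4, 3, 640, 640), (8, 3, 640, 640),
    )
    # Print the full command instead of executing it.
    print(shlex.join(build_docker_command("/path/to/models", trt_cmd)))
```

The shape check enforces the min <= opt <= max ordering an optimization profile needs. With the values from the log, TensorRT tunes its kernels for batch 4 (the opt shape), while the engine accepts any batch from 1 to 8 at runtime.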
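The build warned that 103 weights became subnormal when cast to FP16. In IEEE 754 half precision the smallest positive normal value is 2^-14 (about 6.10e-05), the smallest subnormal is 2^-24 (about 5.96e-08), magnitudes at or below 2^-25 round to zero, and the largest finite value is 65504. As a rough offline check, and not TensorRT's own logic, a list of FP32 weights could be classified by round-tripping each value through Python's half-precision `struct` format `e`. The function names are hypothetical, and extracting the weights from the ONNX file is left out.

```python
import math
import struct
from collections import Counter
from typing import Iterable

FP16_MIN_NORMAL = 2.0 ** -14  # smallest positive normal float16, about 6.10e-05


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value and return it as a float."""
    # struct's "e" format packs/unpacks binary16; it raises OverflowError if |x| is too large.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def classify_fp16(x: float) -> str:
    """Describe what happens to one FP32 weight when it is cast to FP16."""
    if not math.isfinite(x):
        return "non-finite"
    if x == 0.0:
        return "zero"
    try:
        h = to_fp16(x)
    except OverflowError:
        return "overflow"    # beyond the largest finite float16 (65504)
    if h == 0.0:
        return "underflow"   # nonzero weight rounded to zero
    if abs(h) < FP16_MIN_NORMAL:
        return "subnormal"   # still nonzero, but with fewer significant bits
    return "normal"


def fp16_report(weights: Iterable[float]) -> Counter:
    """Count weights per category; 'subnormal' matches the condition named in the warning."""
    return Counter(classify_fp16(w) for w in weights)


if __name__ == "__main__":
    print(fp16_report([0.5, 1e-5, 1e-9, 0.0, 1e-4, 1e6]))
```

Classifying the rounded value rather than the raw input matters near the boundary: an FP32 weight slightly below 2^-14 can round up to the smallest normal FP16 value.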
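The output binding `output0` has dimensions 4x84x8400 for the 4x3x640x640 input. The log reports only the shape, so the following assumes the Ultralytics YOLO11 detection layout: 84 is 4 box values (center x, center y, width, height) plus 80 COCO class scores, and 8400 is one prediction per grid cell over the stride-8, 16, and 32 feature maps, that is 80² + 40² + 20² = 6400 + 1600 + 400 = 8400. The hypothetical helpers below reproduce that arithmetic and split one prediction column; they do not implement confidence filtering or NMS.

```python
from typing import Sequence, Tuple


def expected_output_shape(
    batch: int,
    image_size: int = 640,
    num_classes: int = 80,
    strides: Sequence[int] = (8, 16, 32),
) -> Tuple[int, int, int]:
    """Predict (batch, 4 + num_classes, num_predictions) for an anchor-free YOLO head."""
    # One prediction per grid cell on each feature map.
    num_predictions = sum((image_size // s) ** 2 for s in strides)
    # 4 box values followed by one score per class.
    return batch, 4 + num_classes, num_predictions


def split_prediction(
    column: Sequence[float], num_classes: int = 80
) -> Tuple[Tuple[float, ...], int, float]:
    """Split one column of output0 into its box and its highest-scoring class."""
    if len(column) != 4 + num_classes:
        raise ValueError("column length must be 4 + num_classes")
    box = tuple(column[:4])
    scores = column[4:]
    best = max(range(num_classes), key=lambda i: scores[i])
    return box, best, scores[best]
```

Because the engine was built with `--maxShapes=images:8x3x640x640`, the same arithmetic gives 8x84x8400 at the largest allowed batch.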
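The performance summary can be cross-checked with simple arithmetic. trtexec's host latency is H2D transfer plus GPU compute plus D2H transfer (1.63842 + 7.32218 + 0.861711 ≈ 9.82231 ms mean). Throughput is queries over wall time (410 / 3.02161 s ≈ 135.689 qps), and since each query here is one batch of 4 images, that is roughly 543 images per second. The sketch below uses hypothetical names and also estimates transfer bandwidth from the tensor sizes, assuming 4 bytes per element as implied by the fp32:CHW input and output formats in the log.

```python
from dataclasses import dataclass
from math import prod
from typing import Sequence


@dataclass
class TraceSummary:
    batch: int          # images per query (inference shape 4x3x640x640)
    queries: int        # "Timing trace has N queries"
    walltime_s: float   # "Total Host Walltime"
    gpu_total_s: float  # "Total GPU Compute Time"
    h2d_ms: float       # mean "H2D Latency"
    compute_ms: float   # mean "GPU Compute Time"
    d2h_ms: float       # mean "D2H Latency"

    def qps(self) -> float:
        return self.queries / self.walltime_s

    def images_per_s(self) -> float:
        return self.qps() * self.batch

    def host_latency_ms(self) -> float:
        # Mean host latency = mean H2D + mean compute + mean D2H.
        return self.h2d_ms + self.compute_ms + self.d2h_ms

    def gpu_busy_fraction(self) -> float:
        return self.gpu_total_s / self.walltime_s


def tensor_bytes(shape: Sequence[int], itemsize: int = 4) -> int:
    # fp32 bindings use 4 bytes per element.
    return prod(shape) * itemsize


def bandwidth_gb_s(nbytes: int, ms: float) -> float:
    return nbytes / (ms * 1e-3) / 1e9


if __name__ == "__main__":
    s = TraceSummary(4, 410, 3.02161, 3.00209, 1.63842, 7.32218, 0.861711)
    print(f"{s.qps():.3f} qps, {s.images_per_s():.1f} img/s, "
          f"latency {s.host_latency_ms():.5f} ms, GPU busy {s.gpu_busy_fraction():.1%}")
    print(f"H2D {bandwidth_gb_s(tensor_bytes((4, 3, 640, 640)), s.h2d_ms):.1f} GB/s, "
          f"D2H {bandwidth_gb_s(tensor_bytes((4, 84, 8400)), s.d2h_ms):.1f} GB/s")
```

With the logged means, the 19,660,800-byte input moves at roughly 12 GB/s, the 11,289,600-byte output at roughly 13 GB/s, and the GPU was busy for about 99% of the 3.02 s wall time (3.00209 / 3.02161).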
