
TensorRT inference speed below expectations #52

Open
zero-c1 opened this issue Jul 19, 2024 · 13 comments

Comments


zero-c1 commented Jul 19, 2024

Thanks for sharing this work. I have converted the entire GPS-Gaussian network into TensorRT engines with FP16 optimization enabled, but when testing at 2048x1024 resolution the inference alone takes about 60 ms, which is not the speed real-time inference should have. Where could the problem be?

Below is the trtexec output:

&&&& RUNNING TensorRT.trtexec [TensorRT v100100] # /home/lisi/programs/TensorRT-10.1.0.27/bin/trtexec --device=7 --fp16 --loadEngine=gps_gaussian_2048x1024_v3_GSRegressor_fp16.plan --profilingVerbosity=detailed --separateProfileRun
[07/19/2024-10:08:48] [I] === Model Options ===
[07/19/2024-10:08:48] [I] Format: *
[07/19/2024-10:08:48] [I] Model:
[07/19/2024-10:08:48] [I] Output:
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === System Options ===
[07/19/2024-10:08:48] [I] Device: 7
[07/19/2024-10:08:48] [I] DLACore:
[07/19/2024-10:08:48] [I] setPluginsToSerialize:
[07/19/2024-10:08:48] [I] dynamicPlugins:
[07/19/2024-10:08:48] [I] ignoreParsedPluginLibs: 0
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === Inference Options ===
[07/19/2024-10:08:48] [I] Batch: Explicit
[07/19/2024-10:08:48] [I] Input inference shapes: model
[07/19/2024-10:08:48] [I] Iterations: 10
[07/19/2024-10:08:48] [I] Duration: 3s (+ 200ms warm up)
[07/19/2024-10:08:48] [I] Sleep time: 0ms
[07/19/2024-10:08:48] [I] Idle time: 0ms
[07/19/2024-10:08:48] [I] Inference Streams: 1
[07/19/2024-10:08:48] [I] ExposeDMA: Disabled
[07/19/2024-10:08:48] [I] Data transfers: Enabled
[07/19/2024-10:08:48] [I] Spin-wait: Disabled
[07/19/2024-10:08:48] [I] Multithreading: Disabled
[07/19/2024-10:08:48] [I] CUDA Graph: Disabled
[07/19/2024-10:08:48] [I] Separate profiling: Enabled
[07/19/2024-10:08:48] [I] Time Deserialize: Disabled
[07/19/2024-10:08:48] [I] Time Refit: Disabled
[07/19/2024-10:08:48] [I] NVTX verbosity: 2
[07/19/2024-10:08:48] [I] Persistent Cache Ratio: 0
[07/19/2024-10:08:48] [I] Optimization Profile Index: 0
[07/19/2024-10:08:48] [I] Weight Streaming Budget: 100.000000%
[07/19/2024-10:08:48] [I] Inputs:
[07/19/2024-10:08:48] [I] Debug Tensor Save Destinations:
[07/19/2024-10:08:48] [I] === Reporting Options ===
[07/19/2024-10:08:48] [I] Verbose: Disabled
[07/19/2024-10:08:48] [I] Averages: 10 inferences
[07/19/2024-10:08:48] [I] Percentiles: 90,95,99
[07/19/2024-10:08:48] [I] Dump refittable layers:Disabled
[07/19/2024-10:08:48] [I] Dump output: Disabled
[07/19/2024-10:08:48] [I] Profile: Disabled
[07/19/2024-10:08:48] [I] Export timing to JSON file:
[07/19/2024-10:08:48] [I] Export output to JSON file:
[07/19/2024-10:08:48] [I] Export profile to JSON file:
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] === Device Information ===
[07/19/2024-10:08:48] [I] Available Devices:
[07/19/2024-10:08:48] [I] Device 0: "NVIDIA GeForce RTX 3090" UUID: GPU-5cbd64b3-e27d-c315-e47f-8021a921a2a6
[07/19/2024-10:08:48] [I] Device 1: "NVIDIA GeForce RTX 3090" UUID: GPU-49724fd3-a532-5cc1-40ec-95f61b422435
[07/19/2024-10:08:48] [I] Device 2: "NVIDIA GeForce RTX 3090" UUID: GPU-db1421bf-c0f5-950f-4558-81ccf560b9e9
[07/19/2024-10:08:48] [I] Device 3: "NVIDIA GeForce RTX 3090" UUID: GPU-54ca94dd-1e87-f703-4eda-8b67421049eb
[07/19/2024-10:08:48] [I] Device 4: "NVIDIA GeForce RTX 3090" UUID: GPU-067038dd-32f4-d18c-0a04-65e2dff9a5d1
[07/19/2024-10:08:48] [I] Device 5: "NVIDIA GeForce RTX 3090" UUID: GPU-1422801a-9804-49b9-6a25-589116dfcc3a
[07/19/2024-10:08:48] [I] Device 6: "NVIDIA GeForce RTX 3090" UUID: GPU-a121abe0-a1be-610f-2b3b-4392d8656abf
[07/19/2024-10:08:48] [I] Device 7: "NVIDIA GeForce RTX 3090" UUID: GPU-ff72cf34-16bb-377f-0874-7ac3f979d967
[07/19/2024-10:08:48] [I] Selected Device: NVIDIA GeForce RTX 3090
[07/19/2024-10:08:48] [I] Selected Device ID: 7
[07/19/2024-10:08:48] [I] Selected Device UUID: GPU-ff72cf34-16bb-377f-0874-7ac3f979d967
[07/19/2024-10:08:48] [I] Compute Capability: 8.6
[07/19/2024-10:08:48] [I] SMs: 82
[07/19/2024-10:08:48] [I] Device Global Memory: 24259 MiB
[07/19/2024-10:08:48] [I] Shared Memory per SM: 100 KiB
[07/19/2024-10:08:48] [I] Memory Bus Width: 384 bits (ECC disabled)
[07/19/2024-10:08:48] [I] Application Compute Clock Rate: 1.695 GHz
[07/19/2024-10:08:48] [I] Application Memory Clock Rate: 9.751 GHz
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[07/19/2024-10:08:48] [I]
[07/19/2024-10:08:48] [I] TensorRT version: 10.1.0
[07/19/2024-10:08:48] [I] Loading standard plugins
[07/19/2024-10:08:48] [I] [TRT] Loaded engine size: 70 MiB
[07/19/2024-10:08:48] [I] Engine deserialized in 0.093354 sec.
[07/19/2024-10:08:48] [I] [TRT] [MS] Running engine with multi stream info
[07/19/2024-10:08:48] [I] [TRT] [MS] Number of aux streams is 7
[07/19/2024-10:08:48] [I] [TRT] [MS] Number of total worker streams is 8
[07/19/2024-10:08:48] [I] [TRT] [MS] The main stream provided by execute/enqueue calls is the first worker stream
[07/19/2024-10:08:48] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1578, now: CPU 0, GPU 1642 (MiB)
[07/19/2024-10:08:48] [I] Setting persistentCacheLimit to 0 bytes.
[07/19/2024-10:08:48] [I] Created execution context with device memory size: 1575 MiB
[07/19/2024-10:08:48] [I] Using random values for input color
[07/19/2024-10:08:48] [I] Input binding for color with dimensions 2x3x2048x1024 is created.
[07/19/2024-10:08:48] [I] Using random values for input mask
[07/19/2024-10:08:48] [I] Input binding for mask with dimensions 2x1x2048x1024 is created.
[07/19/2024-10:08:48] [I] Using random values for input intr
[07/19/2024-10:08:48] [I] Input binding for intr with dimensions 2x3x3 is created.
[07/19/2024-10:08:48] [I] Using random values for input ref_intr
[07/19/2024-10:08:48] [I] Input binding for ref_intr with dimensions 2x3x3 is created.
[07/19/2024-10:08:48] [I] Using random values for input extr
[07/19/2024-10:08:48] [I] Input binding for extr with dimensions 2x4x4 is created.
[07/19/2024-10:08:48] [I] Using random values for input Tf_x
[07/19/2024-10:08:48] [I] Input binding for Tf_x with dimensions 2 is created.
[07/19/2024-10:08:48] [I] Output binding for 2223 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2246 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2251 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2263 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Output binding for 2275 is dynamic and will be created during execution using OutputAllocator.
[07/19/2024-10:08:48] [I] Starting inference
[07/19/2024-10:08:52] [I] Warmup completed 1 queries over 200 ms
[07/19/2024-10:08:52] [I] Timing trace has 49 queries over 2.99263 s
[07/19/2024-10:08:52] [I]
[07/19/2024-10:08:52] [I] === Trace details ===
[07/19/2024-10:08:52] [I] Trace averages of 10 runs:
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 61.2022 ms - Host latency: 74.8459 ms (enqueue 61.688 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.9374 ms - Host latency: 74.2716 ms (enqueue 60.8798 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.583 ms - Host latency: 73.9158 ms (enqueue 60.5313 ms)
[07/19/2024-10:08:52] [I] Average on 10 runs - GPU latency: 60.364 ms - Host latency: 73.7574 ms (enqueue 60.3091 ms)
[07/19/2024-10:08:52] [I]
[07/19/2024-10:08:52] [I] === Performance summary ===
[07/19/2024-10:08:52] [I] Throughput: 16.3736 qps
[07/19/2024-10:08:52] [I] Latency: min = 72.8015 ms, max = 78.688 ms, mean = 74.1499 ms, median = 73.7568 ms, percentile(90%) = 75.6145 ms, percentile(95%) = 75.8404 ms, percentile(99%) = 78.688 ms
[07/19/2024-10:08:52] [I] Enqueue Time: min = 59.4814 ms, max = 67.623 ms, mean = 60.8126 ms, median = 60.3682 ms, percentile(90%) = 62.266 ms, percentile(95%) = 62.4333 ms, percentile(99%) = 67.623 ms
[07/19/2024-10:08:52] [I] H2D Latency: min = 3.12207 ms, max = 6.19547 ms, mean = 3.30093 ms, median = 3.22266 ms, percentile(90%) = 3.36646 ms, percentile(95%) = 3.38403 ms, percentile(99%) = 6.19547 ms
[07/19/2024-10:08:52] [I] GPU Compute Time: min = 59.9183 ms, max = 64.3553 ms, mean = 60.7554 ms, median = 60.4353 ms, percentile(90%) = 62.3135 ms, percentile(95%) = 62.3442 ms, percentile(99%) = 64.3553 ms
[07/19/2024-10:08:52] [I] D2H Latency: min = 9.27002 ms, max = 10.1483 ms, mean = 10.0936 ms, median = 10.1113 ms, percentile(90%) = 10.1284 ms, percentile(95%) = 10.1348 ms, percentile(99%) = 10.1483 ms
[07/19/2024-10:08:52] [I] Total Host Walltime: 2.99263 s
[07/19/2024-10:08:52] [I] Total GPU Compute Time: 2.97701 s
[07/19/2024-10:08:52] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[07/19/2024-10:08:52] [W] If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[07/19/2024-10:08:52] [W] * GPU compute time is unstable, with coefficient of variance = 1.40592%.
[07/19/2024-10:08:52] [W] If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[07/19/2024-10:08:52] [I] Explanations of the performance metrics are printed in the verbose logs.
[07/19/2024-10:08:52] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v100100] # /home/lisi/programs/TensorRT-10.1.0.27/bin/trtexec --device=7 --fp16 --loadEngine=gps_gaussian_2048x1024_v3_GSRegressor_fp16.plan --profilingVerbosity=detailed --separateProfileRun
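As a quick cross-check, the summary numbers in the log are internally consistent, and a mean GPU compute time of about 61 ms is roughly double the per-frame budget that 30 fps would allow:

```python
# Sanity-check the trtexec summary: throughput and the 30 fps frame budget.
queries = 49
walltime_s = 2.99263          # "Total Host Walltime" from the log
gpu_mean_ms = 60.7554         # "GPU Compute Time" mean from the log

qps = queries / walltime_s
print(f"throughput: {qps:.4f} qps")           # matches the reported 16.3736 qps

budget_ms = 1000 / 30                          # per-frame budget for 30 fps
print(f"30 fps budget: {budget_ms:.1f} ms vs measured {gpu_mean_ms:.1f} ms")
```

So even ignoring the host-side transfer time, the GPU compute alone is about 1.8x over the 33.3 ms budget at this resolution.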

@ShunyuanZheng
Collaborator

Is this 1024×2048 the input image resolution?

@zero-c1
Author

zero-c1 commented Jul 19, 2024

Yes, the inputs are two three-channel 1024×2048 images and one single-channel mask.

@ShunyuanZheng
Collaborator

The input resolution is too large. The 30 fps in the paper is for 1024×1024 input with 2048×2048 output, so your input has twice as many pixels, which is what slows it down.
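The arithmetic behind this: a 2048×1024 input has exactly twice the pixels of the paper's 1024×1024 input, so roughly doubled per-frame compute is expected for the convolutional parts of the network:

```python
# Pixel-count ratio: this issue's input resolution vs. the paper's.
issue_pixels = 2048 * 1024   # resolution reported in this issue
paper_pixels = 1024 * 1024   # resolution behind the paper's 30 fps figure
ratio = issue_pixels / paper_pixels
print(ratio)  # 2.0
```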

@zero-c1
Author

zero-c1 commented Jul 22, 2024

Thanks. Is the 30 fps in the paper computed from the GPS-Gaussian network inference time alone? What frame rate does the real-time novel-view rendering shown in the demo reach?

@ShunyuanZheng
Collaborator

The 30 fps includes matting and network inference. The demo additionally involves reading the camera video streams and other processing, so the GPS-Gaussian demo does not actually reach 30 fps, but that part can be optimized. Have a look at Tele-Aloha, the real-time communication system we proposed as follow-up work: even with four input viewpoints it still reaches 30 fps.

@HuaijiaLin

@ShunyuanZheng Hello, thanks for sharing the Tele-Aloha work. After reading it there are two points I don't quite understand:

  1. In Section 4.3, "warp all predicted depth maps ẑᵢ in Sec. 4.1 onto the novel view, resulting in a relatively dense fused depth map": how are the warp and fuse implemented? My understanding is that the depths of the four viewpoints are known and the depth at the novel view is to be solved. It looks like a forward warping operation, but after warping the results seem hard to fuse.
  2. In Section 4.1, "These points are rasterized to the viewpoints of cam2 and cam3": what rasterization method is used here?

Thanks!

@ShunyuanZheng
Collaborator

Hi. The warp is a forward DIBR operation; warping yields warped images from three source viewpoints (not four: the two top viewpoints in Tele-Aloha use RGB information from only one of them). The fuse step merges them together with the low-resolution feature maps, i.e., the fusion is done by a network. The whole warp+fuse operation is similar to the paper mentioned, just with a slightly different use of the network part. As for the second question: the point cloud obtained from the narrow-baseline cameras is rendered into depth maps at the wide-baseline cameras; any z-buffer-based depth rendering would work, and we implemented it with Taichi.
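The forward warp with a z-buffer described above can be sketched as follows. This is a minimal NumPy illustration of forward DIBR with a nearest-point-wins z-buffer, not the authors' Taichi implementation; the camera parameters in the usage example are hypothetical:

```python
import numpy as np

def forward_warp_depth(depth, K_src, K_dst, T_src_to_dst, out_shape):
    """Forward-warp (DIBR) a source depth map into a target view,
    resolving collisions with a z-buffer that keeps the nearest point."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    valid = depth > 0
    z = depth[valid]
    # Back-project source pixels to 3D points in the source camera frame.
    x = (u[valid] - K_src[0, 2]) * z / K_src[0, 0]
    y = (v[valid] - K_src[1, 2]) * z / K_src[1, 1]
    pts = np.stack([x, y, z, np.ones_like(z)])           # (4, N) homogeneous
    # Rigid transform into the target camera frame.
    pts_dst = T_src_to_dst @ pts
    zd = pts_dst[2]
    front = zd > 0                                       # keep points in front
    # Project with the target intrinsics.
    ud = np.round(K_dst[0, 0] * pts_dst[0, front] / zd[front] + K_dst[0, 2]).astype(int)
    vd = np.round(K_dst[1, 1] * pts_dst[1, front] / zd[front] + K_dst[1, 2]).astype(int)
    zd = zd[front]
    out_h, out_w = out_shape
    inside = (ud >= 0) & (ud < out_w) & (vd >= 0) & (vd < out_h)
    ud, vd, zd = ud[inside], vd[inside], zd[inside]
    # z-buffer: write far-to-near so the nearest depth wins at each pixel.
    order = np.argsort(-zd)
    warped = np.zeros(out_shape)
    warped[vd[order], ud[order]] = zd[order]
    return warped

# Sanity check with hypothetical intrinsics: under the identity transform,
# every pixel projects back onto itself, so the depth map is reproduced.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
d = np.full((64, 64), 2.0)
w = forward_warp_depth(d, K, K, np.eye(4), (64, 64))
```

Fusing the per-view warped maps, as described in the reply, would then be handled by a network together with the low-resolution feature maps rather than by a hand-written rule.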

@HuaijiaLin

Thanks for the reply. The DIBR operation in the paper you mentioned uses PyTorch3D to obtain the warped images at the novel view. Did your implementation also use PyTorch3D?

@ShunyuanZheng
Collaborator

No, we implemented it with our own Taichi kernels, which are faster.

@HuaijiaLin

Understood. Thanks for the reply!

@zero-c1
Author

zero-c1 commented Nov 20, 2024

Thanks for the reply. One more question: in the live demo, what preprocessing takes the camera output (3000x4096) down to the network input (1024x1024)? Our scene radius is also 2 m and the lens is 8mm 1:1.4; if we only crop, the image cannot cover the full body.

@ShunyuanZheng
Collaborator

ShunyuanZheng commented Nov 20, 2024

Crop out the full body and then resize to 1024; it should be a 2048×2048 crop resized to 1024×1024.
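A minimal sketch of that preprocessing, assuming the crop window is placed around the subject. The `top`/`left` offsets below are hypothetical, and nearest-neighbor decimation stands in for a proper bilinear resize (e.g. `cv2.resize`):

```python
import numpy as np

def crop_and_resize(img, top, left, crop=2048, out=1024):
    """Crop a crop x crop window from the full-resolution frame, then
    downsample to out x out. Nearest-neighbor decimation is used here for
    a dependency-free sketch; a real pipeline would use bilinear resizing."""
    patch = img[top:top + crop, left:left + crop]
    step = crop // out                      # 2048 -> 1024 is a 2x decimation
    return patch[::step, ::step]

frame = np.zeros((3000, 4096, 3), dtype=np.uint8)   # camera frame (H, W, C)
net_in = crop_and_resize(frame, top=476, left=1024) # hypothetical offsets
print(net_in.shape)   # (1024, 1024, 3)
```

In practice the crop offsets would come from a person detector or the mask, so the 2048×2048 window stays centered on the full body before the resize.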

@t973288913

Hi, could you share the ONNX conversion code?
