Python-measured latency is significantly higher than the model's built-in runtime statistics #2536

Open
ucsk opened this issue Oct 15, 2024 · 2 comments

ucsk commented Oct 15, 2024

Environment

  • FastDeploy version: fastdeploy-linux-gpu-0.0.0
  • System platform: Linux x64 (Ubuntu 20.04)
  • Hardware: NVIDIA GeForce RTX 4060 Ti; conda-forge: CUDA 11.7, cuDNN 8.4
  • Language: Python (3.10)

Performance question

  • The latency reported by the model's built-in FastDeploy benchmark statistics differs from the latency measured at the Python level.
  • How can the gap between the built-in 57.6966ms and the Python-measured 118ms be narrowed? (See the benchmark script and its output below.)

import time
import fastdeploy as fd
import numpy as np
import statistics


if __name__ == '__main__':
    # TensorRT backend on GPU 0 with FP16 and a dynamic batch size of 1-40.
    option = fd.RuntimeOption()
    option.use_gpu(0)
    option.use_trt_backend()
    option.trt_option.enable_fp16 = True
    option.trt_option.set_shape('images', [1, 3, 640, 640], [1, 3, 640, 640], [40, 3, 640, 640])
    option.trt_option.serialize_file = 'weights/yolov8m.engine'
    model = fd.vision.detection.YOLOv8('weights/yolov8m.onnx', runtime_option=option)

    # A batch of 20 random 360x640 HWC uint8 images.
    ims = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(20)]

    # The built-in statistics time only the inference engine; the Python-side
    # timing below wraps the whole batch_predict() call (pre/post-processing
    # included). Both skip the first 100 iterations as warmup.
    model.enable_record_time_of_runtime()
    costs = []
    for i in range(500):
        if 100 <= i:
            begin = time.perf_counter()
        results = model.batch_predict(ims)
        if 100 <= i:
            costs.append(time.perf_counter() - begin)
    model.print_statis_info_of_runtime()

    print(f'{int(1000 * statistics.mean(costs))}ms')

$ python benchmark.py 
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(719)::CreateTrtEngineFromOnnx	Detect serialized TensorRT Engine file in weights/yolov8m.engine, will load it directly.
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(108)::LoadTrtCache	Build TensorRT Engine from cache file: weights/yolov8m.engine with shape range information as below,
[INFO] fastdeploy/runtime/backends/tensorrt/trt_backend.cc(111)::LoadTrtCache	Input name: images, shape=[-1, 3, -1, -1], min=[1, 3, 640, 640], max=[40, 3, 640, 640]

[INFO] fastdeploy/runtime/runtime.cc(339)::CreateTrtBackend	Runtime initialized with Backend::TRT in Device::GPU.
============= Runtime Statis Info(yolov8) =============
Total iterations: 500
Total time of runtime: 29.7184s.
Warmup iterations: 100
Total time of runtime in warmup step: 6.63981s.
Average time of runtime exclude warmup step: 57.6966ms.
118ms
@Jiang-Jia-Jun
Collaborator

The model's built-in statistics measure only the inference engine's time, whereas the Python-side measurement covers data pre/post-processing plus the inference engine time.
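
For reference, the size of that pre/post-processing overhead can be read directly off the two numbers reported above; a back-of-the-envelope check in Python, using only the values from this issue's output:

# Values taken from the benchmark output above.
python_total_ms = 118.0    # wall-clock mean per batch_predict() call
engine_only_ms = 57.6966   # "Average time of runtime exclude warmup step"

overhead_ms = python_total_ms - engine_only_ms
print(f'pre/post-processing overhead: ~{overhead_ms:.1f}ms per batch of 20 images '
      f'({overhead_ms / python_total_ms:.0%} of the Python-side total)')
# -> pre/post-processing overhead: ~60.3ms per batch of 20 images (51% of the Python-side total)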

@ucsk
Author

ucsk commented Oct 23, 2024

> The model's built-in statistics measure only the inference engine's time, whereas the Python-side measurement covers data pre/post-processing plus the inference engine time.

Currently YOLOv8's preprocessing does not inherit from ProcessorManager, so it does not support CV-CUDA acceleration.

After that part of the code is adapted, how does one correctly replace the default preprocessing with CV-CUDA in Python?

Is it enough to initialize the model and call model.preprocessor.use_cuda(True, 0)?

model = fd.vision.detection.YOLOv8(...)
# (no use_cuda call)                   # CPU preprocessing (default)
model.preprocessor.use_cuda(False, 0)  # CUDA preprocessing on GPU 0
model.preprocessor.use_cuda(True, 0)   # CV-CUDA preprocessing on GPU 0
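
A minimal sketch of how this would slot into the benchmark above, assuming the YOLOv8 preprocessor were adapted to inherit ProcessorManager so that use_cuda(enable_cv_cuda, gpu_id) carries the semantics listed above (hypothetical until that adaptation lands):

import fastdeploy as fd
import numpy as np

option = fd.RuntimeOption()
option.use_gpu(0)
option.use_trt_backend()
model = fd.vision.detection.YOLOv8('weights/yolov8m.onnx', runtime_option=option)

# Hypothetical until YOLOv8's preprocessor inherits ProcessorManager:
# True selects CV-CUDA kernels (False would select plain CUDA), 0 is the GPU id;
# omitting the call keeps the default CPU (OpenCV) preprocessing.
model.preprocessor.use_cuda(True, 0)

ims = [np.random.randint(0, 256, (360, 640, 3), dtype=np.uint8) for _ in range(20)]
results = model.batch_predict(ims)  # preprocessing would now run on GPU 0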
