English | 中文

Multi-thread and Multi-process Prediction with FastDeploy Models

FastDeploy provides the following multi-thread and multi-process examples for Python and C++ developers.

Models that currently support multi-thread and multi-process prediction

| Task type | Description | Model download link |
| --- | --- | --- |
| Detection | supports PaddleDetection series models | PaddleDetection |
| Segmentation | supports PaddleSeg series models | PaddleSeg |
| Classification | supports PaddleClas series models | PaddleClas |
| OCR | supports PaddleOCR series models | PaddleOCR |

Notice:

  • Click a model download link above and download the model from its "Download pre-trained model" section.
  • OCR is a pipeline model; for its multi-thread example, refer to the pipeline folder. The multi-thread examples for the other, single models are in the single_model folder.

Cloning a model for multi-thread prediction

The inference process of a vision model consists of three stages:

  • preprocess: load the image, preprocess it, and produce the input Tensor for the model Runtime
  • infer: the model Runtime receives the input Tensor, runs inference, and produces the output Tensor
  • postprocess: process the Runtime's output Tensor into the final structured result, such as DetectionResult or SegmentationResult

For these three stages (preprocess, infer, and postprocess), FastDeploy provides three corresponding abstractions: Preprocessor, Runtime, and PostProcessor.

When using FastDeploy for multi-thread inference, two issues should be considered:

  • Can the Preprocessor, Runtime, and PostProcessor each handle parallel processing?
  • On the premise of supporting multi-thread concurrency, can memory or GPU memory usage be minimized?

For multi-thread inference, FastDeploy copies a separate set of objects for each thread, so that every thread holds independent instances of Preprocessor, Runtime, and PostProcessor. To reduce memory usage, the cloned Runtimes share a single copy of the model weights, which cuts down the footprint of the duplicated objects.

FastDeploy provides the following interfaces to clone a model (taking PaddleClas as an example):

  • Python: PaddleClasModel.clone()
  • C++: PaddleClasModel::Clone()

Python

import cv2
import fastdeploy as fd

option = fd.RuntimeOption()
model = fd.vision.classification.PaddleClasModel(model_file,
                                                 params_file,
                                                 config_file,
                                                 runtime_option=option)
model2 = model.clone()  # the clone shares the model weights with the original
im = cv2.imread(image)
res = model.predict(im)
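
With clone(), every worker thread can hold its own model handle while the weights stay shared. Below is a minimal multi-thread sketch; it reuses model and image from the snippet above, and the per-thread model.clone() call is the essential part, the rest is plain Python threading:

import threading

def predict_worker(m, image_path):
    # each thread predicts with its own cloned model instance
    im = cv2.imread(image_path)
    print(m.predict(im))

threads = [threading.Thread(target=predict_worker, args=(model.clone(), image))
           for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()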

C++

auto model = fastdeploy::vision::classification::PaddleClasModel(model_file,
                                                                 params_file,
                                                                 config_file,
                                                                 option);
auto model2 = model.Clone();  // the clone shares the model weights with the original
auto im = cv::imread(image_file);
fastdeploy::vision::ClassifyResult res;
model.Predict(im, &res);

Notice: for the APIs of other models, refer to the official C++ documentation and the official Python documentation.

Python multi-thread and multi-process

Because of the GIL (Global Interpreter Lock), Python multi-threading cannot fully utilize hardware resources in compute-intensive scenarios. Therefore, both multi-process and multi-thread examples are provided for Python. Their similarities and differences are as follows:

Comparison of multi-process and multi-thread inference with FastDeploy models

| | Resource usage | Compute-intensive tasks | I/O-intensive tasks | Inter-process / inter-thread communication |
| --- | --- | --- | --- | --- |
| multi-process | large | fast | fast | slow |
| multi-thread | small | slow | relatively fast | fast |

Notice: the above analysis is theoretical. In practice, Python applies certain optimizations to different computing tasks; for example, numpy computations already run in parallel across multiple threads. Moreover, aggregating results across processes involves time-consuming inter-process communication, and it is often hard to tell whether a task is compute-intensive or I/O-intensive, so everything should be benchmarked on the actual task.
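
For reference, a minimal multi-process sketch is shown below (model_file, params_file, config_file, and image_paths are placeholders; each worker process builds its own model in the pool initializer, since model objects cannot be shared across processes):

import multiprocessing as mp

import cv2
import fastdeploy as fd

def init_worker():
    # every worker process creates its own model; weights are not shared across processes
    global worker_model
    option = fd.RuntimeOption()
    worker_model = fd.vision.classification.PaddleClasModel(
        model_file, params_file, config_file, runtime_option=option)

def predict_one(image_path):
    im = cv2.imread(image_path)
    # return plain Python data, which is cheap to send back across processes
    return worker_model.predict(im).label_ids

if __name__ == "__main__":
    with mp.Pool(processes=3, initializer=init_worker) as pool:
        results = pool.map(predict_one, image_paths)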

C++ multi-thread

C++ multi-threading occupies fewer resources and runs fast, so multi-threaded inference is the best choice in C++.

C++ memory usage: multi-thread with Clone vs. without Clone

Hardware: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz
Model: ResNet50_vd_infer
Backend: OPENVINO (CPU)

Memory usage when initializing multiple models in a single process

| Number of models | After model.Clone() | After model->Predict() with model.Clone() | After initializing model without model.Clone() | After model->Predict() without model.Clone() |
| --- | --- | --- | --- | --- |
| 1 | 322M | 325M | 322M | 325M |
| 2 | 322M | 325M | 559M | 560M |
| 3 | 322M | 325M | 771M | 771M |

Memory usage with multiple threads

| Number of threads | After model.Clone() | After model->Predict() with model.Clone() | After initializing model without model.Clone() | After model->Predict() without model.Clone() |
| --- | --- | --- | --- | --- |
| 1 | 322M | 337M | 322M | 337M |
| 2 | 322M | 343M | 548M | 566M |
| 3 | 322M | 347M | 752M | 784M |