PerformanceMetrics
The software is verified using the Inference Engine component of the Intel Distribution of OpenVINO Toolkit and several other frameworks.
The OpenVINO toolkit provides two inference modes.
- Latency mode. This mode involves creating and executing a single request to infer the model on the selected device. The next inference request is created after the previous one has completed. During performance analysis, the number of requests is determined by the number of iterations of the test loop. Latency mode minimizes the inference time of a single request.
- Throughput mode. This mode involves creating a set of requests to infer the neural network on the selected device. Requests are completed in an arbitrary order. The number of request sets is determined by the number of iterations of the test loop. Throughput mode minimizes the inference time of the whole set of requests.
Inference Engine provides two programming interfaces.
- Sync API is used to implement latency mode.
- Async API is used to implement latency mode when a single request is created, and throughput mode otherwise.
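Below is a minimal sketch of both interfaces, assuming the OpenVINO 2.x Python API; the model path, device name, input shape, and loop counts are illustrative assumptions rather than the project's actual configuration.

```python
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")          # placeholder model path
compiled = core.compile_model(model, "CPU")   # placeholder device
batch = np.zeros((1, 3, 224, 224), dtype=np.float32)  # example input batch

num_iterations = 10   # illustrative value
num_requests = 4      # illustrative value

# Sync API (latency mode): a single request, each inference blocks until it completes.
request = compiled.create_infer_request()
for _ in range(num_iterations):
    request.infer({0: batch})

# Async API (throughput mode): a pool of requests executed concurrently.
queue = ov.AsyncInferQueue(compiled, jobs=num_requests)
for _ in range(num_iterations * num_requests):
    queue.start_async({0: batch})
queue.wait_all()
```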
A single inference request corresponds to the feed forward of the neural network for a batch of images. Required test parameters:
- batch size,
- number of iterations (the number of times one request is inferred for the latency mode and a set of requests for the throughput mode),
- number of requests created in throughput mode.
Inference can be executed in multi-threading mode. The number of threads is an inference parameter (by default, it equals the number of physical cores).
In throughput mode, requests can be executed in parallel using streams. A stream is a group of physical threads. The number of streams is also a configurable parameter.
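A hedged configuration sketch follows; the property names are taken from the OpenVINO 2.x Python API, and the specific numbers of streams and threads are illustrative only.

```python
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # placeholder model path

# Configure the number of execution streams (throughput mode) and the number
# of inference threads; both values below are illustrative.
compiled = core.compile_model(
    model,
    "CPU",
    {"NUM_STREAMS": 4, "INFERENCE_NUM_THREADS": 8},
)
```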
Since the OpenVINO toolkit provides two inference modes, performance measurements are taken for each mode. When evaluating inference performance for the latency mode, requests are executed sequentially: the next request is inferred after the completion of the previous one. The duration of each request is measured. The standard deviation is calculated over the set of obtained durations, and the durations that go beyond three standard deviations relative to the mean inference time are discarded. The final set of times is used to calculate the performance metrics for the latency mode.
- Latency is the median of execution times.
- Average time of a single pass is the ratio of the total execution time of all iterations to the number of iterations.
- Batch FPS is the ratio of the batch size to the latency.
- FPS is the ratio of the total number of processed images to the total execution time.
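The filtering and the latency-mode metrics described above can be summarized by the following sketch; the function name is illustrative, and the assumption that the filtered set of times is used for all metrics follows the description above.

```python
import numpy as np

def latency_metrics(durations, batch_size):
    """Compute latency-mode metrics from per-request execution times (seconds)."""
    times = np.asarray(durations, dtype=float)
    # Discard measurements more than three standard deviations away from the mean.
    mean, std = times.mean(), times.std()
    filtered = times[np.abs(times - mean) <= 3 * std]

    latency = float(np.median(filtered))               # median of execution times
    average_time = filtered.sum() / len(filtered)      # total time / number of iterations
    batch_fps = batch_size / latency                   # batch size / latency
    fps = batch_size * len(filtered) / filtered.sum()  # processed images / total time
    return latency, average_time, batch_fps, fps
```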
For the throughput mode, performance metrics are provided below.
- Average time of a single pass is the ratio of the execution time of all request sets to the number of iterations of the test loop. It is the execution time of one set of simultaneously created requests on the device.
- Batch FPS is the ratio of the product of the batch size and the number of iterations to the execution time of all requests.
- FPS is the ratio of the total number of processed images to the total execution time.
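A corresponding sketch for the throughput-mode metrics; the function signature is illustrative, total_time is the execution time of all request sets, and total_images is the total number of processed images.

```python
def throughput_metrics(total_time, batch_size, num_iterations, total_images):
    """Compute throughput-mode metrics from the total execution time (seconds)."""
    average_time = total_time / num_iterations            # time of one set of requests
    batch_fps = batch_size * num_iterations / total_time  # per the Batch FPS definition above
    fps = total_images / total_time                       # processed images / total time
    return average_time, batch_fps, fps
```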
Along with the OpenVINO toolkit, DLI supports inference using Intel Optimization for Caffe and Intel Optimization for TensorFlow. Both frameworks work in a single mode, similar to the latency mode of the OpenVINO toolkit, so the corresponding performance metrics are valid for these frameworks as well.
Starting from a certain version of the software, Latency and Average time of a single pass are not published on the project web page, since the FPS indicator is more representative, and the other indicators can be calculated based on the FPS.