Description
At the moment, the TensorFlow/ONNX Predictor APIs compute a prediction immediately as each request is registered. Instead, let the API accumulate `batch_size` requests and then run the inference on the computing hardware. If a pool of `batch_size` requests can't be filled within a given `batch_timeout` timeframe, run inference on whatever requests have accumulated so far.
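A minimal sketch of the proposed accumulation logic, assuming requests arrive on a simple queue (the function and parameter names below are hypothetical, not the actual Predictor implementation):

```python
import queue
import time


def collect_batch(request_queue, batch_size=1, batch_timeout=0.1):
    # Block until the first request arrives, then keep accumulating until either
    # batch_size requests have been collected or batch_timeout seconds have passed.
    batch = [request_queue.get()]
    deadline = time.monotonic() + batch_timeout
    while len(batch) < batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(request_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch


def serve_loop(request_queue, predict_fn, batch_size, batch_timeout):
    # predict_fn stands in for the TensorFlow/ONNX session call and is assumed
    # to accept a list of inputs; with batch_size=1 this reduces to the current
    # one-request-per-inference behavior.
    while True:
        batch = collect_batch(request_queue, batch_size, batch_timeout)
        predict_fn(batch)
```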
Add a `batch_size` field to the configuration file to set a different batch size; by default, the field's value should be 1. Also, add a `batch_timeout` field to the configuration file to tune the API's latency and throughput when `batch_size` > 1.
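A sketch of how the two fields might be read, assuming the configuration is exposed to the Predictor as a dictionary (the loader function, the fallback value for `batch_timeout`, and the time unit are assumptions, not part of this proposal):

```python
def load_batching_config(api_config: dict):
    # batch_size defaults to 1, which preserves the current behavior of running
    # inference immediately for every request; batch_timeout (assumed to be in
    # seconds, fallback value arbitrary) only has an effect when batch_size > 1.
    batch_size = int(api_config.get("batch_size", 1))
    batch_timeout = float(api_config.get("batch_timeout", 0.1))
    return batch_size, batch_timeout
```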
Motivation
This increases throughput substantially. The gain depends on the underlying hardware, the model being served, and the rate of incoming requests the API is experiencing.