Description
At the moment, the TensorFlow/ONNX Predictor APIs run a prediction immediately as each request arrives. Instead, let the API accumulate up to batch_size requests and then run the inference on the computing hardware as a single batch. If batch_size requests haven't accumulated within a given batch_timeout window, run the inference on whatever requests have been collected so far (see the sketch below).
Add a batch_size field to the configuration file to set the batch size. By default, the field's value should be 1.
Also, add a batch_timeout field to the configuration file to tune the API's latency/throughput trade-off when batch_size > 1.
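
A minimal sketch of the accumulate-and-flush behavior described above, assuming a simple queue-based worker loop; the BATCH_SIZE/BATCH_TIMEOUT constants, the predict_batch callback, and the queue are illustrative placeholders, not Cortex's actual implementation:

```python
import queue
import threading
import time

# Hypothetical configuration values; in practice these would come from the
# API's configuration file (names follow the fields proposed above).
BATCH_SIZE = 8        # maximum number of requests per inference call
BATCH_TIMEOUT = 0.05  # seconds to wait before flushing a partial batch

request_queue = queue.Queue()

def batching_loop(predict_batch):
    """Accumulate up to BATCH_SIZE requests, or flush whatever has arrived
    once BATCH_TIMEOUT has elapsed, then run one inference over the batch."""
    while True:
        batch = [request_queue.get()]  # block until at least one request arrives
        deadline = time.monotonic() + BATCH_TIMEOUT
        while len(batch) < BATCH_SIZE:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        predict_batch(batch)  # one inference call for the whole batch

# Example usage: a stand-in predictor that just reports the batch size.
if __name__ == "__main__":
    threading.Thread(
        target=batching_loop,
        args=(lambda batch: print(f"ran inference on {len(batch)} request(s)"),),
        daemon=True,
    ).start()
    for i in range(20):
        request_queue.put({"payload": i})
        time.sleep(0.01)
    time.sleep(0.5)
```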
Motivation
Batching can increase throughput substantially; the actual gain depends on the underlying hardware, the model being served, and the rate of incoming requests the API receives.