Description
At the moment, the TensorFlow/ONNX Predictor APIs compute a prediction immediately as each request is registered. Instead, let the API accumulate `batch_size` requests and then run inference on them as a single batch on the computing hardware. If a pool of `batch_size` requests can't be filled within a given `batch_timeout` timeframe, run inference on whatever requests have accumulated so far. A minimal sketch of this behavior follows.
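To make the intended behavior concrete, here is a minimal, hypothetical sketch of the accumulate-then-flush logic. The `RequestBatcher` class, `predict_fn`, and the seconds unit for the timeout are all illustrative assumptions, not Cortex's actual API:

```python
import threading


class RequestBatcher:
    """Accumulate up to `batch_size` requests, then run one batched inference.

    If `batch_size` requests don't arrive within `batch_timeout` seconds of
    the first one, inference runs on whatever has accumulated so far.
    """

    def __init__(self, predict_fn, batch_size=1, batch_timeout=0.1):
        self.predict_fn = predict_fn        # callable taking a list of payloads
        self.batch_size = batch_size
        self.batch_timeout = batch_timeout  # seconds (unit is an assumption)
        self._lock = threading.Lock()
        self._batch = []
        self._generation = 0  # bumped on every flush, to invalidate stale timers

    def submit(self, payload):
        """Register one request and block until its prediction is ready."""
        slot = {"payload": payload, "event": threading.Event(), "result": None}
        with self._lock:
            self._batch.append(slot)
            if len(self._batch) >= self.batch_size:
                self._flush_locked()  # full batch: run inference immediately
            elif len(self._batch) == 1:
                # first request of a new batch: arm the timeout flush
                gen = self._generation
                threading.Timer(
                    self.batch_timeout, self._flush_on_timeout, args=(gen,)
                ).start()
        slot["event"].wait()
        return slot["result"]

    def _flush_on_timeout(self, gen):
        with self._lock:
            if gen == self._generation:  # batch wasn't already flushed by size
                self._flush_locked()

    def _flush_locked(self):
        # Caller must hold self._lock. Running inference under the lock blocks
        # new submissions; a production version would hand the batch off to a
        # worker thread instead.
        if not self._batch:
            return
        self._generation += 1
        batch, self._batch = self._batch, []
        results = self.predict_fn([s["payload"] for s in batch])
        for slot, result in zip(batch, results):
            slot["result"] = result
            slot["event"].set()
```

With `batch_size=1` (the proposed default), `submit` flushes on every call, which preserves today's one-request-per-inference behavior.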
Add a `batch_size` field to the configuration file to set a different batch size; by default, the field's value should be 1. Also add a `batch_timeout` field to the configuration file to tune the API's latency and throughput when `batch_size` > 1.
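For illustration, the proposed fields might look roughly like this in an API configuration; the surrounding layout is an assumption based on existing Cortex API configs, and only `batch_size` and `batch_timeout` come from this proposal:

```yaml
- name: my-api          # hypothetical API name
  predictor:
    type: tensorflow
    path: predictor.py
    batch_size: 32      # accumulate up to 32 requests per inference (default: 1)
    batch_timeout: 0.1  # how long to wait before running a partial batch
```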
Motivation
This increases throughput substantially. The size of the gain depends on the underlying hardware, the model being served, and the rate of incoming requests the API is experiencing.
Additional context