Add support for server-side batch processing on Tensorflow/ONNX Predictors #1060

Closed
@RobertLucian

Description

At the moment, the TensorFlow/ONNX Predictor APIs compute a prediction as soon as a request is received. Instead, let the API accumulate batch_size requests and then run inference on them in a single pass on the computing hardware. If a batch of batch_size requests can't be filled within a given batch_timeout timeframe, run inference on whatever requests have accumulated so far.

Add a batch_size field to the configuration file to set the batch size. By default, the field's value should be 1.
Also add a batch_timeout field to the configuration file to tune the API's latency/throughput trade-off when batch_size > 1.
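To illustrate the intended accumulate-and-flush behavior, here is a minimal sketch assuming a background worker that drains a request queue; the names (run_inference, batching_worker, request_queue) and the threading model are hypothetical and not part of the actual Predictor implementation.

```python
import queue
import threading
import time


def run_inference(batch):
    # Hypothetical stand-in for the Predictor's model call;
    # it would run one forward pass over the whole batch.
    raise NotImplementedError


def batching_worker(requests, batch_size=1, batch_timeout=0.1):
    """Accumulate up to `batch_size` requests, or whatever arrives
    within `batch_timeout` seconds, then run a single inference."""
    while True:
        batch = [requests.get()]  # block until at least one request exists
        deadline = time.monotonic() + batch_timeout
        while len(batch) < batch_size:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # timeout reached; proceed with a partial batch
            try:
                batch.append(requests.get(timeout=remaining))
            except queue.Empty:
                break  # nothing else arrived in time
        # With batch_size=1 this degenerates to per-request inference.
        run_inference(batch)


# Hypothetical wiring: request handlers put payloads on the queue,
# while a single worker thread drains it in batches.
request_queue = queue.Queue()
threading.Thread(
    target=batching_worker,
    args=(request_queue,),
    kwargs={"batch_size": 8, "batch_timeout": 0.05},
    daemon=True,
).start()
```

In the real API, each request handler would also need a way to receive its own result back from the shared batch (e.g. a per-request future), which is omitted here for brevity.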

Motivation

Server-side batching can increase throughput substantially. The actual gain depends on the underlying hardware, the model being served, and the rate of incoming requests the API receives.

Labels

enhancement (New feature or request), example (Create or improve an example)
