http stream response via HTTP 1.1 chunked encoding #2232

Open
Tracked by #2234

lxning opened this issue Apr 16, 2023 · 0 comments
lxning commented Apr 16, 2023

🚀 The feature

TorchServe should support streaming responses on both its HTTP and gRPC endpoints.
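
On the HTTP side, one possible shape for this is a handler-side helper that flushes each partial result to the open connection as a chunk. A minimal sketch below, assuming a helper named `send_intermediate_predict_response` and a `generate_tokens` generator; both names and the signature are illustrative, not a confirmed TorchServe API at the time of this issue:

```python
# Illustrative only: send_intermediate_predict_response is an assumed helper
# that writes one HTTP chunk to the client per call.
from ts.protocol.otf_message_handler import send_intermediate_predict_response

def handle(data, context):
    # Emit partial results as they become available; each call would be
    # written to the response via HTTP 1.1 chunked transfer encoding.
    for token_batch in generate_tokens(data):  # hypothetical generator
        send_intermediate_predict_response(
            [token_batch],
            context.request_ids,
            "Intermediate Prediction success",
            200,
            context,
        )
    # The value returned from handle() becomes the final chunk.
    return ["<end of stream>"]
```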

Motivation, pitch

Prediction latency for large-model inference is usually high (e.g., 5 seconds). Some models can produce intermediate prediction results (e.g., generative AI models). This feature sends intermediate results to the user as soon as they are ready, so the user receives the response incrementally: for example, a first partial response within 1 second, with the complete result arriving by the 5-second mark. The goal is to improve the user's perceived prediction latency.
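
On the client side, no TorchServe-specific support is needed: HTTP 1.1 chunked transfer encoding is handled by any standard HTTP client. A minimal sketch with the `requests` library, where the endpoint URL and payload are placeholders:

```python
import requests

# Placeholder endpoint and payload; adjust for the actual model and host.
url = "http://localhost:8080/predictions/my_model"

# stream=True tells requests not to buffer the whole body; iter_content()
# then yields data as chunks arrive, so partial predictions can be shown
# to the user before the full response completes.
with requests.post(url, data={"prompt": "hello"}, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)
```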

Alternatives

No response

Additional context

No response

@lxning lxning self-assigned this Apr 16, 2023
@lxning lxning added the enhancement New feature or request label Apr 16, 2023
@lxning lxning added this to the v0.8.0 milestone Apr 16, 2023
@lxning lxning changed the title http1.1 stream response via chunked encoding http stream response via HTTP 1.1 chunked encoding Apr 16, 2023