🚀 The feature, motivation and pitch
I will describe in more detail here what was said earlier.
We use vLLM's OpenAI-compatible interface in our ETL service; our main tasks are batch generation and vector matching.
When processing large batch responses, we notice garbage-collector overhead and the cost of receiving uncompressed JSON text that contains the vector values as Base64 strings. This results in significant memory usage and GC pressure, and also puts a strain on the network.
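To make the overhead concrete, here is a minimal sketch of the decode path on our side, assuming OpenAI-style embeddings with `encoding_format="base64"` (a Base64-encoded little-endian Float32 vector; the helper name is mine):

```python
# Minimal sketch of the current decode path (assumption: the server
# returns OpenAI-style embeddings with encoding_format="base64",
# i.e. the vector is a Base64 string of little-endian float32).
import base64
import numpy as np

def decode_embedding(b64_vector: str) -> np.ndarray:
    # The whole JSON body, including this Base64 string, has to be
    # held in memory and parsed before we ever reach the raw bytes.
    raw = base64.b64decode(b64_vector)
    return np.frombuffer(raw, dtype="<f4")  # little-endian float32
```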
I'd like to suggest several improvements:

- Response compression when the client sends an `Accept-Encoding: zstd, gzip` header (a server-side sketch follows this list).
- This is quite radical, but would be very effective: add a new request parameter, for example `is_binary_response`. If true, return the response not as JSON but as a sequence of binary tuples. We always know the tuple length when making a request, so it is very easy to parse a binary response with this structure: `index: UInt16, tokens: UInt32, vector: FixedList` (a client-side parsing sketch follows this list).
- This is a small thing, but still: byte order. Our ETL service often doesn't work with the vector itself; it just saves it to another database, for example Postgres, whose binary protocol requires values to be transmitted in big-endian. Converting the vector to the required byte order is not difficult, but a query parameter such as `endian`, controlling the byte order of the response, would be useful (a conversion sketch follows this list).
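For the first point, a minimal server-side sketch, assuming access to the FastAPI `app` object of the OpenAI server (gzip only; zstd would need a third-party middleware):

```python
# Hedged sketch: gzip response compression via Starlette middleware.
from fastapi import FastAPI
from starlette.middleware.gzip import GZipMiddleware

app = FastAPI()

# Responses larger than ~1 KiB get gzip-compressed, but only when the
# client's Accept-Encoding header includes gzip.
app.add_middleware(GZipMiddleware, minimum_size=1024)
```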
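For the second point, a sketch of how a client could parse the proposed binary layout. Everything here is hypothetical: the layout, the little-endian framing, and the Float32 element type are my assumptions, not an existing vLLM format:

```python
# Hypothetical parser for the proposed layout:
#   index: UInt16, tokens: UInt32, vector: FixedList (assumed Float32).
import struct

def parse_binary_response(payload: bytes, dim: int):
    """Yield (index, tokens, vector) tuples; `dim` is known from the request."""
    fmt = f"<HI{dim}f"           # '<' = little-endian, no padding
    size = struct.calcsize(fmt)  # 2 + 4 + 4*dim bytes per tuple
    for offset in range(0, len(payload), size):
        fields = struct.unpack_from(fmt, payload, offset)
        yield fields[0], fields[1], list(fields[2:])
```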
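And for the third point, the conversion we currently do by hand; a `numpy` byte-swap like this is exactly what an `endian` parameter would save:

```python
# Sketch: re-encoding a little-endian float32 vector as big-endian,
# as required by Postgres' binary COPY protocol.
import numpy as np

# Placeholder input: a little-endian float32 vector as raw bytes.
raw_bytes = np.asarray([0.1, 0.2, 0.3], dtype="<f4").tobytes()

vector = np.frombuffer(raw_bytes, dtype="<f4")  # decode little-endian
payload = vector.astype(">f4").tobytes()        # byte-swap to big-endian
```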
Thanks.
Alternatives
No response
Additional context
No response