🚀 The feature, motivation and pitch
I will describe in more detail here what was said earlier.
We use vLLM's OpenAI-compatible interface in our ETL service; our main tasks are batch generation and vector matching.
When processing large batch responses, we notice garbage-collector overhead and the cost of receiving uncompressed JSON text that contains the vector values as Base64 strings. This results in significant memory usage and GC pressure, and also puts a strain on the network.
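To make the overhead concrete, here is a minimal sketch of the decode path on our side, assuming OpenAI-style embeddings with `encoding_format="base64"` (a Base64-encoded little-endian Float32 vector; the helper name is mine):

```python
# Minimal sketch of the current decode path (assumption: the server
# returns OpenAI-style embeddings with encoding_format="base64",
# i.e. the vector is a Base64 string of little-endian float32).
import base64
import numpy as np

def decode_embedding(b64_vector: str) -> np.ndarray:
    # The whole JSON body, including this Base64 string, has to be
    # held in memory and parsed before we ever reach the raw bytes.
    raw = base64.b64decode(b64_vector)
    return np.frombuffer(raw, dtype="<f4")  # little-endian float32
```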
I'd like to suggest several improvements:

- Response compression when the client sends an `Accept-Encoding: zstd, gzip` header (a server-side sketch follows this list).
- This is quite radical, but would be very effective: add a new request parameter, for example `is_binary_response`. If true, return the response not as JSON but as a sequence of binary tuples. We always know the tuple length when making a request, so it is very easy to parse a binary response with this structure: `index: UInt16, tokens: UInt32, vector: FixedList` (a client-side parsing sketch follows this list).
- This is a small thing, but still: byte order. Our ETL service often doesn't work with the vector itself; it just saves it to another database, for example Postgres, whose binary protocol requires values to be transmitted in big-endian. Converting the vector to the required byte order is not difficult, but a query parameter such as `endian`, controlling the byte order of the response, would be useful (a conversion sketch follows this list).
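For the first point, a minimal server-side sketch, assuming access to the FastAPI `app` object of the OpenAI server (gzip only; zstd would need a third-party middleware):

```python
# Hedged sketch: gzip response compression via Starlette middleware.
from fastapi import FastAPI
from starlette.middleware.gzip import GZipMiddleware

app = FastAPI()

# Responses larger than ~1 KiB get gzip-compressed, but only when the
# client's Accept-Encoding header includes gzip.
app.add_middleware(GZipMiddleware, minimum_size=1024)
```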
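For the second point, a sketch of how a client could parse the proposed binary layout. Everything here is hypothetical: the layout, the little-endian framing, and the Float32 element type are my assumptions, not an existing vLLM format:

```python
# Hypothetical parser for the proposed layout:
#   index: UInt16, tokens: UInt32, vector: FixedList (assumed Float32).
import struct

def parse_binary_response(payload: bytes, dim: int):
    """Yield (index, tokens, vector) tuples; `dim` is known from the request."""
    fmt = f"<HI{dim}f"           # '<' = little-endian, no padding
    size = struct.calcsize(fmt)  # 2 + 4 + 4*dim bytes per tuple
    for offset in range(0, len(payload), size):
        fields = struct.unpack_from(fmt, payload, offset)
        yield fields[0], fields[1], list(fields[2:])
```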
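And for the third point, the conversion we currently do by hand; a `numpy` byte-swap like this is exactly what an `endian` parameter would save:

```python
# Sketch: re-encoding a little-endian float32 vector as big-endian,
# as required by Postgres' binary COPY protocol.
import numpy as np

# Placeholder input: a little-endian float32 vector as raw bytes.
raw_bytes = np.asarray([0.1, 0.2, 0.3], dtype="<f4").tobytes()

vector = np.frombuffer(raw_bytes, dtype="<f4")  # decode little-endian
payload = vector.astype(">f4").tobytes()        # byte-swap to big-endian
```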
Thanks.
Alternatives
No response
Additional context
No response