[Frontend][3/N] Improve all pooling task | Support binary embedding response #27066

noooop · 2025-10-17T02:15:54Z

TL;DR

Support endianness: ["native", "big", "little"], native by default.
Support the bytes encoding_format, a very simple (but highly efficient) binary embedding response method:
- First, write the metadata into the headers, then write all the binary data in order into the response body.
- When reading, first read the headers, then read the body at the recorded offsets into the corresponding tensors.
This API provides the following benefits @uasan

Significant reduction in response size
No need for JSON.parse and Base64.decode
Possibility of stream processing of the response
Endianness customization
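The headers-plus-body layout described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`pack_embeddings`, `unpack_embeddings`) and the metadata fields (`index`, `start`, `end`, `shape`, `dtype`, `endianness`) are hypothetical and are not taken from this PR's wire format.

```python
import json
import struct
import sys

# Hypothetical mapping from an endianness name to a struct byte-order prefix.
_ENDIAN_PREFIX = {"little": "<", "big": ">"}


def _resolve(endianness: str) -> str:
    """Turn "native" into the concrete byte order of this machine."""
    return sys.byteorder if endianness == "native" else endianness


def pack_embeddings(embeddings, endianness="native"):
    """Return (metadata_json, body) for a list of float32 vectors.

    The metadata would travel in a response header, the body as raw bytes.
    """
    resolved = _resolve(endianness)
    prefix = _ENDIAN_PREFIX[resolved]
    items, chunks, offset = [], [], 0
    for index, vec in enumerate(embeddings):
        raw = struct.pack(f"{prefix}{len(vec)}f", *vec)
        items.append({"index": index, "start": offset,
                      "end": offset + len(raw), "shape": [len(vec)]})
        chunks.append(raw)
        offset += len(raw)
    metadata = {"dtype": "float32", "endianness": resolved, "data": items}
    return json.dumps(metadata), b"".join(chunks)


def unpack_embeddings(metadata_json, body):
    """Read the metadata first, then slice the body at the recorded offsets."""
    meta = json.loads(metadata_json)
    prefix = _ENDIAN_PREFIX[meta["endianness"]]
    vectors = []
    for item in meta["data"]:
        raw = body[item["start"]:item["end"]]
        count = item["shape"][0]
        vectors.append(list(struct.unpack(f"{prefix}{count}f", raw)))
    return vectors


if __name__ == "__main__":
    meta, body = pack_embeddings([[1.0, -2.5, 0.25], [3.0, 4.0, 8.5]], "big")
    print(len(body), unpack_embeddings(meta, body))
```

Because each item records its own byte range, a reader could decode one vector as soon as its bytes have arrived, which is what makes stream processing possible in principle.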

Improve all pooling task

These PRs mostly conflict with each other, so combining them into a series better informs reviewers about what has already happened and what still needs to be done afterwards.

Purpose

Apply response compression if the client sends the 'accept-encoding' header: 'zstd, gzip'.
Use response compression to transmit binary data that doesn't need base64, cool. base64 is very inefficient.
I wouldn't have thought of this hacky method before seeing this issue.
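As a rough illustration of the base64 overhead mentioned above, the sketch below compares the raw size of a float32 vector with its base64-encoded size. The helper name `payload_sizes` and the 1024-dimension example are assumptions chosen for illustration, not values from this PR.

```python
import base64
import struct


def payload_sizes(dim: int) -> dict:
    """Compare raw float32 bytes with their base64 encoding for one vector."""
    raw = struct.pack(f"<{dim}f", *([0.5] * dim))  # 4 bytes per float32 value
    encoded = base64.b64encode(raw)  # 4 output bytes per 3 input bytes, padded
    return {"raw": len(raw), "base64": len(encoded)}


if __name__ == "__main__":
    print(payload_sizes(1024))  # {'raw': 4096, 'base64': 5464}
```

Base64 always expands the payload by about one third before any JSON framing is added, and the client still has to decode it afterwards.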

Thanks @uasan for this cool idea

Fix #27063

cc @christian-pinto @maxdebayser @DarkLight1337

Test Plan

tests/utils_/test_serial_utils.py
tests/entrypoints/pooling/openai/test_embedding.py
tests/entrypoints/pooling/openai/test_pooling.py

Test Result

pass

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wang.yuqi <noooop@126.com>


mergify · 2025-10-20T07:56:42Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify · 2025-10-20T08:05:05Z

Documentation preview: https://vllm--27066.org.readthedocs.build/en/27066/

noooop · 2025-10-20T16:21:41Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces an excellent feature for improving performance by supporting binary embedding responses, including a new bytes encoding format and endianness customization. The refactoring of serialization logic into vllm/utils/serial_utils.py is a positive structural change.

However, my review has identified a critical data corruption bug in the new serialization logic specifically for bfloat16 tensors. The current implementation incorrectly reinterprets bfloat16 bit patterns as float16, which leads to corrupted data. This issue is not caught by the existing tests because they only verify round-trip consistency, where the symmetric corruption cancels itself out. I've provided a detailed comment with a code suggestion to fix this critical bug.
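The reviewer's concern can be illustrated with a small standard-library sketch: a bfloat16 bit pattern decodes to a different number when it is read as float16, yet a float16 round trip returns the same bits, which is why a round-trip-only test would not notice. The helper names are hypothetical and this does not reproduce the code in vllm/utils/serial_utils.py. A pure bit-level view is lossless as long as the receiver views the bytes back as bfloat16.

```python
import struct


def float_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its upper 16 bits (bfloat16, no rounding)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def bf16_bits_to_float(bits: int) -> float:
    """Decode 16 bits as bfloat16 by padding the low 16 bits with zeros."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return value


def bits_as_float16(bits: int) -> float:
    """Decode the same 16 bits as an IEEE 754 float16 value."""
    (value,) = struct.unpack("<e", struct.pack("<H", bits))
    return value


def float16_to_bits(value: float) -> int:
    """Encode a float16 value back into its 16-bit pattern."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits


if __name__ == "__main__":
    bits = float_to_bf16_bits(1.0)                 # 0x3F80
    print(hex(bits))
    print(bf16_bits_to_float(bits))                # 1.0 when read as bfloat16
    print(bits_as_float16(bits))                   # 1.875 when read as float16
    print(float16_to_bits(bits_as_float16(bits)) == bits)  # round trip keeps the bits
```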

vllm/utils/serial_utils.py

noooop · 2025-10-22T05:10:59Z

@DarkLight1337

Are there any more modifications needed for this PR?

vllm/entrypoints/openai/serving_embedding.py

vllm/entrypoints/openai/api_server.py

…esponse (vllm-project#27066) Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>

…esponse (vllm-project#27066) Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>

…esponse (vllm-project#27066) Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>

noooop changed the title ~~[Frontend][1/N] Improve all pooling task | Support binary Embedding response by response compression~~ [Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression Oct 17, 2025

mergify bot added the frontend label Oct 17, 2025

noooop mentioned this pull request Oct 17, 2025

[Frontend][4/N] Improve all pooling task | Add plugin pooling task #26973

Merged

5 tasks

init

92a6d2e

Signed-off-by: wang.yuqi <noooop@126.com>

noooop force-pushed the binary_response branch from 32e102c to 92a6d2e Compare October 17, 2025 02:27

init

e54fe8b

Signed-off-by: wang.yuqi <noooop@126.com>

noooop force-pushed the binary_response branch from 223eb21 to e54fe8b Compare October 17, 2025 02:34

noooop added 2 commits October 17, 2025 11:37

+ response_compression_pooling_output

83803c4

Signed-off-by: wang.yuqi <noooop@126.com>

fix

fc826d1

Signed-off-by: wang.yuqi <noooop@126.com>

noooop commented Oct 17, 2025

vllm/entrypoints/openai/utils.py (outdated, resolved)

noooop changed the title ~~[Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression~~ [Frontend][3/N] Improve all pooling task | Support binary embedding response Oct 18, 2025

noooop changed the title ~~[Frontend][3/N] Improve all pooling task | Support binary embedding response~~ [Frontend][4/N] Improve all pooling task | Support binary embedding response Oct 18, 2025

support endianness

3b10e99

Signed-off-by: wang.yuqi <noooop@126.com>

uasan reviewed Oct 19, 2025

vllm/entrypoints/openai/protocol.py (outdated, resolved)

+ tensor_serial

fa03e6c

Signed-off-by: wang.yuqi <noooop@126.com>

mergify bot added the needs-rebase label Oct 20, 2025

+ embedding_requests_base64_client

b033bda

Signed-off-by: wang.yuqi <noooop@126.com>

mergify bot added the documentation Improvements or additions to documentation label Oct 20, 2025

Merge branch 'main' into binary_response

24bb3cf

Signed-off-by: wang.yuqi <noooop@126.com>

mergify bot removed the needs-rebase label Oct 20, 2025

noooop added 4 commits October 20, 2025 16:13

fix

63502bb

Signed-off-by: wang.yuqi <noooop@126.com>

typo

c8a6d0f

Signed-off-by: wang.yuqi <noooop@126.com>

Support binary embedding response

f4882a4

Signed-off-by: wang.yuqi <noooop@126.com>

+ embedding_requests_bytes_client.py

836fc05

Signed-off-by: wang.yuqi <noooop@126.com>

fix

acc4b50

Signed-off-by: wang.yuqi <noooop@126.com>

gemini-code-assist bot reviewed Oct 20, 2025

vllm/utils/serial_utils.py (resolved)

maxdebayser reviewed Oct 20, 2025

vllm/utils/serial_utils.py (resolved)

maxdebayser reviewed Oct 20, 2025

vllm/utils/serial_utils.py (outdated, resolved)

maxdebayser reviewed Oct 20, 2025

vllm/utils/serial_utils.py (outdated, resolved)

maxdebayser reviewed Oct 20, 2025

vllm/utils/serial_utils.py (outdated, resolved)

noooop commented Oct 21, 2025

vllm/utils/serial_utils.py (resolved)

noooop added 5 commits October 21, 2025 08:54

rename

3a1302b

Signed-off-by: wang.yuqi <noooop@126.com>

StreamingResponse

07b2508

Signed-off-by: wang.yuqi <noooop@126.com>

Merge branch 'main' into binary_response

68276e6

rename

405f11d

Signed-off-by: wang.yuqi <noooop@126.com>

fix

95986f1

Signed-off-by: wang.yuqi <noooop@126.com>

DarkLight1337 reviewed Oct 22, 2025

vllm/entrypoints/openai/serving_embedding.py (resolved)

noooop commented Oct 22, 2025

vllm/entrypoints/openai/api_server.py (resolved)

DarkLight1337 approved these changes Oct 22, 2025

DarkLight1337 merged commit 1f633b8 into vllm-project:main Oct 22, 2025
51 checks passed

noooop deleted the binary_response branch October 22, 2025 10:40

noooop restored the binary_response branch October 22, 2025 11:04

noooop deleted the binary_response branch October 22, 2025 11:08

DarkLight1337 mentioned this pull request Oct 24, 2025

[Benchmark] Enable benchmark to run with encoding_format="bytes" #27467

Merged

5 tasks

uasan mentioned this pull request Oct 25, 2025

Feature: [node] Support Float16Array lancedb/lancedb#2716

Open

This was referenced Oct 28, 2025

[Frontend][Doc][5/N] Improve all pooling task | Polish encode (pooling) api & Document. #25524

Merged

[Frontend] Speed up online server preprocess by using sync tokenizer. #27407

Open
