@noooop
Collaborator

@noooop noooop commented Oct 17, 2025

TL;DR

  • Support an endianness option: ["native", "big", "little"], native by default
  • Support the bytes encoding_format, a very simple (but highly efficient) binary embedding response method
    • When writing, first write the metadata into the headers, then write all the binary data in order into the response body.
    • When reading, first read the headers, then read the body at the offsets given there into the corresponding tensors.
  • This API provides the following benefits (cc @uasan):
    1. Significant reduction in response size
    2. No need for JSON.parse and Base64.decode
    3. Possibility of stream processing of the response
    4. Endianness customization
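
The header-plus-body layout described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from this PR: the function names, the metadata fields (`dtype`, `endianness`, `data`, `start`, `end`, `shape`), and the choice to resolve "native" to the server's actual byte order are all assumptions made for the example.

```python
import json
import struct
import sys

# Hypothetical mapping from the endianness option to struct byte-order prefixes.
ENDIAN_PREFIX = {"native": "=", "little": "<", "big": ">"}


def pack_embeddings(embeddings, endianness="native"):
    """Return (metadata, body); metadata records where each vector sits in body."""
    prefix = ENDIAN_PREFIX[endianness]
    items, chunks, offset = [], [], 0
    for index, vector in enumerate(embeddings):
        # Encode one vector as consecutive float32 values.
        chunk = struct.pack(f"{prefix}{len(vector)}f", *vector)
        items.append({
            "index": index,
            "start": offset,
            "end": offset + len(chunk),
            "shape": [len(vector)],
        })
        chunks.append(chunk)
        offset += len(chunk)
    # Assumption: report the concrete byte order so any client can decode.
    resolved = sys.byteorder if endianness == "native" else endianness
    metadata = {"dtype": "float32", "endianness": resolved, "data": items}
    return metadata, b"".join(chunks)


def unpack_embeddings(metadata, body):
    """Rebuild the vectors by slicing body at the offsets listed in metadata."""
    prefix = ENDIAN_PREFIX[metadata["endianness"]]
    vectors = []
    for item in metadata["data"]:
        count = item["shape"][0]
        raw = body[item["start"]:item["end"]]
        vectors.append(list(struct.unpack(f"{prefix}{count}f", raw)))
    return vectors
```

In this sketch the metadata would travel as a JSON string in a response header (for example via `json.dumps(metadata)`), and the body would carry only the concatenated raw bytes, so the client needs neither JSON parsing of the vectors nor Base64 decoding.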

Improve all pooling task

These PRs mostly conflict with each other, so combining them into a series gives reviewers a clearer picture of what has changed and what still needs to be done afterwards.

Purpose

Compress the response if the client sends the 'accept-encoding' header: 'zstd, gzip'.
Using response compression to transmit binary data means base64 is not needed, which is cool: base64 is very inefficient, since it inflates the payload by about one third.
I would not have thought of this hacky method before seeing this issue.
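
To make the size argument concrete, here is a rough comparison of the same vector sent as raw float32 bytes, as base64, and as a JSON list of floats. It is a standard-library sketch only; the helper name `payload_sizes` and the sample vector are made up for illustration and are not part of this PR.

```python
import base64
import json
import struct


def payload_sizes(vector):
    """Return the encoded size in bytes of one vector under three encodings."""
    raw = struct.pack(f"<{len(vector)}f", *vector)   # 4 bytes per float32
    as_base64 = base64.b64encode(raw)                # 4 output bytes per 3 input bytes
    as_json = json.dumps(vector).encode("utf-8")     # decimal text per value
    return {"raw": len(raw), "base64": len(as_base64), "json": len(as_json)}


# A hypothetical 768-dimensional embedding.
sizes = payload_sizes([i / 768 for i in range(768)])
print(sizes["raw"], sizes["base64"])  # 3072 4096
```

The JSON size depends on how many digits each value prints with, so it is left out of the printed comparison; for typical embedding values it is several times larger than the raw bytes.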

Thanks @uasan for this cool idea

Fix #27063

cc @christian-pinto @maxdebayser @DarkLight1337

Test Plan

tests/utils_/test_serial_utils.py
tests/entrypoints/pooling/openai/test_embedding.py
tests/entrypoints/pooling/openai/test_pooling.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@noooop noooop changed the title from "[Frontend][1/N] Improve all pooling task | Support binary Embedding response by response compression" to "[Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression" Oct 17, 2025
@mergify mergify bot added the frontend label Oct 17, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop changed the title from "[Frontend][3/N] Improve all pooling task | Support binary Embedding response by response compression" to "[Frontend][3/N] Improve all pooling task | Support binary embedding response" Oct 18, 2025
@noooop noooop changed the title from "[Frontend][3/N] Improve all pooling task | Support binary embedding response" to "[Frontend][4/N] Improve all pooling task | Support binary embedding response" Oct 18, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@mergify

mergify bot commented Oct 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 20, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@mergify

mergify bot commented Oct 20, 2025

Documentation preview: https://vllm--27066.org.readthedocs.build/en/27066/

@mergify mergify bot added the documentation label Oct 20, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@mergify mergify bot removed the needs-rebase label Oct 20, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop
Collaborator Author

noooop commented Oct 20, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an excellent feature for improving performance by supporting binary embedding responses, including a new bytes encoding format and endianness customization. The refactoring of serialization logic into vllm/utils/serial_utils.py is a positive structural change.

However, my review has identified a critical data corruption bug in the new serialization logic specifically for bfloat16 tensors. The current implementation incorrectly reinterprets bfloat16 bit patterns as float16, which leads to corrupted data. This issue is not caught by the existing tests because they only verify round-trip consistency, where the symmetric corruption cancels itself out. I've provided a detailed comment with a code suggestion to fix this critical bug.
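
The concern raised above can be illustrated without any tensor library. The sketch below is not the code under review: the helper names are hypothetical, and bfloat16 is modeled by truncating a float32 to its upper 16 bits. It shows that a bfloat16 bit pattern read as float16 decodes to a different number, while a symmetric reinterpretation in both directions restores the original bits, which is why a round-trip-only test would not detect a mismatch.

```python
import struct


def float_to_bfloat16_bits(value):
    """Upper 16 bits of the float32 encoding (bfloat16 by truncation, no rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def bfloat16_bits_to_float(bits16):
    """Decode a bfloat16 bit pattern correctly by widening it back to float32."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


def bits_as_float16(bits16):
    """Decode the same 16 bits as IEEE float16 (the mismatched interpretation)."""
    (value,) = struct.unpack("<e", struct.pack("<H", bits16))
    return value


def float16_to_bits(value):
    """Encode a float16 value back into its 16-bit pattern."""
    (bits16,) = struct.unpack("<H", struct.pack("<e", value))
    return bits16


bits = float_to_bfloat16_bits(1.0)                 # 0x3F80
print(bfloat16_bits_to_float(bits))                # 1.0
print(bits_as_float16(bits))                       # 1.875
print(float16_to_bits(bits_as_float16(bits)) == bits)  # True
```

Whether the mismatch matters depends on who decodes the bytes: a client that knows to reinterpret them as bfloat16 recovers the right values, whereas one that trusts a float16 label does not.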

Signed-off-by: wang.yuqi <noooop@126.com>
@noooop
Collaborator Author

noooop commented Oct 22, 2025

@DarkLight1337

Are there any more modifications needed for this PR?

@DarkLight1337 DarkLight1337 merged commit 1f633b8 into vllm-project:main Oct 22, 2025
51 checks passed
@noooop noooop deleted the binary_response branch October 22, 2025 10:40
@noooop noooop restored the binary_response branch October 22, 2025 11:04
@noooop noooop deleted the binary_response branch October 22, 2025 11:08
usberkeley pushed a commit to usberkeley/vllm that referenced this pull request Oct 23, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
albertoperdomo2 pushed a commit to albertoperdomo2/vllm that referenced this pull request Oct 23, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: Alberto Perdomo <aperdomo@redhat.com>
kingsmad pushed a commit to kingsmad/vllm that referenced this pull request Oct 25, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
0xrushi pushed a commit to 0xrushi/vllm that referenced this pull request Oct 26, 2025
…esponse (vllm-project#27066)

Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
Signed-off-by: 0xrushi <6279035+0xrushi@users.noreply.github.com>
Labels

documentation, frontend, ready

Development

Successfully merging this pull request may close these issues.

[Feature]: Improvements to front-end embedding response

4 participants