[Frontend] Add chunked processing to handle long inputs in embedding models #22280

x22x22 · 2025-08-05T22:01:28Z

Original PR: #20837

The previous submission went unmerged for so long that it accumulated too many conflicts, so I'm resubmitting it.

…scripts, incorporating chunk processing capabilities to handle exceptionally long inputs. The README documentation has been revised to provide comprehensive instructions on usage methods and configuration options. Signed-off-by: x22x22 <wadeking@qq.com>

gemini-code-assist

Code Review

This pull request introduces chunked processing for long text embeddings, allowing vLLM to handle inputs that exceed the model's maximum context length. The changes include new configuration options in PoolerConfig, core logic for chunking and aggregation in serving_embedding.py, and several robustness improvements in serving_engine.py. Additionally, new examples are provided to demonstrate and test the new feature.

My review found one potential high-severity issue in vllm/config.py where logic for selecting the correct transformers backend for pooling models seems to have been accidentally removed. This could lead to errors when using pooling models with the transformers backend. The rest of the implementation for chunked processing appears solid and well-designed.
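To make the chunking and aggregation described in the review concrete, here is a minimal sketch. It assumes the input is already tokenized, that chunks are split at a fixed token limit, and that chunk embeddings are combined with a token-count-weighted mean. The names `split_into_chunks`, `embed_long_input`, and `embed_chunk` are hypothetical and are not taken from serving_embedding.py; the aggregation actually used in this PR may differ.

```python
from typing import Callable, Sequence

Vector = list[float]


def split_into_chunks(token_ids: Sequence[int], max_len: int) -> list[list[int]]:
    """Split token ids into consecutive chunks of at most max_len tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return [list(token_ids[i:i + max_len])
            for i in range(0, len(token_ids), max_len)]


def embed_long_input(
    token_ids: Sequence[int],
    max_len: int,
    embed_chunk: Callable[[list[int]], Vector],  # hypothetical per-chunk embedder
) -> Vector:
    """Embed each chunk separately, then take a token-count-weighted mean.

    The weighted mean is an assumption made for illustration only.
    """
    chunks = split_into_chunks(token_ids, max_len)
    if not chunks:
        raise ValueError("input must contain at least one token")

    total_tokens = len(token_ids)
    aggregated: Vector = []
    for chunk in chunks:
        vector = embed_chunk(chunk)
        if not aggregated:
            aggregated = [0.0] * len(vector)
        # Longer chunks contribute proportionally more to the final vector.
        weight = len(chunk) / total_tokens
        for j, value in enumerate(vector):
            aggregated[j] += weight * value
    return aggregated


def toy_embed(chunk: list[int]) -> Vector:
    """Stand-in embedder so the sketch runs without a model."""
    return [float(len(chunk)), 1.0]


result = embed_long_input(list(range(10)), max_len=4, embed_chunk=toy_embed)
```

Weighting by token count keeps a short trailing chunk from influencing the result as much as a full-length chunk; an unweighted mean would be the simpler alternative.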

vllm/config.py

- Restore vllm/transformers_utils/processor.py to main branch
- Restore vllm/inputs/registry.py to main branch
- Ensure all file metadata matches main branch exactly

Signed-off-by: x22x22 <wadeking@qq.com>

…edding generation. Signed-off-by: x22x22 <wadeking@qq.com>

vllm/entrypoints/openai/serving_embedding.py

Signed-off-by: x22x22 <wadeking@qq.com>

x22x22 · 2025-08-13T07:53:22Z

@maxdebayser @DarkLight1337 @hmellor
All done, thanks!

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

DarkLight1337 · 2025-08-13T08:16:08Z

Let's merge this

x22x22 · 2025-08-13T08:54:13Z

python -m mypy vllm/entrypoints/openai/serving_embedding.py --python-version 3.11
vllm/entrypoints/openai/serving_embedding.py:160: error: Incompatible return value type (got "PoolerConfig | bool | None", expected "bool")  [return-value]
Found 1 error in 1 file (checked 1 source file)

After merging main, a new mypy error appeared. I'll fix it.
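For context, mypy reports a union such as "PoolerConfig | bool | None" when a function annotated to return `bool` returns an `and` expression whose left operand is an optional object, because `x and y` evaluates to `x` itself whenever `x` is falsy. The sketch below shows one way to make such a return value a strict `bool`. The `PoolerConfig` stand-in, its `enable_chunked_processing` field, and the function name are assumptions for illustration; the code at line 160 of serving_embedding.py is not shown in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolerConfig:
    """Hypothetical stand-in for the real config class."""
    enable_chunked_processing: bool = False


def supports_chunked_processing(config: Optional[PoolerConfig]) -> bool:
    # Writing `return config and config.enable_chunked_processing` would
    # return `config` itself when it is falsy (for example None), so mypy
    # infers "PoolerConfig | bool | None" for that expression.
    # An explicit None check yields a plain bool on both branches.
    return config is not None and config.enable_chunked_processing


enabled = supports_chunked_processing(PoolerConfig(enable_chunked_processing=True))
disabled = supports_chunked_processing(None)
```

Wrapping the original expression in `bool(...)` would also satisfy the type checker; the explicit `is not None` check simply makes the intent more visible.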

Signed-off-by: x22x22 <wadeking@qq.com>

…models (vllm-project#22280) Signed-off-by: x22x22 <wadeking@qq.com> Signed-off-by: Kdump <rootshellexp@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Diego-Castan <diego.castan@ibm.com>

…models (vllm-project#22280) Signed-off-by: x22x22 <wadeking@qq.com> Signed-off-by: Kdump <rootshellexp@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

…models (vllm-project#22280) Signed-off-by: x22x22 <wadeking@qq.com> Signed-off-by: Kdump <rootshellexp@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Xiao Yu <xiao.yu@amd.com>

…models (vllm-project#22280) Signed-off-by: x22x22 <wadeking@qq.com> Signed-off-by: Kdump <rootshellexp@gmail.com> Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Maximilien de Bayser <maxdebayser@gmail.com> Co-authored-by: DarkLight1337 <tlleungac@connect.ust.hk>

x22x22 requested review from WoosukKwon, aarnphm, hmellor, houseroad, mgoin, robertgshaw2-redhat, simon-mo, tlrmchlsmth and youkaichao as code owners August 5, 2025 22:01

mergify bot added the documentation and frontend labels Aug 5, 2025

gemini-code-assist bot reviewed Aug 5, 2025

View reviewed changes

vllm/config.py (outdated, resolved)

x22x22 added 6 commits August 6, 2025 06:08

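Fix the logic for merging multimodal processor arguments so that the passed-in arguments are merged correctly. Update the related files to use the new merging approach.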

cab8200

Signed-off-by: x22x22 <wadeking@qq.com>

restore

57987aa

Signed-off-by: x22x22 <wadeking@qq.com>

Feature: Implement chunk processing and maximum embedding length configuration options to facilitate long-text input support.

8e3ba72

Signed-off-by: x22x22 <wadeking@qq.com>

restore

f24b546

Signed-off-by: x22x22 <wadeking@qq.com>

restore

b46791b

Signed-off-by: x22x22 <wadeking@qq.com>

Feature: Implementation of Chunk Processing for Embedding Requests of Extensive Texts

54c7930

Signed-off-by: x22x22 <wadeking@qq.com>