[Model][-/N] Improve all pooling task | Support chunked prefill with ALL pooling #27145
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
Code Review
This pull request enables chunked prefill for ALL pooling tasks, which is a great step towards improving performance for long prompts. The changes involve modifying the AllPool layer to cache hidden states during chunked prefill and updating related metadata and data structures. A new test is also added to validate this functionality.
I've found a couple of critical issues that need to be addressed. One is a mutable default argument in PoolingParams which can lead to shared state and incorrect behavior. The other is an incorrect return type in is_partial_prefill which will cause runtime errors. Please see my detailed comments for suggestions on how to fix these.
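The mutable-default pitfall called out above can be reproduced in isolation (the class names here are hypothetical stand-ins, not the actual vLLM PoolingParams definition):

```python
class BadParams:
    # Bug: the default list is created once, at function definition time,
    # so every instance that omits `cache` shares the same list object.
    def __init__(self, cache=[]):
        self.cache = cache


class GoodParams:
    # Fix: use None as a sentinel and build a fresh list per instance.
    def __init__(self, cache=None):
        self.cache = [] if cache is None else cache


a, b = BadParams(), BadParams()
a.cache.append("chunk-0")
print(b.cache)  # state leaks between "requests": ['chunk-0']

c, d = GoodParams(), GoodParams()
c.cache.append("chunk-0")
print(d.cache)  # []
```

With the shared default, hidden states cached for one request would silently appear in every other request that used the default, which is exactly the kind of incorrect behavior the review warns about.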
💡 Codex Review
Here are some automated review suggestions for this pull request.
/gemini review
Code Review
This pull request introduces support for chunked prefill with ALL pooling, which is a valuable enhancement. The core logic in AllPool is updated to cache hidden states across chunks and process them upon completion of prefill. A corresponding test case has been added to validate this new functionality. While the overall approach is sound, I've identified a few critical issues. Notably, a change to is_partial_prefill appears to break other pooling methods. There's also a potential memory leak and correctness bug in AllPool due to the hidden states cache not being cleared. Lastly, a minor typing issue in a method could lead to unexpected behavior. Please see my detailed comments for suggestions on how to address these points.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
Code Review
This pull request enables chunked prefill for ALL pooling tasks, which was previously unsupported. The changes involve modifying the AllPool layer to cache hidden states across multiple prefill steps and return the complete hidden states only when a request is finished. This is a good step towards more flexible pooling operations. The related metadata and worker logic have been updated accordingly, and a new test case is added to verify the functionality. My review focuses on improving code clarity and maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
cc @DarkLight1337 @maxdebayser Ready for review. I came up with a simple method to support chunked prefill with ALL pooling, without modifying too much of the core code.
Code Review
This pull request introduces support for chunked prefill with ALL pooling tasks, which was previously unsupported. The changes are well-structured and involve modifying the AllPool layer to cache hidden states across prefill chunks, updating PoolingParams and PoolingMetadata to support this, and adjusting the GPUModelRunner to handle the new logic. A new test case is also added to validate the implementation.
My main feedback is on a design choice in PoolingParams. I've left a comment with a suggestion for a more robust implementation that better separates data transfer objects from runtime state. Overall, this is a good step towards more flexible and efficient pooling operations.
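The chunk-caching behavior these reviews describe can be sketched roughly as follows (a pure-Python stand-in; in vLLM the cache holds torch tensors and the names here are illustrative, not the real AllPool API):

```python
class AllPoolSketch:
    """ALL pooling under chunked prefill: buffer each chunk's hidden states
    and emit the full sequence only once the prefill is complete."""

    def pool_chunk(self, cache, chunk_hidden, is_last_chunk):
        # cache: per-request list of chunks; chunk_hidden: this chunk's rows
        cache.append(chunk_hidden)
        if not is_last_chunk:
            return None  # partial prefill: no output yet
        full = [row for chunk in cache for row in chunk]
        cache.clear()  # release buffered states to avoid a memory leak
        return full


pool = AllPoolSketch()
cache = []
out = None
for chunk, last in [([[1, 2], [3, 4]], False), ([[5, 6]], True)]:
    out = pool.pool_chunk(cache, chunk, last)
print(out)  # [[1, 2], [3, 4], [5, 6]]
```

Clearing the cache after emitting the full hidden states is the step one review flags as missing; without it, the buffered chunks would persist after the request completes.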
# If enable_prefix_caching is enabled,
# the output of all pooling will be less than n_prompt_tokens,
# we need a method to disable prefix_caching at the request level.
enable_prefix_caching=False,
max_num_batched_tokens=chunk_size,
Another strange problem was discovered at the same time.
I want ALL pooling requests to skip prefix_caching at the request level, even when enable_prefix_caching=True. Otherwise, the output hidden_states would contain fewer than n_prompt_tokens tokens.
However, I don't want to completely disable prefix_caching when using all pooling, because enabling prefix_caching can accelerate scenarios where generation and returning prompt hidden states are performed simultaneously.
PTAL #24288 (comment)
If you turn on Automatic Prefix Caching and submit LLM.encode(..., pooling_task="token_embed") first, then submit LLM.generate, I think there should be no cost in making the two separate calls.
Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
Code Review
This pull request successfully enables chunked prefill for ALL pooling tasks, which was previously unsupported. This is a valuable enhancement for models utilizing token_embed or token_classify. The implementation, including the stateful handling of hidden states in AllPool and updates to related data structures, appears correct and well-integrated. The tests have been appropriately updated and new ones added to validate this new capability. I've identified one minor issue in a new test case where an assertion doesn't seem to match its intended purpose. Other than that, this is a solid contribution.
@maxdebayser can you help review this? Thanks
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Thanks for completing this. Back when I added support in V1 for pooling and chunked prefill, I left support for the stateful poolers as a future exercise. My main concern in this PR is that the PoolingParams object is not a good place to store this state. I have two different suggestions of places where the state could go:
1. The pooling metadata object could have a state dict that could be used by the poolers. The GPU input batch would have to be modified to handle persistent pooling metadata objects as well as the pooling params objects. The advantage here is that in the case of request cancellation we can easily remove the pooling metadata.
2. If we add the request ID to the pooling metadata, then the pooler can have a dict mapping request ID to state. The only challenge would be getting notified of request cancellation so the state can be removed.
BTW, it's easy to extend the chunked prefill support to mean pooling as well. It doesn't even need to store the full hidden states; only the current average and the count of tokens have to be stored.
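The running-average idea for mean pooling mentioned above can be sketched like this (plain Python lists standing in for tensors; the names are illustrative):

```python
class ChunkedMeanPool:
    """Mean pooling across prefill chunks: keep only a running sum and a
    token count instead of buffering every chunk's hidden states."""

    def __init__(self, hidden_size):
        self.acc = [0.0] * hidden_size  # O(hidden_size) state per request
        self.n_tokens = 0

    def update(self, chunk_rows):
        # Accumulate each row of this chunk's hidden states.
        for row in chunk_rows:
            for i, v in enumerate(row):
                self.acc[i] += v
        self.n_tokens += len(chunk_rows)

    def finalize(self):
        # Mean over all prompt tokens, regardless of how they were chunked.
        return [s / self.n_tokens for s in self.acc]


pool = ChunkedMeanPool(2)
pool.update([[1.0, 2.0], [3.0, 4.0]])  # first prefill chunk
pool.update([[5.0, 6.0]])              # second (final) chunk
print(pool.finalize())  # [3.0, 4.0]
```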
+1 I can't seem to find a better place to store it either. I tried storing it in the request, but that would require too many code changes, mainly because something needs to be notified to delete the stored data to prevent memory leaks. The primary benefit of storing the data in PoolingParams is that it is always garbage-collected along with the request, without requiring any (or almost any) modifications to the core code.
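The garbage-collection point can be demonstrated in a toy form: state attached to a per-request params object disappears as soon as the request's bookkeeping entry is dropped, with no explicit cleanup hook needed (the class and dict names here are hypothetical):

```python
import gc
import weakref


class ParamsWithCache:
    def __init__(self):
        self.hidden_states_cache = []  # per-request chunk buffer


in_flight = {}  # request ID -> params, as the input batch keeps them
params = ParamsWithCache()
in_flight["req-1"] = params
probe = weakref.ref(params)

del params
del in_flight["req-1"]  # request finishes (or is cancelled)
gc.collect()
print(probe() is None)  # True: the cache was collected with the request
```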
I think the easiest of the suggestions would be to add the state to the PoolingMetadata. Currently the input batch has a dict of request ID to PoolingParams.
vllm/vllm/v1/worker/gpu_input_batch.py, lines 834 to 841 in bfe0b4b
Each executor step creates a new PoolingMetadata. If hidden_states_cache lived in PoolingMetadata, it would need to be moved to the new object every step. More importantly, between two executions some requests might not be scheduled, and there would be nowhere to keep those unscheduled requests' hidden_states_cache. Mean pooling + chunked prefill can be considered later; in fact, we don't have a model that uses mean pooling + chunked prefill yet.
Signed-off-by: wang.yuqi <noooop@126.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wang.yuqi <noooop@126.com>
Improve all pooling task
These PRs mostly conflict with each other, so combining them into a series better informs reviewers about what happened and what else needs to be done afterwards.
Purpose
Address:
Let's support chunked prefill with ALL pooling
We are moving further towards supporting both generation and Returning Prompt Hidden States simultaneously
Test Plan
tests/models/language/pooling/test_all_pooling_plus_chunked_prefill.py
Test Result
pass
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.