
Conversation

@noooop
Collaborator

@noooop noooop commented Oct 18, 2025

Improve all pooling task

These PRs mostly conflict with each other, so combining them into a series gives reviewers better context on what has changed and what still needs to be done afterwards.

Purpose

Address:

Let's support chunked prefill with ALL pooling

We are moving further towards supporting both generation and Returning Prompt Hidden States simultaneously

Test Plan

tests/models/language/pooling/test_all_pooling_plus_chunked_prefill.py

Test Result

pass


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: wang.yuqi <noooop@126.com>
@mergify mergify bot added the v1 label Oct 18, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables chunked prefill for ALL pooling tasks, which is a great step towards improving performance for long prompts. The changes involve modifying the AllPool layer to cache hidden states during chunked prefill and updating related metadata and data structures. A new test is also added to validate this functionality.

I've found a couple of critical issues that need to be addressed. One is a mutable default argument in PoolingParams which can lead to shared state and incorrect behavior. The other is an incorrect return type in is_partial_prefill which will cause runtime errors. Please see my detailed comments for suggestions on how to fix these.
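The mutable-default pitfall flagged above is a classic Python bug. The sketch below is a hypothetical illustration (the class names and the `hidden_states_cache` field are stand-ins, not vLLM's actual `PoolingParams`): a mutable default argument is created once at function definition time, so every request constructed without an explicit cache silently shares the same list.

```python
class BadPoolingParams:
    # BUG: the default list is created once and shared by every instance.
    def __init__(self, hidden_states_cache=[]):
        self.hidden_states_cache = hidden_states_cache


class FixedPoolingParams:
    # FIX: use a None sentinel so each instance gets its own fresh list.
    def __init__(self, hidden_states_cache=None):
        self.hidden_states_cache = (
            [] if hidden_states_cache is None else hidden_states_cache
        )


a, b = BadPoolingParams(), BadPoolingParams()
a.hidden_states_cache.append("chunk-0")
# b sees a's data -- exactly the kind of cross-request state leakage
# the review warns about.
assert b.hidden_states_cache == ["chunk-0"]

c, d = FixedPoolingParams(), FixedPoolingParams()
c.hidden_states_cache.append("chunk-0")
assert d.hidden_states_cache == []
```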

Signed-off-by: wang.yuqi <noooop@126.com>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@noooop
Collaborator Author

noooop commented Oct 18, 2025

/gemini review

Signed-off-by: wang.yuqi <noooop@126.com>
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for chunked prefill with ALL pooling, which is a valuable enhancement. The core logic in AllPool is updated to cache hidden states across chunks and process them upon completion of prefill. A corresponding test case has been added to validate this new functionality. While the overall approach is sound, I've identified a few critical issues. Notably, a change to is_partial_prefill appears to break other pooling methods. There's also a potential memory leak and correctness bug in AllPool due to the hidden states cache not being cleared. Lastly, a minor typing issue in a method could lead to unexpected behavior. Please see my detailed comments for suggestions on how to address these points.

Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop changed the title [Model][5/N] Improve all pooling task | Support chunked prefill with ALL pooling [Model][3/N] Improve all pooling task | Support chunked prefill with ALL pooling Oct 18, 2025
noooop and others added 2 commits October 18, 2025 15:58
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop
Collaborator Author

noooop commented Oct 18, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables chunked prefill for ALL pooling tasks, which was previously unsupported. The changes involve modifying the AllPool layer to cache hidden states across multiple prefill steps and return the complete hidden states only when a request is finished. This is a good step towards more flexible pooling operations. The related metadata and worker logic have been updated accordingly, and a new test case is added to verify the functionality. My review focuses on improving code clarity and maintainability.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop
Collaborator Author

noooop commented Oct 18, 2025

/gemini review

@noooop
Collaborator Author

noooop commented Oct 18, 2025

cc @DarkLight1337 @maxdebayser

Ready for review

I came up with a simple method to support chunked prefill with ALL pooling, without modifying too much core code.
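The "simple method" can be sketched roughly as follows. This is a minimal illustration with hypothetical names (plain Python lists stand in for hidden-state tensors): each prefill chunk appends its hidden states to a per-request cache, and the pooler only emits the full `(n_prompt_tokens, hidden_size)` output once prefill finishes, clearing the cache so it is released with the request.

```python
class AllPoolSketch:
    """Sketch of chunked prefill for ALL pooling (hypothetical names)."""

    def forward_chunk(self, cache, chunk_hidden, is_finished):
        cache.append(chunk_hidden)  # accumulate this chunk's hidden states
        if not is_finished:
            return None  # partial prefill: nothing to pool yet
        # Prefill complete: flatten chunks into one per-token sequence
        # and release the cache so it cannot leak.
        out = [vec for chunk in cache for vec in chunk]
        cache.clear()
        return out


pool = AllPoolSketch()
cache = []  # in the PR, the cache travels with the request's pooling params
assert pool.forward_chunk(cache, [[0.1], [0.2]], is_finished=False) is None
out = pool.forward_chunk(cache, [[0.3]], is_finished=True)
assert out == [[0.1], [0.2], [0.3]]
assert cache == []  # cleared after the final chunk
```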

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for chunked prefill with ALL pooling tasks, which was previously unsupported. The changes are well-structured and involve modifying the AllPool layer to cache hidden states across prefill chunks, updating PoolingParams and PoolingMetadata to support this, and adjusting the GPUModelRunner to handle the new logic. A new test case is also added to validate the implementation.

My main feedback is on a design choice in PoolingParams. I've left a comment with a suggestion for a more robust implementation that better separates data transfer objects from runtime state. Overall, this is a good step towards more flexible and efficient pooling operations.

Comment on lines 29 to 33
```python
# If enable_prefix_caching is enabled,
# the output of all pooling will be less than n_prompt_tokens,
# we need a method to disable prefix_caching at the request level.
enable_prefix_caching=False,
max_num_batched_tokens=chunk_size,
```
Collaborator Author


Another strange problem was discovered at the same time

Collaborator Author


@heheda12345

I want all pooling requests to not use prefix_caching, even if enable_prefix_caching=True at the request level. Otherwise, the output hidden_states would be fewer than n_prompt_tokens.

However, I don't want to completely disable prefix_caching when using all pooling, because enabling prefix_caching can accelerate scenarios where generation and returning prompt hidden states are performed simultaneously.

PTAL #24288 (comment)

With Automatic Prefix Caching turned on, submit LLM.encode(..., pooling_task="token_embed") first and then submit LLM.generate. Making two separate calls should incur no extra cost.

Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop
Collaborator Author

noooop commented Oct 18, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully enables chunked prefill for ALL pooling tasks, which was previously unsupported. This is a valuable enhancement for models utilizing token_embed or token_classify. The implementation, including the stateful handling of hidden states in AllPool and updates to related data structures, appears correct and well-integrated. The tests have been appropriately updated and new ones added to validate this new capability. I've identified one minor issue in a new test case where an assertion doesn't seem to match its intended purpose. Other than that, this is a solid contribution.

@DarkLight1337
Member

@maxdebayser can you help review this? Thanks

Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop changed the title [Model][3/N] Improve all pooling task | Support chunked prefill with ALL pooling [Model][4/N] Improve all pooling task | Support chunked prefill with ALL pooling Oct 20, 2025
@maxdebayser
Contributor

Thanks for completing this. Back when I added support in V1 for pooling and chunked prefill, I left support for the stateful poolers as a future exercise. My main concern in this PR is that the PoolingParams object is not a good place to store this state.

I have two suggestions for places where the state could go:

  1. PoolingMetadata

The pooling metadata object could have a state dict used by the poolers. The GPU input batch would have to be modified to handle persistent pooling metadata objects as well as the pooling params object. The advantage here is that, in the case of request cancellation, we can easily remove the pooling metadata.

  2. The pooler

If we add the request ID to the pooling metadata, then the pooler can keep a dict of request ID to state. The only challenge would be getting notified of request cancellation in order to remove the state.
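The pooler-side option can be sketched as below. All names here are hypothetical; the `abort()` method marks exactly where the missing cancellation hook from the thread would have to be wired in, since without it the per-request state leaks.

```python
class StatefulPoolerSketch:
    """Sketch of option 2: the pooler owns a request-ID -> state dict."""

    def __init__(self):
        self._cache: dict[str, list] = {}

    def step(self, req_id: str, chunk: list, is_finished: bool):
        cache = self._cache.setdefault(req_id, [])
        cache.extend(chunk)  # accumulate this chunk's hidden states
        if not is_finished:
            return None
        # Popping on completion frees the state for finished requests.
        return self._cache.pop(req_id)

    def abort(self, req_id: str) -> None:
        # Must be called on request cancellation, or the state leaks --
        # this notification is the open challenge noted above.
        self._cache.pop(req_id, None)


pooler = StatefulPoolerSketch()
assert pooler.step("r1", [1, 2], is_finished=False) is None
assert pooler.step("r1", [3], is_finished=True) == [1, 2, 3]
pooler.step("r2", [9], is_finished=False)
pooler.abort("r2")  # cancelled request: state is removed
```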

BTW, it's easy to extend chunked prefill support to mean pooling as well. It doesn't even need to store the full hidden states; only the running average and the token count have to be stored.
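That mean-pooling observation can be sketched with an incremental average (hypothetical names, plain lists standing in for tensors): per request, keep only a running mean vector and a token count, folding each chunk in as it arrives.

```python
class RunningMeanPool:
    """Sketch of chunked mean pooling with O(hidden_size) state."""

    def __init__(self, hidden_size: int):
        self.mean = [0.0] * hidden_size
        self.count = 0

    def update(self, chunk):
        # Fold each token vector of this chunk into the running average.
        for vec in chunk:
            self.count += 1
            for i, v in enumerate(vec):
                self.mean[i] += (v - self.mean[i]) / self.count


pool = RunningMeanPool(hidden_size=2)
pool.update([[1.0, 2.0], [3.0, 4.0]])  # first prefill chunk
pool.update([[5.0, 6.0]])              # second prefill chunk
# Equals the mean over all three tokens.
assert pool.mean == [3.0, 4.0]
assert pool.count == 3
```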

@noooop
Collaborator Author

noooop commented Oct 20, 2025

My main concern in this PR is that the PoolingParams object is not a good place to store this state.

+1

I can't find a better place to store it either. I tried storing it in the request, but that would require too many code changes, mainly because whatever holds the state needs to be notified to delete the stored data to prevent memory leaks.

The primary benefit of storing the data in PoolingParams is that it is always garbage-collected along with the request, and it requires almost no modifications to the core code.

@maxdebayser
Contributor

I think the easiest of the suggestions would be to add the state to the PoolingMetadata. Currently the input batch has a dict of request ID to PoolingParams, and when get_pooling_metadata() is called, we build a PoolingMetadata object and initialize it with a reference to the pooling params. But what if the input batch had a dict of request ID to PoolingMetadata instead, and each PoolingMetadata held a PoolingParams object?

@noooop
Collaborator Author

noooop commented Oct 21, 2025

I think the easiest of the suggestions would be to add the state to the PoolingMetadata.

```python
def get_pooling_metadata(self) -> PoolingMetadata:
    pooling_params = self.get_pooling_params()
    return PoolingMetadata(
        prompt_lens=torch.from_numpy(self.num_prompt_tokens[: self.num_reqs]),
        prompt_token_ids=self.sampling_metadata.prompt_token_ids,
        pooling_params=pooling_params,
    )
```

Each executor step creates a new PoolingMetadata. If the hidden_states_cache lived in PoolingMetadata, it would have to be carried over to the new object on every step. More importantly, between two executions some requests might not be scheduled at all, and there would be nowhere to keep those unscheduled requests' hidden_states_cache.


Mean pooling + chunked prefill can be considered later; in fact, we don't have a model that uses mean pooling + chunked prefill yet.

Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop changed the title [Model][4/N] Improve all pooling task | Support chunked prefill with ALL pooling [Model][5/N] Improve all pooling task | Support chunked prefill with ALL pooling Oct 22, 2025
@mergify

mergify bot commented Oct 27, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @noooop.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 27, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
@mergify mergify bot removed the needs-rebase label Oct 28, 2025
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
@noooop noooop changed the title [Model][5/N] Improve all pooling task | Support chunked prefill with ALL pooling [Model][-/N] Improve all pooling task | Support chunked prefill with ALL pooling Oct 28, 2025