[Model][-/N] Improve all pooling task | Support chunked prefill with ALL pooling #27145
Conversation
Signed-off-by: wang.yuqi <noooop@126.com>
Code Review
This pull request enables chunked prefill for ALL pooling tasks, which is a great step towards improving performance for long prompts. The changes involve modifying the AllPool layer to cache hidden states during chunked prefill and updating related metadata and data structures. A new test is also added to validate this functionality.
I've found a couple of critical issues that need to be addressed. One is a mutable default argument in PoolingParams which can lead to shared state and incorrect behavior. The other is an incorrect return type in is_partial_prefill which will cause runtime errors. Please see my detailed comments for suggestions on how to fix these.
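The mutable-default pitfall called out above can be reproduced in isolation (the class names here are hypothetical stand-ins, not the actual vLLM PoolingParams definition):

```python
class BadParams:
    # Bug: the default list is created once, at function definition time,
    # so every instance that omits `cache` shares the same list object.
    def __init__(self, cache=[]):
        self.cache = cache


class GoodParams:
    # Fix: use None as a sentinel and build a fresh list per instance.
    def __init__(self, cache=None):
        self.cache = [] if cache is None else cache


a, b = BadParams(), BadParams()
a.cache.append("chunk-0")
print(b.cache)  # state leaks between "requests": ['chunk-0']

c, d = GoodParams(), GoodParams()
c.cache.append("chunk-0")
print(d.cache)  # []
```

With the shared default, hidden states cached for one request would silently appear in every other request that used the default, which is exactly the kind of incorrect behavior the review warns about.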
💡 Codex Review
Here are some automated review suggestions for this pull request.
/gemini review
Code Review
This pull request introduces support for chunked prefill with ALL pooling, which is a valuable enhancement. The core logic in AllPool is updated to cache hidden states across chunks and process them upon completion of prefill. A corresponding test case has been added to validate this new functionality. While the overall approach is sound, I've identified a few critical issues. Notably, a change to is_partial_prefill appears to break other pooling methods. There's also a potential memory leak and correctness bug in AllPool due to the hidden states cache not being cleared. Lastly, a minor typing issue in a method could lead to unexpected behavior. Please see my detailed comments for suggestions on how to address these points.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
Code Review
This pull request enables chunked prefill for ALL pooling tasks, which was previously unsupported. The changes involve modifying the AllPool layer to cache hidden states across multiple prefill steps and return the complete hidden states only when a request is finished. This is a good step towards more flexible pooling operations. The related metadata and worker logic have been updated accordingly, and a new test case is added to verify the functionality. My review focuses on improving code clarity and maintainability.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
cc @DarkLight1337 @maxdebayser Ready for review. I came up with a simple method to support chunked prefill with ALL pooling, without modifying too much of the core code.
Code Review
This pull request introduces support for chunked prefill with ALL pooling tasks, which was previously unsupported. The changes are well-structured and involve modifying the AllPool layer to cache hidden states across prefill chunks, updating PoolingParams and PoolingMetadata to support this, and adjusting the GPUModelRunner to handle the new logic. A new test case is also added to validate the implementation.
My main feedback is on a design choice in PoolingParams. I've left a comment with a suggestion for a more robust implementation that better separates data transfer objects from runtime state. Overall, this is a good step towards more flexible and efficient pooling operations.
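The chunk-caching behavior these reviews describe can be sketched roughly as follows (a pure-Python stand-in; in vLLM the cache holds torch tensors and the names here are illustrative, not the real AllPool API):

```python
class AllPoolSketch:
    """ALL pooling under chunked prefill: buffer each chunk's hidden states
    and emit the full sequence only once the prefill is complete."""

    def pool_chunk(self, cache, chunk_hidden, is_last_chunk):
        # cache: per-request list of chunks; chunk_hidden: this chunk's rows
        cache.append(chunk_hidden)
        if not is_last_chunk:
            return None  # partial prefill: no output yet
        full = [row for chunk in cache for row in chunk]
        cache.clear()  # release buffered states to avoid a memory leak
        return full


pool = AllPoolSketch()
cache = []
out = None
for chunk, last in [([[1, 2], [3, 4]], False), ([[5, 6]], True)]:
    out = pool.pool_chunk(cache, chunk, last)
print(out)  # [[1, 2], [3, 4], [5, 6]]
```

Clearing the cache after emitting the full hidden states is the step one review flags as missing; without it, the buffered chunks would persist after the request completes.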
# If enable_prefix_caching is enabled,
# the output of all pooling will be less than n_prompt_tokens,
# we need a method to disable prefix_caching at the request level.
enable_prefix_caching=False,
max_num_batched_tokens=chunk_size,
Another strange problem was discovered at the same time.
I want ALL pooling requests to skip prefix_caching at the request level, even when enable_prefix_caching=True. Otherwise, the output hidden_states would contain fewer than n_prompt_tokens tokens.
However, I don't want to completely disable prefix_caching when using all pooling, because enabling prefix_caching can accelerate scenarios where generation and returning prompt hidden states are performed simultaneously.
PTAL #24288 (comment)
If you turn on Automatic Prefix Caching and submit LLM.encode(..., pooling_task="token_embed") first, then submit LLM.generate, I think there should be no cost in making the two separate calls.
Signed-off-by: wang.yuqi <noooop@126.com>
/gemini review
Code Review
This pull request successfully enables chunked prefill for ALL pooling tasks, which was previously unsupported. This is a valuable enhancement for models utilizing token_embed or token_classify. The implementation, including the stateful handling of hidden states in AllPool and updates to related data structures, appears correct and well-integrated. The tests have been appropriately updated and new ones added to validate this new capability. I've identified one minor issue in a new test case where an assertion doesn't seem to match its intended purpose. Other than that, this is a solid contribution.
@maxdebayser can you help review this? Thanks
Signed-off-by: wang.yuqi <noooop@126.com>
Signed-off-by: wang.yuqi <noooop@126.com>
Thanks for completing this. Back when I added support in V1 for pooling and chunked prefill, I left support for the stateful poolers as a future exercise. My main concern in this PR is that the PoolingParams object is not a good place to store this state. I have two different suggestions of places where the state could go:
1. The pooling metadata object could have a state dict that could be used by the poolers. The GPU input batch would have to be modified to handle persistent pooling metadata objects as well as the pooling params objects. The advantage here is that in the case of request cancellation we can easily remove the pooling metadata.
2. If we add the request ID to the pooling metadata, then the pooler can have a dict mapping request ID to state. The only challenge would be getting notified of request cancellation so the state can be removed.
BTW, it's easy to extend the chunked prefill support to mean pooling as well. It doesn't even need to store the full hidden states; only the current average and the count of tokens have to be stored.
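The running-average idea for mean pooling mentioned above can be sketched like this (plain Python lists standing in for tensors; the names are illustrative):

```python
class ChunkedMeanPool:
    """Mean pooling across prefill chunks: keep only a running sum and a
    token count instead of buffering every chunk's hidden states."""

    def __init__(self, hidden_size):
        self.acc = [0.0] * hidden_size  # O(hidden_size) state per request
        self.n_tokens = 0

    def update(self, chunk_rows):
        # Accumulate each row of this chunk's hidden states.
        for row in chunk_rows:
            for i, v in enumerate(row):
                self.acc[i] += v
        self.n_tokens += len(chunk_rows)

    def finalize(self):
        # Mean over all prompt tokens, regardless of how they were chunked.
        return [s / self.n_tokens for s in self.acc]


pool = ChunkedMeanPool(2)
pool.update([[1.0, 2.0], [3.0, 4.0]])  # first prefill chunk
pool.update([[5.0, 6.0]])              # second (final) chunk
print(pool.finalize())  # [3.0, 4.0]
```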
+1 I can't seem to find a better place to store it either. I tried storing it in the request, but that would require too many code changes, mainly because something needs to be notified to delete the stored data to prevent memory leaks. The primary benefit of storing the data in PoolingParams is that it is always garbage-collected along with the request, without requiring any (or almost any) modifications to the core code.
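The garbage-collection point can be demonstrated in a toy form: state attached to a per-request params object disappears as soon as the request's bookkeeping entry is dropped, with no explicit cleanup hook needed (the class and dict names here are hypothetical):

```python
import gc
import weakref


class ParamsWithCache:
    def __init__(self):
        self.hidden_states_cache = []  # per-request chunk buffer


in_flight = {}  # request ID -> params, as the input batch keeps them
params = ParamsWithCache()
in_flight["req-1"] = params
probe = weakref.ref(params)

del params
del in_flight["req-1"]  # request finishes (or is cancelled)
gc.collect()
print(probe() is None)  # True: the cache was collected with the request
```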
I think the easiest of the suggestions would be to add the state to the PoolingMetadata. Currently the input batch has a dict of request ID to PoolingParams.
vllm/vllm/v1/worker/gpu_input_batch.py, lines 834 to 841 in bfe0b4b
Each executor step creates a new PoolingMetadata. If hidden_states_cache lived in PoolingMetadata, it would need to be moved to the new object every step. More importantly, between two executions some requests might not be scheduled, and there would be nowhere to keep those unscheduled requests' hidden_states_cache. Mean pooling + chunked prefill can be considered later; in fact, we don't have a model that uses mean pooling + chunked prefill yet.
Signed-off-by: wang.yuqi <noooop@126.com>
This pull request has merge conflicts that must be resolved before it can be merged.
Signed-off-by: wang.yuqi <noooop@126.com>
Improve all pooling task
These PRs mostly conflict with each other, so combining them into a series better informs reviewers about what happened and what else needs to be done afterwards.
Purpose
Address:
Let's support chunked prefill with ALL pooling
We are moving further towards supporting both generation and Returning Prompt Hidden States simultaneously
Test Plan
tests/models/language/pooling/test_all_pooling_plus_chunked_prefill.py
Test Result
pass
Essential Elements of an Effective PR Description Checklist
Update supported_models.md and examples for a new model.