[Core] Hidden State Processors via plugins #21621
Conversation
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.
Code Review
This PR introduces a plugin system for processing hidden states, which is a great addition for extensibility. The implementation is mostly solid, but I've found a few critical issues related to error handling in the plugin loader and a potential regression in the model runner that could break KV cache transfer. Addressing these will improve the robustness and correctness of the new feature.
    except Exception:
        pass
The broad except Exception: pass will silently ignore any errors during plugin loading, which can make debugging very difficult. For instance, if a plugin's entry point has a bug, it will be skipped without any notification. It's better to log the exception to provide visibility into loading failures.
Suggested change:
-    except Exception:
-        pass
+    except Exception:
+        logger.warning("Failed to load hidden states processor plugin '%s'.",
+                       name,
+                       exc_info=True)
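For illustration, here is a minimal sketch of what an entry-point loader with that kind of logging could look like (the function name and the exact loading flow are assumptions, not the PR's actual code; only the entry-point group name comes from this PR):

```python
# Hypothetical sketch of a plugin loader that logs failures instead of
# silently skipping broken entry points (Python 3.10+ selectable entry points).
import logging
from importlib.metadata import entry_points

logger = logging.getLogger(__name__)


def load_hidden_states_processor_plugins(
        group: str = "vllm.hidden_states_processor_plugins") -> dict:
    plugins = {}
    for ep in entry_points(group=group):
        try:
            plugins[ep.name] = ep.load()
        except Exception:
            # Keep loading the remaining plugins, but leave a trace so that
            # broken entry points are debuggable.
            logger.warning(
                "Failed to load hidden states processor plugin '%s'.",
                ep.name, exc_info=True)
    return plugins
```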
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
QQ: Why can't we use custom poolers to achieve the same thing?
The reason I am proposing it this way is for the case where a single request is split into multiple child requests, and the hidden states have to be processed all at once when the requests are completed. Let me give an example with the Prithvi model (I believe it applies to other models too): the user submits an image (geotiff, a special type of .tiff file) to vLLM, and after the pre-processing a number of child requests are generated for the model, depending on the batch size used for pre-processing. All requests linked to that one initial request are processed, and only then can I re-build the output frame with the HiddenStatesProcessor. If I were to do it at the pooling stage (and correct me if I am wrong), I would not be able to, because vLLM might be batching together data from different requests and the output transformations would not necessarily be applied across requests. Also, I would not have visibility on whether all the child requests are completed, and therefore would not be sure whether the output can be processed. Doing it in the output processor gives me the impression that I could more easily aggregate hidden states for all child requests and then process them at once. Is your concern related to having to pass the hidden states between processes, like what happens in this proposal?
The pooler in V1 can be applied to multiple requests at once. cc @maxdebayser as he should be able to provide more detailed information regarding this.
Apologies, I thought I had replied to this one. The pooler is applied to a list depending on the number of scheduled tokens. The questions I have open are: …
From my understanding, that's just not how the scheduler works. There is a strict one-to-one relationship between prompt and request.
I think the major roadblock to this is the serialization/deserialization process, which currently only supports a limited set of data types. It would be difficult to extend it to "arbitrary" types, not just in terms of code complexity, but also for security reasons.
Maybe @WoosukKwon @ywang96 can provide more insights re: scheduling |
I see there are cases where a request fans out into a number of child requests right after the input processor is applied. See AsyncLLM.add_request(). The scheduler treats those as separate requests and they are re-aggregated back in the output processor. That is what I had in mind to exploit in the future.
This and the above are the reasons why I am proposing processing the hidden states in the OutputProcessor.
Oh, I see what you mean now. You want to aggregate the requests at the output processor level.
In that case I think the naming of hidden state processors is a bit confusing, since I usually associate it with the model runner (normally, the hidden states are converted into logits inside the model runner). Instead we can call this …
If that's the case, I think it would be less intrusive to simply have a wrapper around …
There is no performance benefit from implementing this inside …
Exactly: in an upcoming PR I will update the input processor for the Prithvi model so that I can provide an image as input and, depending on the batching to be used, generate a number of child requests in the AsyncLLM engine.
OK, I see your point now. And yes, I want to aggregate the hidden states for one request (and its child requests, if available) and post-process them into whatever format the user wants. This implies that either I apply the custom output processor to the aggregated pooling output, or I extract the hidden states and apply the custom processor to their aggregation.
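To make the aggregation idea concrete, here is a rough sketch of collecting per-child outputs and running a processor once all children of a parent request have finished (the class, method names, and bookkeeping are hypothetical, not the actual vLLM OutputProcessor API):

```python
# Hypothetical sketch: buffer per-child outputs and post-process them in one
# shot once every child of a parent request has completed.
from collections import defaultdict


class ParentRequestAggregator:

    def __init__(self, processor, expected_children: dict[str, int]):
        self.processor = processor                  # user-provided plugin
        self.expected_children = expected_children  # parent_id -> child count
        self.buffers = defaultdict(list)            # parent_id -> outputs so far

    def on_child_finished(self, parent_id: str, child_output):
        self.buffers[parent_id].append(child_output)
        if len(self.buffers[parent_id]) < self.expected_children[parent_id]:
            return None  # still waiting for the remaining children
        # All children are done: e.g. reassemble Prithvi tiles into one frame.
        outputs = self.buffers.pop(parent_id)
        return self.processor.apply(outputs)
```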
        MultiModalConfig.mm_processor_kwargs
    disable_mm_preprocessor_cache: bool = \
        MultiModalConfig.disable_mm_preprocessor_cache
    process_hidden_states: bool = False
With this change, the default is only defined in one place
Suggested change:
-    process_hidden_states: bool = False
+    process_hidden_states: bool = ModelConfig.process_hidden_states
This pull request has merge conflicts that must be resolved before it can be merged.
This could be an alternative approach, yes. However, what is not fully convincing me yet with this approach is that the input processing is done in a separate processor while there is already a … Just to verify my understanding: right now, when a user wants to use a vision model in offline mode, they have to load the image with PIL and pass it to the LLMEngine, right? While in serving mode this is done in the server?
Yes. But like in the existing example code for your model, you can pass in …
        integer, it is used as the level of compilation optimization. If it
        is a dictionary, it can specify the full compilation configuration.
    process_hidden_states: If True, it loads the hidden states processor
        and to process the hiddne states for each request before returning
typo: hiddne
@DarkLight1337, I was talking about this the other day with @christian-pinto. The idea here is a little bit different from a custom pooler. In my mind, the pooler running within the GPU model runner should transform the hidden states that come out of the model into tensors or scalars. It shouldn't spend a lot of time running expensive operations on the GPU, as that would slow down everything.
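As a rough illustration of that division of labor (not vLLM's actual Pooler interface), the in-runner step stays a cheap tensor reduction, while anything expensive is deferred to a processor that runs outside the model runner:

```python
# Illustrative only: keep the step that runs with the model a cheap tensor op,
# and defer expensive conversions to code that runs outside the model runner.
import torch


def cheap_mean_pool(hidden_states: torch.Tensor) -> torch.Tensor:
    # Runs alongside the model: a single reduction over the token dimension.
    return hidden_states.mean(dim=0)


def expensive_postprocess(pooled: torch.Tensor) -> bytes:
    # Runs later (e.g. in the output processor): move to CPU and convert to a
    # user-facing format without stalling the GPU model runner.
    return pooled.float().cpu().numpy().tobytes()
```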
Thanks @maxdebayser, that's exactly it. I basically want to add plugins so that we can generate multimodal output right away from vLLM. The reason I extract the hidden states in addition to pooling is that I am not sure whether people might still want to run a pooler as well as convert the hidden states into something else. Is this a possibility? If not, I can reduce this PR to just applying the custom output plugins to the pooler output in the output processor, without returning the hidden states back to the API process. I find that this operation really belongs in the OutputProcessor. Also, in this way there would not really be any penalty for those models not using these processors. @DarkLight1337, would this be more in line with your thinking?
Yeah I think we should avoid returning the full hidden states to the API process unless necessary, since that would result in quite a bit of communication overhead |
OK, let me work out a version that works on the pooler output then. |
Apologies for the silence on this PR, I had to switch to another task for a few days. I'll resume next week. |
| - "transformers" will use the Transformers model implementation.""" | ||
| override_attention_dtype: Optional[str] = None | ||
| """Override dtype for attention""" | ||
| process_hidden_states: Optional[bool] = False |
If this can never be None, we shouldn't hint it as Optional
Suggested change:
-    process_hidden_states: Optional[bool] = False
+    process_hidden_states: bool = False
Hey @DarkLight1337, I have created a new PR here: #22820. In the end I ended up liking adding this support at the entrypoint level, because it also makes everybody's life easier when it comes to enabling serving mode as well.
Closing as superseded by #22820 |
Purpose
This PR proposes an initial implementation of hidden states processors via plugins. This PR is in the context of #16052 and #12249.
What I propose is that hidden states can be processed before returning the request output to the user, by means of loadable plugins. Plugins are defined in the same spirit as platform and generic plugins; here I propose adding a new plugins group, vllm.hidden_states_processor_plugins. In the code I am defining an example default identity plugin, which just returns the hidden states as they are, and I have an example OOT plugin here.
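For illustration, a minimal sketch of what such an out-of-tree plugin could look like (the class name, interface, and packaging details below are assumptions for the sketch, not the exact API from this PR; only the entry-point group name is taken from the description above):

```python
# identity_plugin.py -- hypothetical OOT plugin returning hidden states as-is.
import torch


class IdentityHiddenStatesProcessor:

    def apply(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Identity behaviour: hand the hidden states back unchanged.
        return hidden_states


# The plugin package would advertise the class under the proposed entry-point
# group, e.g. in its setup.py:
#
#   setup(
#       name="my-hidden-states-plugin",
#       py_modules=["identity_plugin"],
#       entry_points={
#           "vllm.hidden_states_processor_plugins": [
#               "identity = identity_plugin:IdentityHiddenStatesProcessor",
#           ],
#       },
#   )
```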
In the current implementation I only support pooling models, but we can support others once I understand whether it actually makes sense to process hidden states for text-generating models. If we go down that way, should we accumulate hidden states for the whole sequence? That seems like a lot of data to me, and I would still only do it for the final ones.
Hidden states processors are a per-model-instance feature (i.e., all requests will be processed with the same plugin). I define an env variable, VLLM_USE_HIDDEN_STATES_PROCESSOR, that can be used to select which plugin to instantiate at model loading time in case multiple plugins are available. If the variable is not set, the "first" available plugin is loaded.
Also, in this implementation I process the hidden states in the output processor to avoid blocking the model runner. Talking to @maxdebayser, one could also think of using a thread pool to decouple the execution of these tasks from other activities.
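A sketch of how that env-var-based selection could work (the fallback behaviour follows the description above; the function itself is hypothetical):

```python
# Hypothetical selection logic: honour VLLM_USE_HIDDEN_STATES_PROCESSOR if it
# is set, otherwise fall back to the first discovered plugin.
import os


def select_hidden_states_processor(plugins: dict):
    if not plugins:
        return None
    requested = os.environ.get("VLLM_USE_HIDDEN_STATES_PROCESSOR")
    if requested is not None:
        # A KeyError here signals that the requested plugin is not installed.
        return plugins[requested]
    # No explicit choice: instantiate the "first" available plugin.
    return next(iter(plugins.values()))
```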
Comments, suggestions, ideas are all welcome
@DarkLight1337 @maxdebayser @youkaichao @mgazz
Test Plan
I will add proper tests once we are OK with the implementation
Test Result
(Optional) Documentation Update