
Conversation

@christian-pinto
Contributor

@christian-pinto christian-pinto commented Jul 25, 2025

Purpose

This PR proposes an initial implementation of hidden states processors via plugins, in the context of #16052 and #12249.

What I propose is that hidden states can be processed before returning the request output to the user, by means of loadable plugins. Plugins are defined in the same spirit as platform and generic plugins; here I propose adding a new plugin group, vllm.hidden_states_processor_plugins.
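For concreteness, an out-of-tree package could register a processor under this group with a standard setuptools entry point. A minimal sketch follows; the package, module, and class names are hypothetical, and only the entry-point group name comes from this PR.

# setup.py of a hypothetical plugin package.
from setuptools import setup

setup(
    name="my-hidden-states-plugin",
    version="0.1.0",
    packages=["my_plugin"],
    entry_points={
        # The proposed plugin group; the left-hand name is how the plugin is
        # selected, the right-hand side points at the processor class.
        "vllm.hidden_states_processor_plugins": [
            "my_processor = my_plugin.processor:MyHiddenStatesProcessor",
        ],
    },
)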

In the code I define an example default identity plugin that just returns the hidden states as they are, and I have an example OOT plugin here.
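To make the identity behavior concrete, a minimal sketch of such a plugin could look like the following; the class name and the process() hook are assumptions, not the interface actually defined in this PR.

import torch


class IdentityHiddenStatesProcessor:
    """Example processor that returns the hidden states unchanged."""

    def process(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Identity: pass the model's hidden states straight through.
        return hidden_states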

In the current implementation I only support pooling models, but we can support others once I understand whether it actually makes sense to process hidden states for text-generating models. If we go down that path, should we accumulate hidden states for the whole sequence? That seems like a lot of data to me, so I would still only process the final hidden states.

Hidden states processors are a per-model-instance feature (i.e., all requests will be processed with the same plugin). I define an env variable VLLM_USE_HIDDEN_STATES_PROCESSOR that can be used to select which plugin to instantiate at model loading time in case multiple plugins are available. If the variable is not set, the "first" available plugin is loaded.
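A minimal sketch of that selection logic, assuming the plugins are exposed through the entry-point group named above and Python 3.10+ entry-point selection (the helper name is illustrative only):

import os
from importlib.metadata import entry_points


def load_hidden_states_processor():
    eps = entry_points(group="vllm.hidden_states_processor_plugins")
    requested = os.environ.get("VLLM_USE_HIDDEN_STATES_PROCESSOR")
    if requested is not None:
        # Instantiate the plugin the user explicitly asked for.
        return eps[requested].load()()
    # Variable not set: fall back to the first available plugin.
    return next(iter(eps)).load()()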

Also, in this implementation I process the hidden states in the output processor to avoid blocking the model runner. Talking to @maxdebayser, one could also think of using a thread pool to decouple the execution of these tasks from other activities.
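The thread-pool idea could look roughly like this; processor.process() is the same hypothetical hook as in the identity sketch above, and the integration point in the output processor is assumed:

from concurrent.futures import ThreadPoolExecutor

# Pool shared by the output processor so hidden-states post-processing does
# not block the loop that builds request outputs.
_executor = ThreadPoolExecutor(max_workers=2)


def submit_hidden_states(processor, hidden_states):
    # Returns a Future; the output processor collects the result when it
    # finalizes the corresponding request output.
    return _executor.submit(processor.process, hidden_states)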

Comments, suggestions, and ideas are all welcome.

@DarkLight1337 @maxdebayser @youkaichao @mgazz

Test Plan

I will add proper tests once we are OK with the implementation.

Test Result

(Optional) Documentation Update

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a plugin system for processing hidden states, which is a great addition for extensibility. The implementation is mostly solid, but I've found a few critical issues related to error handling in the plugin loader and a potential regression in the model runner that could break KV cache transfer. Addressing these will improve the robustness and correctness of the new feature.

Comment on lines +50 to +51
except Exception:
    pass
Contributor

high

The broad except Exception: pass will silently ignore any errors during plugin loading, which can make debugging very difficult. For instance, if a plugin's entry point has a bug, it will be skipped without any notification. It's better to log the exception to provide visibility into loading failures.

Suggested change
except Exception:
    pass
except Exception:
    logger.warning("Failed to load hidden states processor plugin '%s'.",
                   name,
                   exc_info=True)

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@DarkLight1337
Member

QQ: Why can't we use custom poolers to achieve the same thing?

@DarkLight1337 DarkLight1337 self-assigned this Jul 28, 2025
@christian-pinto
Contributor Author

QQ: Why can't we use custom poolers to achieve the same thing?

The reason I am proposing it this way is the case where a single request is split into multiple child requests, and the hidden states have to be processed all at once when the requests are completed.

Let me give an example with the Prithvi model (I believe this applies to other models too): the user submits an image (a geotiff, a special type of .tiff file) to vLLM, and after pre-processing a number of child requests are generated for the model, depending on the batch size used for pre-processing. Only once all requests linked to that one initial request are processed can I re-build the output frame with the HiddenStatesProcessor. If I were to do it at the pooling stage (and correct me if I am wrong), I would not be able to, because vLLM might be batching together data from different requests, and the output transformations might not necessarily be applied across requests. Also, I would not have visibility into whether all the child requests are completed, and therefore could not be sure the output can be processed.

Doing it in the output processor gives me the impression that I could more easily aggregate the hidden states for all child requests and then process them at once.
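A rough sketch of that aggregation idea (the parent/child bookkeeping here is hypothetical, not this PR's actual data structures):

from collections import defaultdict

# Hidden states collected per parent request id.
_pending = defaultdict(list)


def on_child_finished(parent_id, child_hidden_states, expected_children, processor):
    _pending[parent_id].append(child_hidden_states)
    if len(_pending[parent_id]) == expected_children:
        # All child requests are done: process the aggregated states at once.
        return processor.process(_pending.pop(parent_id))
    return None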

Is your concern related to having to pass the hidden states between processes like what happens in this proposal?

@DarkLight1337
Member

DarkLight1337 commented Jul 28, 2025

The pooler in V1 can be applied to multiple requests at once, where the hidden_states comes in a list where each element corresponds to one request. It is only applied on the final hidden states (which in a generative model, would otherwise be converted into logits)

cc @maxdebayser as he should be able to provide more detailed information regarding this

@christian-pinto
Contributor Author

The pooler in V1 can be applied to multiple requests at once, where the hidden_states comes in a list where each element corresponds to one request. It is only applied on the final hidden states (which in a generative model, would otherwise be converted into logits)

cc @maxdebayser as he should be able to provide more detailed information regarding this

Apologies, I thought I had replied to this one.

The pooler is applied to a list depending on the number of scheduled tokens (gpu_model_runner.py:1371). The match with requests is instead done later in the scheduler, in update_from_output(). In the case of the Prithvi model I see a 1-to-1 match between requests and scheduled tokens, since I schedule one fake token with every request. So in general, yes, the hidden states processing could be applied in the pooler.

The questions I still have open are:

  • What if one request ends up spawning a number of child requests? I am planning on doing this after pre-processing the image, with one child request handling a batch of input patches. Having the hidden states in the pooler would not work in this case, because the child requests are reconciled only in the output processor, and I need the processor to be applied to the output of all patches altogether. This could be the case for other models as well going forward if we attract more multi-modal-output ones.
  • In your initial proposal of hidden states processors I see you had the processor generate a tensor in the output. In my view the processor should be allowed to return any object: I could see a processor generating an image object, or even writing it to a file (offline inference). Any thoughts on this?

@DarkLight1337
Member

What if one request ends up spawning a number of child requests?

From my understanding, that's just not how the scheduler works. There is a strict one-to-one relationship between prompt and request.

The processor should be allowed to return any object

I think the major roadblock to this is the serialization/deserialization process which currently only supports a limited set of data types. It would be difficult to extend it to "arbitrary" types, not just in terms of code complexity, but also because of security reasons.

@DarkLight1337
Member

Maybe @WoosukKwon @ywang96 can provide more insights re: scheduling

@christian-pinto
Contributor Author

What if one request ends up spawning a number of child requests?

From my understanding, that's just not how the scheduler works. There is a strict one-to-one relationship between prompt and request.

I see there are cases where a request fans out into a number of child requests right after the input processor is applied; see AsyncLLM.add_request(). The scheduler treats those as separate requests and they are re-aggregated in the output processor. That is what I had in mind to exploit in the future.

The processor should be allowed to return any object

I think the major roadblock to this is the serialization/deserialization process which currently only supports a limited set of data types. It would be difficult to extend it to "arbitrary" types, not just in terms of code complexity, but also because of security reasons.

This and the above are the reasons why I am proposing to process the hidden states in the OutputProcessor.

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

Oh, I see what you mean now. You want to aggregate the requests at AsyncLLM level instead of inside the model runner? That's outside the scope of the scheduler which is inside EngineCore.

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

In that case I think the naming of hidden state processors is a bit confusing, since I usually associate it with the model runner (normally, the hidden states are converted into logits inside the model runner).

Instead we can call this HiddenStatesAggregator or OutputAggregator

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

If that's the case, I think it would be less intrusive to simply have a wrapper around the LLM class:

class LLMForImagePatches:
    llm: LLM

    def _split_image(self, image):
        ...

    def _get_inputs(self, image_patch):
        return {"prompt_token_ids": [1], "multi_modal_data": {"pixel_values": ..., "location_coords": ...}}

    def _aggregate_outputs(self, outputs):
        ...

    def predict(self, image):
        patches = self._split_image(image)
        outputs = self.llm.encode([self._get_inputs(patch) for patch in patches])
        return self._aggregate_outputs(outputs)

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

There is no performance benefit from implementing this inside LLM/AsyncLLM because it's in the same process as the user code / API server.

@christian-pinto
Contributor Author

Oh, I see what you mean now. You want to aggregate the requests at AsyncLLM level instead of inside the model runner? That's outside the scope of the scheduler which is inside EngineCore.

Exactly. In an upcoming PR I will update the input processor for the Prithvi model so that I can provide an image as input and, depending on the batching to be used, generate a number of child requests in the AsyncLLM engine.

In that case I think the naming of hidden state processors is a bit confusing since I usually associate it with model runner (normally, the hidden states are converted into logits inside the model runner).
Instead we can call this OutputAggregator

OK, I see your point now. And yes, I want to aggregate the hidden states for one request (and its child requests, if any) and post-process them into whatever format the user wants. This implies that either I apply the custom output processor to the aggregated pooling output, or I extract the hidden states and apply the custom processor to their aggregation.
I can go ahead with processing the pooler output instead of extracting the hidden states, unless people find it useful to both apply a pooler and post-process the hidden states into something else.

    MultiModalConfig.mm_processor_kwargs
disable_mm_preprocessor_cache: bool = \
    MultiModalConfig.disable_mm_preprocessor_cache
process_hidden_states: bool = False
Member

With this change, the default is only defined in one place

Suggested change
process_hidden_states: bool = False
process_hidden_states: bool = ModelConfig.process_hidden_states

@mergify

mergify bot commented Jul 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @christian-pinto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2025
@christian-pinto
Contributor Author

If that's the case, I think it would be less intrusive to simply have a wrapper around the LLM class:

class LLMForImagePatches:
    llm: LLM

    def _split_image(self, image):
        ...

    def _get_inputs(self, image_patch):
        return {"prompt_token_ids": [1], "multi_modal_data": {"pixel_values": ..., "location_coords": ...}}

    def _aggregate_outputs(self, outputs):
        ...

    def predict(self, image):
        patches = self._split_image(image)
        outputs = self.llm.encode([self._get_inputs(patch) for patch in patches])
        return self._aggregate_outputs(outputs)

This could be an alternative approach, yes. However, what does not fully convince me yet about this approach is that the input processing is done in a separate processor, while there is already a BaseMultiModalProcessor that could be enhanced to perform the image splitting and everything else. Let me sleep on it :D

Just to verify my understanding: right now, when a user wants to use a vision model in offline mode, they have to load the image with PIL and pass it to the LLMEngine, right? Whereas in serving mode this is done in the server?

@DarkLight1337
Member

Just to verify my understanding, right now, when a user wants to use a vision model in offline mode they have to load the image with PIL and pass it to the LLMEngine. Right? While in serving mode this is done in the server?

Yes. But as in the existing example code for your model, you can pass in pixel_values etc. directly if the multi-modal processor supports it.
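In offline mode that flow might look like the sketch below, following the placeholder style of the earlier snippet; the pixel_values and location_coords fields are taken from the example above, and the model choice is left to the user.

from vllm import LLM

llm = LLM(model=...)  # e.g. a checkpoint whose processor accepts raw tensors

outputs = llm.encode({
    "prompt_token_ids": [1],
    "multi_modal_data": {
        "pixel_values": ...,       # precomputed tensors, no PIL needed
        "location_coords": ...,
    },
})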

integer, it is used as the level of compilation optimization. If it
is a dictionary, it can specify the full compilation configuration.
process_hidden_states: If True, it loads the hidden states processor
and to process the hiddne states for each request before returning
Contributor

typo: hiddne

@maxdebayser
Contributor

@DarkLight1337, I was talking about this the other day with @christian-pinto. The idea here is a little bit different from a custom pooler. In my mind, the pooler running within the GPU model runner should transform the hidden states that come out of the model into tensors or scalars. It shouldn't spend a lot of time running expensive operations on the GPU, as that would slow everything down.
In his case here, he wants to transform the pooler output into TIFF images to return to the user directly. So I suggested that we add support for plugins at the output processor level that run on the CPU and outside of the model runner process. But in theory, the plugins could also go in at the entrypoint level.
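As an illustration of the kind of CPU-side work such a plugin would do (not this PR's actual code), converting a pooled tensor into a TIFF with Pillow could look like:

import numpy as np
import torch
from PIL import Image


def pooled_output_to_tiff(pooled: torch.Tensor, path: str) -> None:
    # Assumes the pooled output encodes a 2D raster (e.g. a segmentation map).
    arr = pooled.detach().cpu().numpy().astype(np.uint8)
    Image.fromarray(arr).save(path, format="TIFF")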

@christian-pinto
Contributor Author

Thanks @maxdebayser, that's exactly it. I basically want to add plugins so that we can generate multi-modal output right away from vLLM.

The reason I extract the hidden states in addition to pooling is that I am not sure whether people might still want to run a pooler as well as convert the hidden states into something else. Is this a possibility?

If not, I can reduce this PR to just applying the custom output plugins to the pooler output in the output processor, without returning the hidden states back to the frontend process. I find this operation really belongs in the OutputProcessor. Also, in this way there would not really be any penalty for models not using these processors. @DarkLight1337, would this be more in line with your thinking?

@DarkLight1337
Member

Yeah, I think we should avoid returning the full hidden states to the API process unless necessary, since that would result in quite a bit of communication overhead.

@christian-pinto
Contributor Author

OK, let me work out a version that works on the pooler output then.

@christian-pinto
Contributor Author

Apologies for the silence on this PR, I had to switch to another task for a few days. I'll resume next week.

- "transformers" will use the Transformers model implementation."""
override_attention_dtype: Optional[str] = None
"""Override dtype for attention"""
process_hidden_states: Optional[bool] = False
Member

If this can never be None, we shouldn't hint it as Optional

Suggested change
process_hidden_states: Optional[bool] = False
process_hidden_states: bool = False

@christian-pinto
Contributor Author

Hey @DarkLight1337, I have created a new PR here: #22820

In the end I ended up preferring to add this support at the entrypoint level, because it also makes everybody's life easier for enabling serving mode as well.

@hmellor
Member

hmellor commented Aug 30, 2025

Closing as superseded by #22820

@hmellor hmellor closed this Aug 30, 2025