
Conversation

@christian-pinto
Contributor

@christian-pinto christian-pinto commented Jul 25, 2025

Purpose

This PR proposes an initial implementation of hidden states processors via plugins, in the context of #16052 and #12249.

What I propose is that hidden states can be processed before returning the request output to the user, by means of loadable plugins. Plugins are defined in the same spirit as platform and generic plugins; here I propose adding a new plugin group, vllm.hidden_states_processor_plugins.
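For concreteness, an out-of-tree package could register a processor under this group with a standard setuptools entry point. A minimal sketch follows; the package, module, and class names are hypothetical, and only the entry-point group name comes from this PR.

# setup.py of a hypothetical plugin package.
from setuptools import setup

setup(
    name="my-hidden-states-plugin",
    version="0.1.0",
    packages=["my_plugin"],
    entry_points={
        # The proposed plugin group; the left-hand name is how the plugin is
        # selected, the right-hand side points at the processor class.
        "vllm.hidden_states_processor_plugins": [
            "my_processor = my_plugin.processor:MyHiddenStatesProcessor",
        ],
    },
)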

In the code I define an example default identity plugin that just returns the hidden states as they are, and I have an example OOT plugin here.
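To make the identity behavior concrete, a minimal sketch of such a plugin could look like the following; the class name and the process() hook are assumptions, not the interface actually defined in this PR.

import torch


class IdentityHiddenStatesProcessor:
    """Example processor that returns the hidden states unchanged."""

    def process(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Identity: pass the model's hidden states straight through.
        return hidden_states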

In the current implementation I only support pooling models, but we can support others once I understand whether it actually makes sense to process hidden states for text-generating models. If we go down that path, should we accumulate hidden states for the whole sequence? That seems like a lot of data to me, so I would still only process the final hidden states.

Hidden states processors are a per-model-instance feature (i.e., all requests will be processed with the same plugin). I define an env variable VLLM_USE_HIDDEN_STATES_PROCESSOR that can be used to select which plugin to instantiate at model loading time in case multiple plugins are available. If the variable is not set, the "first" available plugin is loaded.
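A minimal sketch of that selection logic, assuming the plugins are exposed through the entry-point group named above and Python 3.10+ entry-point selection (the helper name is illustrative only):

import os
from importlib.metadata import entry_points


def load_hidden_states_processor():
    eps = entry_points(group="vllm.hidden_states_processor_plugins")
    requested = os.environ.get("VLLM_USE_HIDDEN_STATES_PROCESSOR")
    if requested is not None:
        # Instantiate the plugin the user explicitly asked for.
        return eps[requested].load()()
    # Variable not set: fall back to the first available plugin.
    return next(iter(eps)).load()()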

Also, in this implementation I process the hidden states in the output processor to avoid blocking the model runner. Talking to @maxdebayser, one could also think of using a thread pool to decouple the execution of these tasks from other activities.
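The thread-pool idea could look roughly like this; processor.process() is the same hypothetical hook as in the identity sketch above, and the integration point in the output processor is assumed:

from concurrent.futures import ThreadPoolExecutor

# Pool shared by the output processor so hidden-states post-processing does
# not block the loop that builds request outputs.
_executor = ThreadPoolExecutor(max_workers=2)


def submit_hidden_states(processor, hidden_states):
    # Returns a Future; the output processor collects the result when it
    # finalizes the corresponding request output.
    return _executor.submit(processor.process, hidden_states)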

Comments, suggestions, and ideas are all welcome.

@DarkLight1337 @maxdebayser @youkaichao @mgazz

Test Plan

I will add proper tests once we are OK with the implementation.

Test Result

(Optional) Documentation Update

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces a plugin system for processing hidden states, which is a great addition for extensibility. The implementation is mostly solid, but I've found a few critical issues related to error handling in the plugin loader and a potential regression in the model runner that could break KV cache transfer. Addressing these will improve the robustness and correctness of the new feature.

Comment on lines +50 to +51
except Exception:
    pass
Contributor

high

The broad except Exception: pass will silently ignore any errors during plugin loading, which can make debugging very difficult. For instance, if a plugin's entry point has a bug, it will be skipped without any notification. It's better to log the exception to provide visibility into loading failures.

Suggested change
except Exception:
    pass
except Exception:
    logger.warning("Failed to load hidden states processor plugin '%s'.",
                   name,
                   exc_info=True)

Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
@DarkLight1337
Member

QQ: Why can't we use custom poolers to achieve the same thing?

@DarkLight1337 DarkLight1337 self-assigned this Jul 28, 2025
@christian-pinto
Contributor Author

QQ: Why can't we use custom poolers to achieve the same thing?

The reason I am proposing it this way is the case where a single request is split into multiple child requests, and the hidden states have to be processed all at once when the requests are completed.

Let me give an example with the Prithvi model (I believe this applies to other models too): the user submits an image (a geotiff, a special type of .tiff file) to vLLM, and after pre-processing a number of child requests are generated for the model, depending on the batch size used for pre-processing. Only once all requests linked to that one initial request are processed can I re-build the output frame with the HiddenStatesProcessor. If I were to do it at the pooling stage (and correct me if I am wrong), I would not be able to, because vLLM might be batching together data from different requests, and the output transformations might not necessarily be applied across requests. Also, I would not have visibility into whether all the child requests are completed, and therefore could not be sure the output can be processed.

Doing it in the output processor gives me the impression that I could more easily aggregate the hidden states for all child requests and then process them at once.
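A rough sketch of that aggregation idea (the parent/child bookkeeping here is hypothetical, not this PR's actual data structures):

from collections import defaultdict

# Hidden states collected per parent request id.
_pending = defaultdict(list)


def on_child_finished(parent_id, child_hidden_states, expected_children, processor):
    _pending[parent_id].append(child_hidden_states)
    if len(_pending[parent_id]) == expected_children:
        # All child requests are done: process the aggregated states at once.
        return processor.process(_pending.pop(parent_id))
    return None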

Is your concern related to having to pass the hidden states between processes like what happens in this proposal?

@DarkLight1337
Member

DarkLight1337 commented Jul 28, 2025

The pooler in V1 can be applied to multiple requests at once, where the hidden_states comes in a list where each element corresponds to one request. It is only applied on the final hidden states (which in a generative model, would otherwise be converted into logits)

cc @maxdebayser as he should be able to provide more detailed information regarding this

@christian-pinto
Contributor Author

The pooler in V1 can be applied to multiple requests at once, where the hidden_states comes in a list where each element corresponds to one request. It is only applied on the final hidden states (which in a generative model, would otherwise be converted into logits)

cc @maxdebayser as he should be able to provide more detailed information regarding this

Apologies, I thought I had replied to this one.

The pooler is applied to a list depending on the number of scheduled tokens (gpu_model_runner.py:1371). The match with requests is instead done later in the scheduler, in update_from_output(). In the case of the Prithvi model I see a 1-to-1 match between requests and scheduled tokens, since I schedule one fake token with every request. So in general, yes, the hidden states processing could be applied in the pooler.

The questions I still have open are:

  • What if one request ends up spawning a number of child requests? I am planning on doing this after pre-processing the image, with one child request handling a batch of input patches. Having the hidden states in the pooler would not work in this case, because the child requests are reconciled only in the output processor, and I need the processor to be applied to the output of all patches altogether. This could be the case for other models as well going forward if we attract more multi-modal-output ones.
  • In your initial proposal of hidden states processors I see you had the processor generate a tensor in the output. In my view the processor should be allowed to return any object: I could see a processor generating an image object, or even writing it to a file (offline inference). Any thoughts on this?

@DarkLight1337
Member

What if one request ends up spawning a number of child requests?

From my understanding, that's just not how the scheduler works. There is a strict one-to-one relationship between prompt and request.

The processor should be allowed to return any object

I think the major roadblock to this is the serialization/deserialization process which currently only supports a limited set of data types. It would be difficult to extend it to "arbitrary" types, not just in terms of code complexity, but also because of security reasons.

@DarkLight1337
Member

Maybe @WoosukKwon @ywang96 can provide more insights re: scheduling

@christian-pinto
Contributor Author

What if one request ends up spawning a number of child requests?

From my understanding, that's just not how the scheduler works. There is a strict one-to-one relationship between prompt and request.

I see there are cases where a request fans out into a number of child requests right after the input processor is applied; see AsyncLLM.add_request(). The scheduler treats those as separate requests and they are re-aggregated in the output processor. That is what I had in mind to exploit in the future.

The processor should be allowed to return any object

I think the major roadblock to this is the serialization/deserialization process which currently only supports a limited set of data types. It would be difficult to extend it to "arbitrary" types, not just in terms of code complexity, but also because of security reasons.

This and the above are the reasons why I am proposing to process the hidden states in the OutputProcessor.

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

Oh, I see what you mean now. You want to aggregate the requests at AsyncLLM level instead of inside the model runner? That's outside the scope of the scheduler which is inside EngineCore.

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

In that case I think the naming of hidden state processors is a bit confusing, since I usually associate it with the model runner (normally, the hidden states are converted into logits inside the model runner).

Instead we can call this HiddenStatesAggregator or OutputAggregator

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

If that's the case, I think it would be less intrusive to simply have a wrapper around the LLM class:

class LLMForImagePatches:
    llm: LLM

    def _split_image(self, image):
        ...

    def _get_inputs(self, image_patch):
        return {"prompt_token_ids": [1], "multi_modal_data": {"pixel_values": ..., "location_coords": ...}}

    def _aggregate_outputs(self, outputs):
        ...

    def predict(self, image):
        patches = self._split_image(image)
        outputs = self.llm.encode([self._get_inputs(patch) for patch in patches])
        return self._aggregate_outputs(outputs)

@DarkLight1337
Member

DarkLight1337 commented Jul 29, 2025

There is no performance benefit from implementing this inside LLM/AsyncLLM because it's in the same process as the user code / API server.

@christian-pinto
Contributor Author

Oh, I see what you mean now. You want to aggregate the requests at AsyncLLM level instead of inside the model runner? That's outside the scope of the scheduler which is inside EngineCore.

Exactly. In an upcoming PR I will update the input processor for the Prithvi model so that I can provide an image as input and, depending on the batching to be used, generate a number of child requests in the AsyncLLM engine.

In that case I think the naming of hidden state processors is a bit confusing since I usually associate it with model runner (normally, the hidden states are converted into logits inside the model runner).
Instead we can call this OutputAggregator

OK, I see your point now. And yes, I want to aggregate the hidden states for one request (and its child requests, if any) and post-process them into whatever format the user wants. This implies that either I apply the custom output processor to the aggregated pooling output, or I extract the hidden states and apply the custom processor to their aggregation.
I can go ahead with processing the pooler output instead of extracting the hidden states, unless people find it useful to both apply a pooler and post-process the hidden states into something else.

    MultiModalConfig.mm_processor_kwargs
disable_mm_preprocessor_cache: bool = \
    MultiModalConfig.disable_mm_preprocessor_cache
process_hidden_states: bool = False
Member

With this change, the default is only defined in one place

Suggested change
process_hidden_states: bool = False
process_hidden_states: bool = ModelConfig.process_hidden_states

@mergify

mergify bot commented Jul 29, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @christian-pinto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jul 29, 2025
@christian-pinto
Contributor Author

If that's the case, I think it would be less intrusive to simply have a wrapper around the LLM class:

class LLMForImagePatches:
    llm: LLM

    def _split_image(self, image):
        ...

    def _get_inputs(self, image_patch):
        return {"prompt_token_ids": [1], "multi_modal_data": {"pixel_values": ..., "location_coords": ...}}

    def _aggregate_outputs(self, outputs):
        ...

    def predict(self, image):
        patches = self._split_image(image)
        outputs = self.llm.encode([self._get_inputs(patch) for patch in patches])
        return self._aggregate_outputs(outputs)

This could be an alternative approach, yes. However, what does not fully convince me yet about this approach is that the input processing is done in a separate processor, while there is already a BaseMultiModalProcessor that could be enhanced to perform the image splitting and everything else. Let me sleep on it :D

Just to verify my understanding: right now, when a user wants to use a vision model in offline mode, they have to load the image with PIL and pass it to the LLMEngine, right? Whereas in serving mode this is done in the server?

@DarkLight1337
Member

Just to verify my understanding, right now, when a user wants to use a vision model in offline mode they have to load the image with PIL and pass it to the LLMEngine. Right? While in serving mode this is done in the server?

Yes. But as in the existing example code for your model, you can pass in pixel_values etc. directly if the multi-modal processor supports it.
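In offline mode that flow might look like the sketch below, following the placeholder style of the earlier snippet; the pixel_values and location_coords fields are taken from the example above, and the model choice is left to the user.

from vllm import LLM

llm = LLM(model=...)  # e.g. a checkpoint whose processor accepts raw tensors

outputs = llm.encode({
    "prompt_token_ids": [1],
    "multi_modal_data": {
        "pixel_values": ...,       # precomputed tensors, no PIL needed
        "location_coords": ...,
    },
})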

integer, it is used as the level of compilation optimization. If it
is a dictionary, it can specify the full compilation configuration.
process_hidden_states: If True, it loads the hidden states processor
and to process the hiddne states for each request before returning
Contributor

typo: hiddne

@maxdebayser
Contributor

@DarkLight1337, I was talking about this the other day with @christian-pinto. The idea here is a little bit different from a custom pooler. In my mind, the pooler running within the GPU model runner should transform the hidden states that come out of the model into tensors or scalars. It shouldn't spend a lot of time running expensive operations on the GPU, as that would slow everything down.
In his case here, he wants to transform the pooler output into TIFF images to return to the user directly. So I suggested that we add support for plugins at the output processor level that run on the CPU and outside of the model runner process. But in theory, the plugins could also go in at the entrypoint level.
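As an illustration of the kind of CPU-side work such a plugin would do (not this PR's actual code), converting a pooled tensor into a TIFF with Pillow could look like:

import numpy as np
import torch
from PIL import Image


def pooled_output_to_tiff(pooled: torch.Tensor, path: str) -> None:
    # Assumes the pooled output encodes a 2D raster (e.g. a segmentation map).
    arr = pooled.detach().cpu().numpy().astype(np.uint8)
    Image.fromarray(arr).save(path, format="TIFF")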

@christian-pinto
Contributor Author

Thanks @maxdebayser, that's exactly it. I basically want to add plugins so that we can generate multi-modal output right away from vLLM.

The reason I extract the hidden states in addition to pooling is that I am not sure whether people might still want to run a pooler as well as convert the hidden states into something else. Is this a possibility?

If not, I can reduce this PR to just applying the custom output plugins to the pooler output in the output processor, without returning the hidden states back to the frontend process. I find this operation really belongs in the OutputProcessor. Also, in this way there would not really be any penalty for models not using these processors. @DarkLight1337, would this be more in line with your thinking?

@DarkLight1337
Member

Yeah, I think we should avoid returning the full hidden states to the API process unless necessary, since that would result in quite a bit of communication overhead.

@christian-pinto
Contributor Author

OK, let me work out a version that works on the pooler output then.

@christian-pinto
Contributor Author

Apologies for the silence on this PR, I had to switch to another task for a few days. I'll resume next week.

- "transformers" will use the Transformers model implementation."""
override_attention_dtype: Optional[str] = None
"""Override dtype for attention"""
process_hidden_states: Optional[bool] = False
Member

If this can never be None, we shouldn't hint it as Optional

Suggested change
process_hidden_states: Optional[bool] = False
process_hidden_states: bool = False

@christian-pinto
Contributor Author

Hey @DarkLight1337, I have created a new PR here: #22820

In the end I ended up preferring to add this support at the entrypoint level, because it also makes everybody's life easier for enabling serving mode as well.

@hmellor
Member

hmellor commented Aug 30, 2025

Closing as superseded by #22820

@hmellor hmellor closed this Aug 30, 2025