
[RFC]: Unified Input Formatting and Processing via Renderer #22880

@ywang96

Description


Motivation.

vLLM’s current input processing pipeline has grown complex and fragmented. Tokenization, chat template formatting, and multimodal input handling are scattered across multiple components (e.g. Processor, InputPreprocessor, MultiModalContentParser, etc.). For online serving, prompt formatting, media fetching, and tokenization take place at the API server layer, whereas multimodal input processing happens inside AsyncLLM. This requires handling different combinations of input types and therefore introduces unnecessary complexity.

Another issue with the current processing logic is that it is tightly coupled to Hugging Face tokenizers/transformers, which makes it non-trivial for model developers to support custom models on vLLM.

As vLLM shifts its focus toward being more model-developer friendly and easier to hack on, this refactoring effort aims to reduce these layers of abstraction and this fragmentation, and to make it easier for model providers to plug in custom input handling implementations that are not based on Hugging Face tokenizers/transformers.

Proposed Change.

In this joint proposal by @WoosukKwon @DarkLight1337 @huachenheli and me, we propose a new component called Renderer, which serves as an input processor class that unifies Tokenizer, Processor, and InputPreprocessor. The Renderer’s responsibility is to format and convert high-level API requests (in “OpenAI-style” JSON spec) into token ids, multimodal features, and related metadata to be directly consumed by EngineCore, which means all input preparation is handled in one place.

By moving these steps out of AsyncLLM and into the Renderer (which by default lives at the API server layer, but can also run as a standalone remote process), we achieve a cleaner separation: the Renderer turns request specs into tokens and features, and AsyncLLM deals with tokens-in, tokens-out.

Another important benefit of this separation is that the Renderer's implementation does not have to be based on Hugging Face Transformers (or even Python), as long as the implementation complies with the interface.

Below is roughly what the interface of Renderer looks like:

from typing import Any, Optional


class Renderer:
    def __init__(self, model_config):
        # Initialize with the model-specific tokenizer and processor, and any other
        # objects required by the model, as defined by the model vendor.
        self.tokenizer = ...
        self.image_processor = ...
        ...

    def render_conversation(
        self,
        convo: list[ChatCompletionMessageParam],
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        """
        Convert an OpenAI-style request (chat or completion) into prompt token ids and,
        optionally, multimodal features with metadata.
        """
        # The model vendor/developer has full control over how to define this
        # conversion logic, but typically it looks like:
        # 1. Convert the list of messages into a single string-format prompt.
        # 2. Tokenize the prompt into prompt token ids.
        # 3. (Optional) Fetch multimodal media contents and process them into features
        #    (inputs to the multimodal encoder) with metadata, and expand placeholder
        #    tokens in the prompt token ids.
        pass

    def render_prompt(
        self,
        prompt: str,
        multi_modal_data: Optional["MultiModalDataDict"] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        """
        Used for the `.generate` endpoint, and can be reused inside `render_conversation` if defined.
        """
        pass
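
For illustration, here is a rough sketch of how the serving layer might drive this interface; the request object and the downstream call into AsyncLLM are shown schematically and are not part of the proposed API:

# Hypothetical serving-layer usage: the Renderer materializes everything
# EngineCore needs, so AsyncLLM only ever sees tokens and features.
renderer = Renderer(model_config)

prompt_token_ids, mm_features = renderer.render_conversation(request.messages)

# Hand the pre-rendered inputs to AsyncLLM (call shown schematically;
# the exact signature is not part of this proposal).
output = await async_llm.generate_from_tokens(
    prompt_token_ids=prompt_token_ids,
    mm_features=mm_features,
    sampling_params=sampling_params,
)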

One core data class that encapsulates multimodal-related inputs is MultiModalFeatureSpec.

from collections.abc import Sequence
from dataclasses import dataclass
from typing import Optional, Union

import torch


@dataclass
class MultiModalFeatureSpec:
    """
    Encapsulates the multimodal-related inputs and metadata required by EngineCore.

    A MultiModalFeatureSpec corresponds to an individual multimodal data item (e.g. one image).
    For instance, a request with 5 images will return a list of 5 MultiModalFeatureSpec's.
    """

    data: Optional[dict[str, Union[torch.Tensor, Sequence[torch.Tensor]]]]  # e.g. {"pixel_values": ..., "image_grid_thw": ...}
    modality: str  # based on the input, e.g. "image"
    mm_identifier: str  # mm_hash or uuid
    mm_position: PlaceholderRange  # e.g. PlaceholderRange(offset=2, length=336)
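
As a concrete illustration of the shape of these objects (the tensor shapes, identifiers, and offsets below are made up, and the PlaceholderRange import path is assumed), a request with two images would produce a list of two specs:

import torch
from vllm.multimodal.inputs import PlaceholderRange  # import path assumed

# One spec per multimodal item; a request with two images yields two specs.
mm_features = [
    MultiModalFeatureSpec(
        data={"pixel_values": torch.rand(1, 3, 336, 336),
              "image_grid_thw": torch.tensor([[1, 24, 24]])},
        modality="image",
        mm_identifier="sha256:ab12...",  # mm_hash or uuid
        mm_position=PlaceholderRange(offset=2, length=336),
    ),
    MultiModalFeatureSpec(
        data=None,  # e.g. a cache hit: the processed features already live on P1
        modality="image",
        mm_identifier="sha256:cd34...",
        mm_position=PlaceholderRange(offset=340, length=336),
    ),
]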

The dataflow looks like the following:
[Dataflow diagram attached to the original issue]

Q: Why is it called Renderer?
A: We took inspiration for the naming from https://github.com/openai/harmony.

Q: Where does the Renderer run?
A: By default, the Renderer runs in P0 at the same layer as the API server, but it can also be hosted as a standalone remote process. See #22817 for more details on disaggregated input processing.

Q: Does this work with EPD Disaggregation?
A: Yes - developers can choose to include the multimodal encoder as part of the Renderer and process media into embeddings remotely; the rest should then work out of the box with how vLLM currently supports image_embeds as one of the MultiModalFeatureSpec.data kwargs for a selection of models.
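
As a hedged sketch of what that could look like inside a vendor-defined Renderer (remote_encoder_client and the helper methods below are hypothetical, not part of this proposal):

    def _encode_remotely(self, media_item) -> MultiModalFeatureSpec:
        # Hypothetical: send the media to a remote encoder service and get embeddings back,
        # so the engine consumes image_embeds instead of raw pixel features.
        image_embeds = self.remote_encoder_client.encode(media_item)
        return MultiModalFeatureSpec(
            data={"image_embeds": image_embeds},
            modality="image",
            mm_identifier=self._hash(media_item),          # hypothetical helper
            mm_position=self._compute_placeholder(media_item),  # hypothetical helper
        )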

Q: How does this work with the current llm.generate API?
A: By default, render_prompt simply tokenizes the prompt for text-only models. For multimodal models, it's typically easier to use the llm.chat interface for end-to-end inference; otherwise, it's the model vendor's responsibility to implement render_prompt.
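
For a text-only model, a minimal default render_prompt could be as simple as the following sketch (assuming a Hugging Face-style tokenizer with an encode method):

    def render_prompt(
        self,
        prompt: str,
        multi_modal_data=None,
        mm_processor_kwargs=None,
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        # Default text-only path: plain tokenization, no multimodal features.
        token_ids = self.tokenizer.encode(prompt)
        return token_ids, None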

Q: How does multimodal input caching work?
A: As discussed in #22198, P0 does not need to store any actual MultiModalFeatureSpec.data, but simply MultiModalFeatureSpec.mm_identifier in an LRU map of strings. This cache should have a matching set of keys to the P1 input cache where the processed inputs are actually stored. We will provide this hook for model developers to include in their render_conversation and/or render_prompt definitions if they choose to enable this feature.

One example implementation looks like the following:

    def __init__(self, model_config):
        self.cache = LRUCache(model_config.mm_processor_cache_gb)
        ...

    def render_conversation(
        self,
        convo: list[ChatCompletionMessageParam],
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        ...
        mm_features: list[MultiModalFeatureSpec] = []
        for media_item in fetched_media_items:
            mm_identifier = hash(media_item)
            if self.cache.get(mm_identifier):
                # Cache hit: the processed inputs already live in the P1 cache,
                # so only the identifier and metadata are sent.
                mm_features.append(MultiModalFeatureSpec(
                    data=None,
                    modality=xxx,
                    mm_identifier=mm_identifier,
                    mm_position=yyyy,
                ))
            else:
                mm_inputs = self.processor(media_item)
                # Importantly, the cache only accounts for the size of mm_inputs here;
                # it does not actually store the processed inputs on P0.
                self.cache.put(mm_identifier, mm_inputs)
                mm_features.append(MultiModalFeatureSpec(
                    data=mm_inputs,
                    modality=xxx,
                    mm_identifier=mm_identifier,
                    mm_position=yyyy,
                ))
        ...
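
To make the P0 side of this concrete, below is a minimal sketch of a size-tracking LRU map that stores only identifiers plus the byte size of the corresponding P1 entry; the class and method names are illustrative and not part of the proposal:

from collections import OrderedDict


class KeyOnlyLRUCache:
    """Tracks which mm_identifiers are cached on P1, without storing tensors on P0."""

    def __init__(self, capacity_gb: float):
        self.capacity_bytes = int(capacity_gb * 1024**3)
        self.total_bytes = 0
        self._entries: OrderedDict[str, int] = OrderedDict()  # mm_identifier -> size in bytes

    def get(self, mm_identifier: str) -> bool:
        if mm_identifier not in self._entries:
            return False
        self._entries.move_to_end(mm_identifier)  # refresh LRU order
        return True

    def put(self, mm_identifier: str, mm_inputs) -> None:
        # Only the size is accounted for on P0; the processed inputs
        # themselves live in the matching P1 cache.
        size = sum(t.nbytes for t in mm_inputs.values() if hasattr(t, "nbytes"))
        if mm_identifier in self._entries:
            self.total_bytes -= self._entries[mm_identifier]
        self._entries[mm_identifier] = size
        self._entries.move_to_end(mm_identifier)
        self.total_bytes += size
        # Evict least-recently-used identifiers until under capacity.
        while self.total_bytes > self.capacity_bytes and self._entries:
            _, evicted_size = self._entries.popitem(last=False)
            self.total_bytes -= evicted_size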

Feedback Period.

No response

CC List.

@WoosukKwon @zhuohan123 @robertgshaw2-redhat @simon-mo @DarkLight1337 @Isotr0py @huachenheli @yeqcharlotte

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
