
[RFC]: Unified Input Formatting and Processing via Renderer #22880

@ywang96

Description


Motivation.

vLLM’s current input processing pipeline has grown complex and fragmented. Tokenization, chat template formatting, and multimodal input handling are scattered across multiple components (e.g. Processor, InputPreprocessor, MultiModalContentParser, etc.). For online serving, prompt formatting, media fetching, and tokenization take place at the API server layer, whereas multimodal input processing happens inside AsyncLLM. This requires handling different combinations of input types and therefore introduces unnecessary complexity.

Another issue with the current processing logic is that it is tightly coupled to Hugging Face tokenizers/transformers, which makes it non-trivial for model developers to support custom models on vLLM.

As vLLM shifts its focus toward being more model-developer friendly and easier to hack on, this refactoring effort aims to reduce these layers of abstraction and this fragmentation, and to make it easier for model providers to plug in custom input handling implementations that are not based on Hugging Face tokenizers/transformers.

Proposed Change.

In this joint proposal by @WoosukKwon @DarkLight1337 @huachenheli and me, we propose a new component called Renderer, which serves as an input processor class that unifies Tokenizer, Processor, and InputPreprocessor. The Renderer’s responsibility is to format and convert high-level API requests (in “OpenAI-style” JSON spec) into token ids, multimodal features, and related metadata to be directly consumed by EngineCore, which means all input preparation is handled in one place.

By moving these steps out of AsyncLLM and into the Renderer (which by default lives at the API server layer, but can also run as a standalone remote process), we achieve a cleaner separation: the Renderer turns request specs into tokens and features, and AsyncLLM deals with tokens-in, tokens-out.

Another important benefit of this separation is that the Renderer's implementation does not have to be based on Hugging Face Transformers (or even Python), as long as the implementation complies with the interface.

Below is roughly what the interface of Renderer looks like:

from typing import Any, Optional


class Renderer:
    def __init__(self, model_config):
        # Initialize with the model-specific tokenizer and processor, and any other
        # objects required by the model, as defined by the model vendor.
        self.tokenizer = ...
        self.image_processor = ...
        ...

    def render_conversation(
        self,
        convo: list[ChatCompletionMessageParam],
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        """
        Convert an OpenAI-style request (chat or completion) into prompt token ids and,
        optionally, multimodal features with metadata.
        """
        # The model vendor/developer has full control over how to define this
        # conversion logic, but typically it looks like:
        # 1. Convert the list of messages into a single string-format prompt.
        # 2. Tokenize the prompt into prompt token ids.
        # 3. (Optional) Fetch multimodal media contents and process them into features
        #    (inputs to the multimodal encoder) with metadata, and expand placeholder
        #    tokens in the prompt token ids.
        pass

    def render_prompt(
        self,
        prompt: str,
        multi_modal_data: Optional["MultiModalDataDict"] = None,
        mm_processor_kwargs: Optional[dict[str, Any]] = None,
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        """
        Used for the `.generate` endpoint, and can be reused inside `render_conversation` if defined.
        """
        pass
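
For illustration, here is a rough sketch of how the serving layer might drive this interface; the request object and the downstream call into AsyncLLM are shown schematically and are not part of the proposed API:

# Hypothetical serving-layer usage: the Renderer materializes everything
# EngineCore needs, so AsyncLLM only ever sees tokens and features.
renderer = Renderer(model_config)

prompt_token_ids, mm_features = renderer.render_conversation(request.messages)

# Hand the pre-rendered inputs to AsyncLLM (call shown schematically;
# the exact signature is not part of this proposal).
output = await async_llm.generate_from_tokens(
    prompt_token_ids=prompt_token_ids,
    mm_features=mm_features,
    sampling_params=sampling_params,
)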

One core data class that encapsulates multimodal-related inputs is MultiModalFeatureSpec.

from collections.abc import Sequence
from dataclasses import dataclass
from typing import Optional, Union

import torch


@dataclass
class MultiModalFeatureSpec:
    """
    Encapsulates the multimodal-related inputs and metadata required by EngineCore.

    A MultiModalFeatureSpec corresponds to an individual multimodal data item (e.g. one image).
    For instance, a request with 5 images will return a list of 5 MultiModalFeatureSpec's.
    """

    data: Optional[dict[str, Union[torch.Tensor, Sequence[torch.Tensor]]]]  # e.g. {"pixel_values": ..., "image_grid_thw": ...}
    modality: str  # based on the input, e.g. "image"
    mm_identifier: str  # mm_hash or uuid
    mm_position: PlaceholderRange  # e.g. PlaceholderRange(offset=2, length=336)
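
As a concrete illustration of the shape of these objects (the tensor shapes, identifiers, and offsets below are made up, and the PlaceholderRange import path is assumed), a request with two images would produce a list of two specs:

import torch
from vllm.multimodal.inputs import PlaceholderRange  # import path assumed

# One spec per multimodal item; a request with two images yields two specs.
mm_features = [
    MultiModalFeatureSpec(
        data={"pixel_values": torch.rand(1, 3, 336, 336),
              "image_grid_thw": torch.tensor([[1, 24, 24]])},
        modality="image",
        mm_identifier="sha256:ab12...",  # mm_hash or uuid
        mm_position=PlaceholderRange(offset=2, length=336),
    ),
    MultiModalFeatureSpec(
        data=None,  # e.g. a cache hit: the processed features already live on P1
        modality="image",
        mm_identifier="sha256:cd34...",
        mm_position=PlaceholderRange(offset=340, length=336),
    ),
]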

The dataflow looks like the following:
[Dataflow diagram attached to the original issue]

Q: Why is it called Renderer?
A: We took inspiration for the naming from https://github.com/openai/harmony.

Q: Where does the Renderer run?
A: By default, the Renderer runs in P0 at the same layer as the API server, but it can also be hosted as a standalone remote process. See #22817 for more details on disaggregated input processing.

Q: Does this work with EPD Disaggregation?
A: Yes - developers can choose to include the multimodal encoder as part of the Renderer and process media into embeddings remotely; the rest should then work out of the box with how vLLM currently supports image_embeds as one of the MultiModalFeatureSpec.data kwargs for a selection of models.
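
As a hedged sketch of what that could look like inside a vendor-defined Renderer (remote_encoder_client and the helper methods below are hypothetical, not part of this proposal):

    def _encode_remotely(self, media_item) -> MultiModalFeatureSpec:
        # Hypothetical: send the media to a remote encoder service and get embeddings back,
        # so the engine consumes image_embeds instead of raw pixel features.
        image_embeds = self.remote_encoder_client.encode(media_item)
        return MultiModalFeatureSpec(
            data={"image_embeds": image_embeds},
            modality="image",
            mm_identifier=self._hash(media_item),          # hypothetical helper
            mm_position=self._compute_placeholder(media_item),  # hypothetical helper
        )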

Q: How does this work with the current llm.generate API?
A: By default, render_prompt simply tokenizes the prompt for text-only models. For multimodal models, it's typically easier to use the llm.chat interface for end-to-end inference; otherwise, it's the model vendor's responsibility to implement render_prompt.
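
For a text-only model, a minimal default render_prompt could be as simple as the following sketch (assuming a Hugging Face-style tokenizer with an encode method):

    def render_prompt(
        self,
        prompt: str,
        multi_modal_data=None,
        mm_processor_kwargs=None,
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        # Default text-only path: plain tokenization, no multimodal features.
        token_ids = self.tokenizer.encode(prompt)
        return token_ids, None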

Q: How does multimodal input caching work?
A: As discussed in #22198, P0 does not need to store any actual MultiModalFeatureSpec.data, but simply MultiModalFeatureSpec.mm_identifier in an LRU map of strings. This cache should have a matching set of keys to the P1 input cache where the processed inputs are actually stored. We will provide this hook for model developers to include in their render_conversation and/or render_prompt definitions if they choose to enable this feature.

One example implementation looks like the following:

    def __init__(self, model_config):
        self.cache = LRUCache(model_config.mm_processor_cache_gb)
        ...

    def render_conversation(
        self,
        convo: list[ChatCompletionMessageParam],
    ) -> tuple[list[int], Optional[list[MultiModalFeatureSpec]]]:
        ...
        mm_features: list[MultiModalFeatureSpec] = []
        for media_item in fetched_media_items:
            mm_identifier = hash(media_item)
            if self.cache.get(mm_identifier):
                # Cache hit: the processed inputs already live in the P1 cache,
                # so only the identifier and metadata are sent.
                mm_features.append(MultiModalFeatureSpec(
                    data=None,
                    modality=xxx,
                    mm_identifier=mm_identifier,
                    mm_position=yyyy,
                ))
            else:
                mm_inputs = self.processor(media_item)
                # Importantly, the cache only accounts for the size of mm_inputs here;
                # it does not actually store the processed inputs on P0.
                self.cache.put(mm_identifier, mm_inputs)
                mm_features.append(MultiModalFeatureSpec(
                    data=mm_inputs,
                    modality=xxx,
                    mm_identifier=mm_identifier,
                    mm_position=yyyy,
                ))
        ...
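
To make the P0 side of this concrete, below is a minimal sketch of a size-tracking LRU map that stores only identifiers plus the byte size of the corresponding P1 entry; the class and method names are illustrative and not part of the proposal:

from collections import OrderedDict


class KeyOnlyLRUCache:
    """Tracks which mm_identifiers are cached on P1, without storing tensors on P0."""

    def __init__(self, capacity_gb: float):
        self.capacity_bytes = int(capacity_gb * 1024**3)
        self.total_bytes = 0
        self._entries: OrderedDict[str, int] = OrderedDict()  # mm_identifier -> size in bytes

    def get(self, mm_identifier: str) -> bool:
        if mm_identifier not in self._entries:
            return False
        self._entries.move_to_end(mm_identifier)  # refresh LRU order
        return True

    def put(self, mm_identifier: str, mm_inputs) -> None:
        # Only the size is accounted for on P0; the processed inputs
        # themselves live in the matching P1 cache.
        size = sum(t.nbytes for t in mm_inputs.values() if hasattr(t, "nbytes"))
        if mm_identifier in self._entries:
            self.total_bytes -= self._entries[mm_identifier]
        self._entries[mm_identifier] = size
        self._entries.move_to_end(mm_identifier)
        self.total_bytes += size
        # Evict least-recently-used identifiers until under capacity.
        while self.total_bytes > self.capacity_bytes and self._entries:
            _, evicted_size = self._entries.popitem(last=False)
            self.total_bytes -= evicted_size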

Feedback Period.

No response

CC List.

@WoosukKwon @zhuohan123 @robertgshaw2-redhat @simon-mo @DarkLight1337 @Isotr0py @huachenheli @yeqcharlotte

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
