[RFC]: Multi-modality Support on vLLM #4194

@ywang96


Active Projects (help wanted!):

Update [11/18] - In the upcoming months, we will focus on performance optimization for multimodal models as part of the vLLM V1 engine re-architecture effort.

P0 (We will definitely work on them):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more important/frequently requested):


Update [9/8] - We have finished the majority of the refactoring and made extensive progress on supporting multimodal models. See details here.

Roadmap for Q3 2024

In the upcoming months, we will focus on enabling multimodal models to be compatible with other performance-related features on vLLM as well as collaborating with model vendors to directly onboard new multimodal models.

P0 (We will definitely work on them):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more important/frequently requested):


Update [7/3] - We have finished our 2nd refactoring milestone - see details here.

Roadmap for 3rd Milestone

In the upcoming months, we will focus on wrapping up the main goal of this refactoring RFC and supporting more models and modalities.

P0 (We will definitely work on these):

P1 (We should be aware of these and spend some time if possible):

P2 (We should work on these when they become more frequently requested - help wanted!):


Update [6/11] - We have finished our 1st refactoring milestone - see details here.

Roadmap for 2nd Milestone

Some of the items @DarkLight1337, @xwjiang2010, and I are looking to work on as part of the next milestone are tentatively:

API Changes: A list of user-facing breaking changes can be found here

Performance related

Model support - Add more vision-language models and improve developer-facing documentation

Some of the ideas that we should work on in the future:

  • Make VLMs work with chunked prefill
  • Unify tokenizer & multi-modal processor (so that we can leverage AutoProcessor from transformers)
  • Prefix caching for images
  • Streaming inputs of multi-modal data

As always, please provide feedback and feature requests in this issue. Suggestions and contributions are very welcome!


Original RFC

Multi-modality support was brought to vLLM recently, thanks in large part to #3042 from @xwjiang2010. Since then, we have seen an increasing amount of interest in such models (judging by the number of related pull requests and issues). However, there are a few issues we should address with the current design before we bring in more features around multi-modality.
  1. VisionLanguageConfig and MultiModalData

    • Currently the multimodal input can be either pixel_values or image_features for simplicity. While this works well with LLaVA-1.5, where pixel_values is the only output from its CLIPImageProcessor, it does not work well for models whose more complicated preprocessing returns multiple outputs (e.g., LLaVA-1.6, Fuyu, etc.). Developers could add additional preprocessing inside the model implementation as a workaround, but this will become unmaintainable over time.

    • The overhead of specifying image_feature_size, image_token_id, and image_input_shape is pushed to the user, when these can and should be inferred from the model and processor config rather than required at inference time.

  2. The current design assumes multi-modal inputs are already processed into a form the model executable can consume, but vLLM does not have a processor util. This blocks vision model support on the OpenAI API server for end-to-end inference.

  3. The current prompt format "<image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but it complicates the user experience compared to the HuggingFace format "<image>\n" + prompt, and that has caused some confusion about what's needed to make multi-modal models work on vLLM. The snippet below illustrates the two conventions.
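
As an illustration of the two prompt conventions (assuming LLaVA-1.5, whose vision encoder produces 576 image-patch features; the feature size differs per model):

# Illustration only: the two prompt conventions discussed in item 3.
IMAGE_TOKEN = "<image>"
IMAGE_FEATURE_SIZE = 576  # LLaVA-1.5: 24 x 24 patches at 336 x 336 resolution

question = "USER: What's the content of the image?\nASSISTANT:"

# Current vLLM format: the placeholder is repeated once per image feature,
# which simplifies profiling but is awkward for users.
vllm_prompt = IMAGE_TOKEN * IMAGE_FEATURE_SIZE + question

# HuggingFace format: a single placeholder that the processor/model expands
# internally.
hf_prompt = IMAGE_TOKEN + "\n" + question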

Proposal
Most items in the above issues have been discussed and addressed in the original Llava1.5 PR as well as #3978. We propose a few high-level design decisions for the refactoring and welcome any feedback!

  1. Adding a processor util - We can leverage the out-of-the-box AutoProcessor from transformers, the same way we have been doing with the tokenizer, as an attribute of LLMEngine (e.g., self.multi_modal_processor = AutoProcessor.from_pretrained(model)). This allows us to support end-to-end inference with the API server as well as the LLM object. A minimal sketch of this idea follows.
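
As a rough sketch (not actual vLLM code; the attribute name and the simplified constructor are assumptions for illustration), the engine could build the processor from the same model ID as the tokenizer and keep it alongside:

from transformers import AutoProcessor, AutoTokenizer


class LLMEngine:
    # Simplified sketch of the proposed engine attributes.
    def __init__(self, model: str):
        # The tokenizer is already an attribute of the engine today; the
        # proposal adds a multi-modal processor next to it.
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.multi_modal_processor = AutoProcessor.from_pretrained(model)

For a LLaVA-style checkpoint such as llava-hf/llava-1.5-7b-hf, calling self.multi_modal_processor(text=prompt, images=image, return_tensors="pt") returns input_ids, attention_mask, and pixel_values in a single dictionary, which is exactly the kind of key-value mapping the pseudocode in item 2 wraps into MultiModalData.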

  2. Frontend input format: Because of 1, we can keep the same format as HuggingFace, since that is how users usually discover new models and it makes end-to-end integration testing easier. Preprocessing should be hidden away from the interface and the user. For example, this preprocessing step can be done inside LLMEngine.add_request(), around the same place as

    if arrival_time is None:
        arrival_time = time.time()
    prompt_token_ids = self.encode_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request)

    Here is some pseudocode:

if multi_modal_input is None:
    prompt_token_ids = self.encode_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request)
else:
    # preprocessed_inputs is a dictionary of key (str) -> value (tensor)
    # pairs produced by self.multi_modal_processor
    preprocessed_inputs = self.preprocess_request(
        request_id=request_id,
        prompt=prompt,
        prompt_token_ids=prompt_token_ids,
        lora_request=lora_request,
        multi_modal_input=multi_modal_input)
    prompt_token_ids = preprocessed_inputs.pop("input_ids")
    multi_modal_data = MultiModalData(data=preprocessed_inputs)
...

and thus at the LLM level, only image tensors will be required.

  3. Refactor MultiModalData: This object now simply holds the multi-modal data dictionary needed by the model_executable. At inference time, the data is unpacked in the forward pass - an approach similar to the transformers implementation of multi-modal models. (A sketch of items 3 and 4 follows this list.)
  4. Refactor VisionLanguageConfig: This config becomes a lot simpler. One caveat: when the image feature size can be dynamic, users may specify an optional max_feature_size to help the engine profile for the worst-case scenario and to potentially abort certain requests.
  5. Regarding the original image_feature-as-input-type design: IMO LLaVA is a special case among multi-modal models, since its vision encoder is detached from the language model and can be initialized separately. But one could argue the same for the MultiModalProjector, and passing image_feature (the output of CLIP) is a design decision that does not generalize to other models. Instead, passing multi-modal embeddings (the output of CLIP -> Projector) at inference time is more flexible and should work nicely with other models. (One follow-up question: does it make sense to define a separate LLaVA-no-CLIP module, since this is so specific to LLaVA, to make our lives easier?)
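
To make items 3 and 4 concrete, here is a minimal sketch of what the slimmed-down containers could look like. The field names are illustrative assumptions, and the VisionLanguageConfig sketch only shows the max_feature_size caveat (other fields, such as the input-type enum used in the example further below, are omitted):

from dataclasses import dataclass, field
from typing import Dict, Optional

import torch


@dataclass
class MultiModalData:
    # A thin container: just the processor outputs (e.g. pixel_values,
    # image_sizes) keyed by name, to be unpacked in the model's forward pass.
    data: Dict[str, torch.Tensor] = field(default_factory=dict)


@dataclass
class VisionLanguageConfig:
    # Optional worst-case feature size. When image feature sizes are dynamic,
    # this lets the engine profile memory for the worst case and abort
    # requests that would exceed it.
    max_feature_size: Optional[int] = None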

With the above changes, as an end user, you should ideally be able to do something like the following:

import requests
from PIL import Image

from vllm import LLM
from vllm.config import VisionLanguageConfig

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
llm = LLM(model=model_id, multi_modal_input_type=VisionLanguageConfig.IMAGE_INPUT_TYPE.IMAGE) # This can also be EMBEDDINGS

prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"

url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

llm.generate(prompt, ..., multi_modal_input=image)

Under the hood, the pipeline is

prompt, image
-> prompt_token_ids, MultiModalData(data=preprocessed_inputs) # through preprocess within engine.add_request() 
-> prompt_token_ids, pixel_values, image_sizes  # through unpacking in the implementation of the model's `forward`.
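
For the last step of the pipeline, here is a hedged sketch of how a model's forward pass could unpack the MultiModalData dictionary. The class name, signature, and dictionary keys follow the LLaVA-1.6-style example above and are illustrative only, not a fixed interface:

from typing import Optional

import torch
import torch.nn as nn


class LlavaNextSketch(nn.Module):
    # Illustrative only: how the model_executable might consume MultiModalData.
    def forward(
        self,
        input_ids: torch.Tensor,
        multi_modal_data: Optional["MultiModalData"] = None,
    ):
        if multi_modal_data is not None:
            # Unpack exactly what the processor produced for this model.
            pixel_values = multi_modal_data.data["pixel_values"]
            image_sizes = multi_modal_data.data["image_sizes"]
            # ... encode pixel_values with the vision tower, project them, and
            # merge the image embeddings into the text embeddings at the
            # <image> token positions ...
        # ... continue with the usual language-model forward pass ...
        ...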

I will follow up with a series of PRs for the refactoring, but please leave any feedback since this is a pretty significant interface change.
