Active Projects (help wanted!):
Update [11/18] - In the upcoming months, we will focus on performance optimization for multimodal models as part of the vLLM V1 engine re-arch effort.
P0 (We will definitely work on them):
- V1 re-arch for multimodal models - See high-level design (Slides, Doc)
- Core
- [1/N] [V1] Support VLMs with fine-grained scheduling #9871
- [2/N] [V1] Refactor model executable interface for all text-only language models #10374
- [3/N] [V1] Refactor model executable interface for multimodal models #10570
- [4/N] [V1] Initial support of multimodal models for V1 re-arch #10699
- [5/N] [V1][VLM] Proper memory profiling for image language models #11210
- [6/N] [V1] Add V1 support of Qwen2-VL #12128
- [7/N] Enable rest of single-modality LMMs on V1
- [V1][VLM] V1 support for selected single-image models. #11632 (Aria, BLIP-2, Chameleon, Fuyu)
- [VLM] Support Pixtral-HF on V1 #14275
- [V1] Extend beyond image modality and support mixed-modality inference with Llava-OneVision #11685
- [V1] Support audio language models on V1 #11733
- [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM #12069
- [VLM] Merged multi-modal processor and V1 support for Qwen-VL #12504
- [VLM] Implement merged multimodal processor and V1 support for idefics3 #12660
- [Model] MiniCPM-V/O supports V1 #15487
- [8/N] Enable mixed-modality inference on V1
- [9/N] Enable interleaved-modality inference on V1
- Multimodal prefix caching
- Multimodal input & embedding caching
- [V1] VLM preprocessor hashing #11020
- [VLM] Support caching in merged multi-modal processor #11396
- [VLM] Limit multimodal input cache by memory #14805
- [V1] Remove input cache client #14864
- Reuse multimodal embeddings from encoder cache
- Core
- [RFC]: Merge input processor and input mapper for multi-modal models #10114
P1 (We should be aware of these and spend some time if possible):
- More efficient multimodal input data processing
- Quantization for LMMs
- LoRA for LMMs
- Consolidate ViT attention backend
- V1 spec decode for VLMs
- Update developer facing documentation for V1 re-arch multimodal models.
P2 (We should work on these when they become more important/frequently requested):
- Enhance multimodal support for OpenAI-compatible server
- Next steps for Multimodal Llama
- Better encoder cache & compute budget strategy
- Better profiling strategy
- Prototype separating vision encoder to its own worker (fully disaggregated from decoder)
Update [9/8] - We have finished the majority of the refactoring and made extensive progress on supporting multimodal models. See details here.
Roadmap for Q3 2024
In the upcoming months, we will focus on enabling multimodal models to be compatible with other performance-related features on vLLM as well as collaborating with model vendors to directly onboard new multimodal models.
P0 (We will definitely work on them):
- [RFC]: Merge input processor and input mapper for multi-modal models #10114
- [0/N] Rename MultiModalInputs to MultiModalKwargs #10040
- [1/N] Initial prototype for multi-modal processor #10044
- [2/N] Convert LLaVA-1.5, Phi-3-Vision, Qwen2-VL and Ultravox to multi-modal processor as POC and add tests
- [3/N] Deprecate the old code for input processor/mapper so external developers have time to convert
- [4/N] Convert the rest of the built-in vLLM models to multi-modal processor
- [5/N] Remove the old code for input processor/mapper
- Proper chunked prefill with multimodal input
- Prefix caching with multimodal input
- Enable flamingo-style multimodal models (e.g., Multimodal Llama)
- Fully enable video input, and therefore, mixed multi-modal input
- Update OpenAI-compatible server to use OpenAI Audio API
- Multimodal embedding models
- [Model] VLM2Vec, the first multimodal embedding model in vLLM #9303
- [Model] Support E5-V #9576
- [Frontend] Chat-based Embeddings API #9759
- [Frontend] Use a proper chat template for VLM2Vec #9912
- [Model] Adding Support for Qwen2VL as an Embedding Model. Using MrLight/dse-qwen2-2b-mrl-v1 #9944
- [Frontend] Automatic detection of chat content format from AST #9919
- Shepherd model support directly from model vendor
- Pixtral #8377
- [Model][VLM] Add Qwen2-VL model support #7905
- [Model][VLM] Add LLaVA-Onevision model support #8486
- [Model] Add support for the multi-modal Llama 3.2 model #8811
- [Model] Add GLM-4v support and meet vllm==0.6.2 #9242
- [Model] Molmo vLLM Integration #9016
- [Model] Add Qwen2-Audio model support #9248
P1 (We should be aware of these and spend some time if possible):
- Better profiling strategy for multimodal models
- Multi-input support for more compatible models
- Better developer facing documentation for adding new models
- Add more multimodal models, and shepherd model support from community contributions
- Misc bug fixes
P2 (We should work on these when they become more important/frequently requested):
- Multimodal models with LoRA
- [Kernel][LoRA] Add assertion for punica sgmv kernels #7585
- [Model][LoRA]LoRA support added for MiniCPMV2.5 #7199
- [Model][LoRA]LoRA support added for MiniCPMV2.6 #8943
- [Model][LoRA]LoRA support added for Qwen #9622
- [Model][LoRA]LoRA support added for Qwen2VLForConditionalGeneration #10022
- [Model][LoRA]LoRA support added for idefics3 #10281
- [Feature]: LoRA support for Pixtral #8802
- [Feature]: LoRA support for InternVLChatModel #9495
- LoRA for VLM2Vec
- Quantized multimodal models
- [VLM] Post-layernorm override and quant config in vision encoder #9217
- [Bugfix] Fix prefix strings for quantized VLMs #9772
- [Model] Add BNB quantization support for Mllama #9720
- [Bugfix][VLM] Make apply_fp8_linear work with >2D input #9812
- [Model] Support bitsandbytes for MiniCPMV #9891
- [Model] Support quantization of PixtralHFTransformer for PixtralHF #9921
- Refactor currently supported multimodal models for dynamic ViT&LM loading
- Enable LM-only loading for multimodal models that support embeddings as input
- Multimodal benchmarking (Online & Offline)
- PP for multimodal models
- Extra input mapper/processor kwargs
- [Core][Frontend] Support Passing Multimodal Processor Kwargs #8657
- [Model] Expose Phi3v num_crops as a mm_processor_kwarg #8658
- [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg #8946
- [Model] support input embeddings for qwen2vl #8856
- [Core][Frontend] Add Support for Inference Time mm_processor_kwargs #9131
- OOT multimodal models
Update [7/3] - We have finished our 2nd refactoring milestone - see details here.
Roadmap for 3rd Milestone
In the upcoming months, we will focus on wrapping up the main goal of this refactoring RFC and supporting more models and modalities.
P0 (We will definitely work on these):
- Support image embeddings as input
- [Core][VLM] Support image embeddings as input #6613
- Support image embeddings for Fuyu and MiniCPM-V
- Support multiple multi-modal inputs whenever the model supports it (detailed plan)
- [VLM][Core] Support profiling with multiple multi-modal inputs per prompt #7126
- [Model] Add multi-image input support for LLaVA-Next offline inference #7230
- [Model][VLM] Support multi-images inputs for Phi-3-vision models #7783
- [Core][VLM] Stack multimodal tensors to represent multiple images within each prompt #7902
- [Model] Add Ultravox support for multiple audio chunks #7963
- Multi-image support for Chameleon & InternVL
- [Frontend][VLM] Add support for multiple multi-modal items in the OpenAI frontend #8049
- Merge at least 3 VLMs from the currently opened PRs
- [Model] Initial Support for Chameleon #5770, [VLM][Model] Support image input for Chameleon #6633
- [Model] Initial support for BLIP-2 #5920
- [Model] Adding support for MiniCPM-V #4087
- [Model] Initialize Fuyu-8B support #3924
- [Model] Initialize deepseek-vl support #5817
- [Model] Initialize support for InternVL2 series models #6514
- Better documentation
P1 (We should be aware of these and spend some time if possible):
- Aid support for Whisper with multimodal interface
- Custom vision prompt template in OpenAI-compatible server
- Sharding Vision Encoder & MultiModalProjector
- Bug Fixes
- Add more VLMs - See full List of vision models to implement
- Better error handling
P2 (We should work on these when they become more frequently requested) Help wanted!:
- Port over more vision encoders
- [Model] SiglipVisionModel ported from transformers #6942
- [Model] Refactor MiniCPMV #7020 (Idefics2VisionTransformer)
- Dynamic vision encoder and LM backbone
- VLM with Lora
- Quantized VLMs
- Add/aid support for models with other modalities
- Enable other features in vLLM with multi-modal models (e.g., chunked prefill, automatic prefix caching)
Update [6/11] - We have finished our 1st refactoring milestone - see details here.
Roadmap for 2nd Milestone
Some of the items @DarkLight1337, @xwjiang2010 and I are looking to work on as part of the next milestone are tentatively:
API Changes - A list of user-facing breaking changes can be found here
- Completely remove the need for specifying image-related arguments when launching the server, and infer configs from the model repo or a configmap in vLLM.
- Support dynamic image shape - This means the scheduler will need to know in advance the final shape of multi-modal embeddings that are processed right before being passed to the language model.
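As a back-of-the-envelope illustration of why the scheduler needs this shape up front (the numbers below assume a CLIP ViT-L/14 encoder at 336x336 resolution, as used by LLaVA-1.5; other models differ):
image_size, patch_size = 336, 14
num_image_tokens = (image_size // patch_size) ** 2  # 24 * 24 = 576 embeddings per image
# With dynamic image shapes this count varies per request, so the scheduler
# must compute (or be told) it before the vision encoder ever runs.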
Performance related
- Port CLIPVisionModel
- Optimize CLIPAttention
- Optimize MultiModalProjector
- Blocks: [CI/Build] Update CPU tests to include all "standard" tests #5481
Model support - Add more vision language models, and better developer facing documentation
Some of the ideas that we should work on in the future:
- Make VLMs work with chunked prefill
- Unify tokenizer & multi-modal processor (so that we can leverage AutoProcessor from transformers)
- Prefix caching for images
- Streaming inputs of multi-modal data
As always, please provide feedback and feature requests in this issue. Suggestions and contributions are very welcome!
Original RFC
Multi-modality support was brought to vLLM recently, much thanks to https://github.com//pull/3042 from @xwjiang2010. Since then we have seen an increasing amount of interest in such models (judging by the number of related pull requests and issues). However, there are a few issues we should address with the current design before we bring in more features around multi-modality.
- VisionLanguageConfig and MultiModalData
  - Currently the multimodal input can be either pixel_values or image_features for simplicity. While this works well with LLaVA-1.5, where pixel_values is the only output of its CLIPImageProcessor, it does not work well for models whose more complicated preprocessing returns multiple outputs (e.g., LLaVA-1.6, Fuyu). Developers could add additional preprocessing inside the model implementation as a workaround, but that becomes unmaintainable over time.
  - The overhead of requiring image_feature_size, image_token_id and image_input_shape is pushed to the user, even though these can/should be inferred from the model & processor configs and should not be required at inference time.
- The current design assumes multi-modal inputs are already processed to be consumed by the model executable, but vLLM does not have a processor util. This blocks vision model support on the OpenAI API server for end-to-end inference.
- The current prompt format "<image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but it complicates the user experience compared to the HuggingFace format "<image>\n" + prompt, and that has caused some confusion about what is needed to make multi-modal models work on vLLM.
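To illustrate the gap between the two formats, here is a minimal sketch of how the HuggingFace-style prompt could be expanded into the internal repeated-placeholder form on the user's behalf (the helper name and the fixed feature size of 576 are assumptions for LLaVA-1.5, not existing vLLM code):
def expand_image_placeholder(prompt: str, image_feature_size: int = 576,
                             placeholder: str = "<image>") -> str:
    # Turn "<image>\nUSER: ..." into "<image>" * 576 + "\nUSER: ..."
    return prompt.replace(placeholder, placeholder * image_feature_size, 1)

hf_prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
internal_prompt = expand_image_placeholder(hf_prompt)  # 576 repeated placeholders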
Proposal
Most items in the above issues have been discussed and addressed in the original LLaVA-1.5 PR as well as #3978. We propose a few high-level design decisions for the refactoring and welcome any feedback!
- Adding a processor util - We can leverage the out-of-the-box AutoProcessor from transformers the same way we have been doing with the tokenizer, i.e., as an attribute of LLMEngine (e.g., self.multi_modal_processor = AutoProcessor(model)). This allows us to support end-to-end inference with the API server as well as the LLM object.
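For reference, a minimal sketch of the out-of-the-box transformers API this relies on (the model ID and prompt below are only examples; attaching the processor to LLMEngine is the proposal above, not existing vLLM code):
from transformers import AutoProcessor
from PIL import Image
import requests

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open(requests.get(
    "https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
inputs = processor(text="USER: <image>\nWhat's the content of the image? ASSISTANT:",
                   images=image, return_tensors="pt")
# inputs is dict-like: "input_ids", "attention_mask", "pixel_values", ...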
- Frontend input format: Because of 1, we can keep the same format as HuggingFace, since that is how users usually discover new models and it makes end-to-end integration testing easier. Preprocessing should be hidden away from the interface and the user. For example, this preprocessing step can be done inside LLMEngine.add_request(), around the same place as vllm/vllm/engine/llm_engine.py lines 385 to 391 (at a134ef6):
  if arrival_time is None:
      arrival_time = time.time()
  prompt_token_ids = self.encode_request(
      request_id=request_id,
      prompt=prompt,
      prompt_token_ids=prompt_token_ids,
      lora_request=lora_request)
Here's some pseudocode:
if multi_modal_input is None:
prompt_token_ids = self.encode_request(
request_id=request_id,
prompt=prompt,
prompt_token_ids=prompt_token_ids,
lora_request=lora_request)
else:
# preprocessed_inputs is a dictionary of key(str)-value(tensor)
# as output of self.multi_modal_processor
preprocessed_inputs = self.preprocess_request(
request_id=request_id,
prompt=prompt,
prompt_token_ids=prompt_token_ids,
lora_request=lora_request,
multi_modal_input=images)
prompt_token_ids = preprocessed_inputs.pop("input_ids")
multi_modal_data = MultiModalData(data=preprocessed_inputs)
...
and thus at the LLM level, only image tensors will be required.
- Refactor MultiModalData: Now this object simply holds the multi-modal data dictionary that we need for the model executable. At inference time, the data is unpacked in the forward pass - this approach is similar to the transformers implementation of multi-modal models. (See the sketch after this list.)
- Refactor VisionLanguageConfig: This config is a lot simpler now. One caveat is that when the image features can be dynamic, users may specify an optional max_feature_size to help the engine run profiling for the worst-case scenario, as well as to potentially abort certain requests.
- Regarding the original image_feature-as-input-type design: IMO LLaVA is a special case among multi-modal models, since its vision encoder is detached from the language model and can be initialized separately. One could argue the same for the MultiModalProjector, though, so passing image_feature (the output of CLIP) is a design decision that does not generalize well to other models. Instead, passing multi-modal embeddings (the output of CLIP -> Projector) at inference time is more flexible and should work nicely with other models. (One follow-up question: does it make sense to define a separate Llava-no-clip module, since this is so specific to LLaVA, to make our life easier?)
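Here is a minimal sketch of my reading of the refactored MultiModalData (illustrative only, not the final vLLM definition):
from dataclasses import dataclass, field
from typing import Dict
import torch

@dataclass
class MultiModalData:
    # key (str) -> value (tensor): exactly what the multi-modal processor returned
    data: Dict[str, torch.Tensor] = field(default_factory=dict)

# At inference time the dictionary is unpacked in the model's forward pass,
# e.g. model(input_ids, ..., **multi_modal_data.data), so each model consumes
# whatever keys its processor produces (pixel_values, image_sizes, ...).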
With the above changes, as an end user, you should ideally be able to do something like the following:
from PIL import Image
import requests
from vllm import LLM
from vllm.config import VisionLanguageConfig

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
llm = LLM(model=model_id, multi_modal_input_type=VisionLanguageConfig.IMAGE_INPUT_TYPE.IMAGE)  # This can also be EMBEDDINGS
prompt = "<image>\nUSER: What's the content of the image?\nASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
llm.generate(prompt, ..., multi_modal_input=image)
Under the hood, the pipeline is
prompt, image
-> prompt_token_ids, MultiModalData(data=preprocessed_inputs) # through preprocess within engine.add_request()
-> prompt_token_ids, pixel_values, image_sizes # through unpacking in the implementation of the model's `forward`.
I will follow up with a series of PRs for this refactoring, but please leave any feedback, since this is a pretty significant interface change.