[RFC]: Multi-modality Support Refactoring #4194
Comments
Thank you for kickstarting this conversation!

Re: Issues

I fully agree with the issues you have pointed out. I would like to add that the current prompt format is hardly extensible for multi-image input if we plan to pursue that further down the line. In #3978, I have proposed some ways of tackling the issue at the level of the OpenAI-compatible server. I have thought about them more and have decided that they alone cannot provide the required flexibility, as explained below:

I am not confident that this assumption would hold for very long, given the fast-changing pace of the field.

I feel that this should be limited to cases where we only have to pass a single … This is not to mention that you still have to manually duplicate the …

Re: Proposals

Here are my own thoughts on each proposal:

1. Adding a processor util

I think that we should move this responsibility outside of the …

2. Frontend input format

My comments on this are similar to those for Proposal 1. However, #4197 only refactors …

3. Refactor …
@ywang96 Thanks for driving the integration of more MM models into vLLM. 😍 It seems that there is no plan to refactor … In my view, we should prioritize this, with performance being my main consideration. By refactoring the vision encoder, we can establish an integration standard for MM models, similar to our LLM model integration. This will not only ensure inference performance but also provide integration guidelines for the community. If I misunderstand, please correct me. Thanks for your work again!
Generally, I agree with @DarkLight1337's opinion about moving the processing logic out of … For example, …
cc @robertgshaw2-neuralmagic @mgoin (since NM has planned to work on Whisper)

Thank you all for the feedback so far! I plan to address the feedback altogether after meeting up with the core devs as well as getting more perspectives from other community members who are working or plan to work on multi-modal models. Some quick ones that I can answer now:
@jeejeelee This will need to be done regardless since it's inside the model implementation, and this RFC is more around how we want to support multi-modal models in general, and thus focuses on the interface and component pattern.
@DarkLight1337 @Isotr0py If this is just about where the processor should live, I'm indifferent between having it live inside
@DarkLight1337 That's correct too, but I'm worried that as the model gets more and more complicated, this approach might not be generalizable. |
Since LLMEngine already supports an output processor interface, e.g. SequenceGroupOutputProcessor, would it be reasonable to also add an InputProcessor interface within the engine? This way the engine can check for the existence of an input processor, but the implementation, in this case for LLaVA's single-image processing, can live outside of the engine. Its implementation could be as suggested, based on AutoProcessor. As for supporting processing of something apart from an Image tag or varying formats: the engine could only provide a generic input processor executor, and within the model executor's code it would be up to the model implementation to define an input processor and pass it on to the engine.
@Isotr0py Perhaps we could follow a registry pattern and have each model separately register how to preprocess the inputs? If the model does not do so, then the default implementation would be to pass the data to HuggingFace processors. |
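A minimal sketch of what such a registry could look like (the names here are illustrative assumptions, not vLLM's actual interface):

```python
from typing import Callable, Dict, Type

from transformers import AutoProcessor

# Hypothetical registry mapping a model class to its input-processing function.
_INPUT_PROCESSORS: Dict[Type, Callable] = {}


def register_input_processor(model_cls: Type):
    """Decorator for a model to register a custom input processor."""
    def wrapper(fn: Callable):
        _INPUT_PROCESSORS[model_cls] = fn
        return fn
    return wrapper


def process_input(model_cls: Type, model_name: str, prompt: str, image):
    """Apply the model-specific processor if registered; otherwise fall back
    to the HuggingFace AutoProcessor as the default implementation."""
    processor_fn = _INPUT_PROCESSORS.get(model_cls)
    if processor_fn is not None:
        return processor_fn(prompt, image)

    hf_processor = AutoProcessor.from_pretrained(model_name)
    return hf_processor(text=prompt, images=image, return_tensors="pt")
```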
Yes, I agree that we can use a processor registry to solve this. And it seems that …
I have added an implementation of the processor registry to #4197.

Edit: I have also moved the specification of dummy data (for profiling) to the top-level registry. Each model can define its own dummy data by registering a factory function.
To solve the prompt format problem for LLaVA, I think we have to also deal with generating the attention masks in the processing framework. That would mean abstracting some of the logic of |
Just a heads up that #4228 will introduce another vision language model to vLLM, so our discussion should take that into account as well. |
I discussed this with @zhuohan123 offline - in particular regarding this comment
If vLLM's going to use the out-of-box … (vllm/vllm/engine/llm_engine.py, lines 136 to 139 in 1543680),

then at inference time, depending on whether the request has multi-modal data or not, we process it with either … (IMO, eventually there really shouldn't be a separation between how we preprocess text data and multi-modal data, as they should all go through one InputProcessor class, but that is probably a bigger engineering refactoring that we can leave for later.)

We can also add an additional parameter at the engine level to indicate that we're feeding the engine an already processed dictionary of tensors, so the preprocessing step with … can be skipped.

@DarkLight1337 @Isotr0py WDYT? Do you see any issue with this design?
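For illustration, a rough sketch of the branching described above, including the flag for skipping preprocessing when the tensors are already processed (all names are assumptions, not vLLM's actual API):

```python
from transformers import AutoProcessor, AutoTokenizer


def prepare_inputs(model_name: str, prompt: str,
                   multi_modal_data=None, already_processed: bool = False):
    """Illustration of the proposed branching, not vLLM's actual code."""
    if multi_modal_data is None:
        # Text-only request: tokenize exactly as vLLM does today.
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        return {"input_ids": tokenizer(prompt).input_ids}

    if already_processed:
        # Caller has already run a processor and passes a dict of tensors;
        # skip the HuggingFace preprocessing step entirely.
        return multi_modal_data

    # Otherwise let the out-of-box processor produce the token ids and the
    # vision tensors (e.g. pixel_values) in one call.
    processor = AutoProcessor.from_pretrained(model_name)
    return processor(text=prompt, images=multi_modal_data, return_tensors="pt")
```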
This is somewhat similar to #4166, where I load the processing logic using …

I think one potential issue of this design is that the direct dependency on HuggingFace (which we have no control over) would complicate efforts to apply additional preprocessing specific to certain HuggingFace processors (e.g. to adapt to our interface). Since @Isotr0py's comment, I have refactored the code in #4197 into using a registry pattern to apply the preprocessor, so that …
@DarkLight1337 Thanks for sharing your thoughts! @zhuohan123 and I actually discussed the use of …

I think the point is that today … The original design of the prompt interface isn't very clean, and is very specific to …

I will also be working on a PR so we can cross-review each other's work.
One thing to add is that we would like to keep vLLM's end-user API easy to use. Having |
In this case, we would have to refactor the computation of attention masks so that it can accept single |
Regarding #4228, I think there may be a situation where some MM models don't have a Processor implemented.
@DarkLight1337 IMO, a possible solution is that we can inherit from and modify the LLaVA processor to handle the num_features calculation and …
I like the idea of simply inheriting from the existing HuggingFace processor. How should we ensure that our implementation is loaded instead of the HuggingFace one? |
Also, I think that we should wrap the input prompt to …

Edit: Opened #4328
Edit: Nevermind, it's just a typo in the chat template I passed to the command for running the OpenAI-compatible server. To avoid such confusion in the future, I have opened #4292 to detect whether the string looks like a file path. |
I think we can refer to something like this:

```python
from typing import Optional

from transformers import AutoProcessor
from transformers.processing_utils import ProcessorMixin


def get_processor(model: str,
                  model_type: str,
                  trust_remote_code: bool,
                  revision: Optional[str] = None,
                  code_revision: Optional[str] = None) -> ProcessorMixin:
    # _PROCESSOR_REGISTRY (defined elsewhere) maps model types to
    # vLLM-side processor overrides.
    if model_type in _PROCESSOR_REGISTRY:
        processor_class = _PROCESSOR_REGISTRY[model_type]
        processor = processor_class.from_pretrained(
            model, revision=revision, code_revision=code_revision)
        return processor

    # Otherwise fall back to HuggingFace's AutoProcessor.
    try:
        processor = AutoProcessor.from_pretrained(
            model,
            trust_remote_code=trust_remote_code,
            revision=revision,
            code_revision=code_revision)
    except ValueError as e:
        # do something else, e.g. surface a hint about trust_remote_code
        raise e
    return processor
```
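A short usage sketch for the function above, assuming _PROCESSOR_REGISTRY is defined elsewhere and using an example model name:

```python
from PIL import Image

# Resolve the processor: a vLLM-registered override if present,
# otherwise HuggingFace's AutoProcessor.
processor = get_processor("llava-hf/llava-1.5-7b-hf",
                          model_type="llava",
                          trust_remote_code=False)

image = Image.open("example.jpg")
inputs = processor(text="USER: <image>\nWhat is in this picture? ASSISTANT:",
                   images=image,
                   return_tensors="pt")
```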
To be honest, I'm not a big fan of having to potentially add multiple files in different places* for each new model, but I guess that would work for now. Further down the line, we could consider adopting a more explicit interface for adding new models to vLLM. *Currently, we have to add a new file in |
Notably, in that example the first prompt has 2 images and the second prompt has 1 image. Our input mapper processes each prompt separately, so the image processor will output … In this case, a list of two elements is inputted as …, making the logic more akin to …
@ywang96 @DarkLight1337 I want to confirm again: is anyone currently starting work on VLM + LoRA? If not, I'm willing to try.
I don't think so - please feel free to work on it! |
Not that I'm aware of. Thanks for the help! |
Is anyone working on prefix caching for multimodal inputs? I just finished the video support and am planning to start working on prefix caching.
@TKONIY Not to my knowledge. Can you make an RFC issue about the high-level design you have in mind for discussion? For multimodal inputs, there are actually two possible layers of caching:
For 1, this is somewhat already addressed by #6613, since users can then implement their own embedding caching outside vLLM.

For 2, you can read more about it here. The technical challenge is that currently each block of KV cache is uniquely identified by the token ids within the block. However, for multimodal data, their representation will always be the placeholder token ids in the original sequence. I think if we're able to address this problem and make it work with embedding-based inputs, then this would benefit vLLM in a bigger scope if we decide to support embeddings as input for LMs eventually (i.e., #6869).
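To make the block-identity problem concrete, here is a toy sketch (purely an assumption for discussion, not a vLLM design) of how a block hash could mix in a digest of the multimodal content instead of relying only on the placeholder token ids:

```python
import hashlib
from typing import Optional, Sequence


def block_hash(token_ids: Sequence[int],
               prev_block_hash: Optional[str],
               mm_content_digest: Optional[str] = None) -> str:
    """Toy block identity for prefix caching.

    Text-only blocks hash their token ids chained with the previous block's
    hash. Blocks covering image placeholder tokens additionally mix in a
    digest of the underlying image/embedding, so two different images that
    expand to the same placeholder ids do not collide.
    """
    h = hashlib.sha256()
    if prev_block_hash is not None:
        h.update(prev_block_hash.encode())
    h.update(repr(tuple(token_ids)).encode())
    if mm_content_digest is not None:
        h.update(mm_content_digest.encode())
    return h.hexdigest()


# Two blocks with identical placeholder token ids but different images
# resolve to different cache entries.
ids = [32000] * 16  # e.g. 16 image placeholder tokens
print(block_hash(ids, None, "sha256-of-image-A") ==
      block_hash(ids, None, "sha256-of-image-B"))  # False
```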
Thanks for the introduction! In terms of the identifier, I will try to figure out a solution and open an RFC.
With the release of v0.6.0, it's a good time now to wrap up the recent work on multi-modality! In the past two months, we have made tremendous progress in multi-modality! On behalf of the vLLM team, @DarkLight1337 and I would like to thank all the community members for their amazing contributions to this workstream! To summarize the update:
We're also very excited about the upcoming video support with a dynamic number of frames (@TKONIY) and Qwen2-VL model support (@fyabc from the Qwen team) that will be available in the 0.6.1 release! As usual, the roadmap for this workstream will be updated in the OP of this issue in the upcoming week. Feedback and contributions are always very welcome!
Multi-image/video support for Qwen2-VL & InternVL2, thank you!
A friendly bump that our roadmap for multimodality has been updated in the OP of this thread! |
@ywang96 |
Looks like an incompatibility in |
Can you show the error in this case? |
I believe this is still some incompatibility issue, since … Instead, you should ask LLaMA-Factory to support newer versions of vLLM.
I don't quite get what you mean; how can you have different versions of torch for CPU and GPU at the same time?
only cuda torch
If your internet connection is not good, you are actually lucky, because the install will fail during the process of forcibly replacing the CUDA torch with the CPU one. If you have a good internet connection, things become much worse: your torch will be downgraded from CUDA to a lower CPU-only version.
What is your original version of pytorch? |
@DarkLight1337 torch Version: 2.5.0+cu124 |
Can you raise this in a new issue (with installation tag) so we can better focus on this? |
[Open issues - help wanted!]
Update [9/8] - We have finished the majority of the refactoring and made extensive progress in supporting multimodal models. See details here.
In the upcoming months, we will focus on enabling multimodal models to be compatible with other performance-related features on vLLM as well as collaborating with model vendors to directly onboard new multimodal models.
P0 (We will definitely work on them):
P1 (We should be aware of these and spend some time if possible):
P2 (We should work on these when they become more important/frequently requested):
Update [7/3] - We have finished our 2nd refactoring milestone - see details here.
Roadmap for 3rd Milestone
In the upcoming months, we will focus on wrapping up the main goal of this refactoring RFC and supporting more models and modalities.

P0 (We will definitely work on these):

P1 (We should be aware of these and spend some time if possible):
- max_model_len for multimodal models #7998
- max_num_batched_tokens for multimodal models #8028

P2 (We should work on these when they become more frequently requested) Help wanted!:
- Idefics2VisionTransformer

Update [6/11] - We have finished our 1st refactoring milestone - see details here.
Roadmap for 2nd Milestone
Some of the items @DarkLight1337, @xwjiang2010 and I are looking to work on as part of the next milestone are tentatively:

API Changes: A list of user-facing breaking changes can be found here
- image_input_type from VLM config #5852

Performance related
- CLIPVisionModel
- gelu to CPU #5717
- CLIPAttention
- MultiModalProjector
Model support - Add more vision language models, and better developer facing documentation
Some of the ideas that we should work on in the future:
- AutoProcessor from transformers

As always, please provide feedback and feature requests in this issue. Suggestions and contributions are very welcome!
Original RFC

Multi-modality support was brought to vLLM recently, much thanks to https://github.com//pull/3042 from @xwjiang2010. Since then we have seen an increasing amount of interest in such models (judging from the number of related pull requests and issues). However, there are a few issues we should address with the current design before we bring in more features around multi-modality.

VisionLanguageConfig and MultiModalData

- Currently the multimodal input can be either pixel_values or image_features for simplicity. While this works well with LLaVA 1.5, where pixel_values are the only output from its CLIPImageProcessor, it does not work well when it comes to supporting models with more complicated preprocessing that returns multiple outputs (e.g., LLaVA 1.6, Fuyu, etc.). Developers could add additional preprocessing inside the model implementation as a workaround, but this will be unmaintainable over time.
- The overhead of requiring image_feature_size, image_token_id and image_input_shape is pushed to the user, when these can/should be inferred from the model & processor config and not required at inference time.
- The current design assumes multi-modal inputs are already processed to be consumed by the model executable, but vLLM does not have a processor util. This blocks vision model support on the OpenAI API server for end-to-end inference.
- The current prompt format "<Image>" * 576 + prompt makes the underlying implementation easier (especially when it comes to profiling), but complicates the user experience compared to the HuggingFace format "<Image>\n" + prompt, and that has caused some confusion about what's needed to make multi-modal models work on vLLM.

Proposal
Most items in the above issues have been discussed and addressed in the original LLaVA 1.5 PR as well as #3978. We propose a few high-level design decisions for the refactoring and welcome any feedback!

1. Adding a processor util - We can leverage the out-of-box AutoProcessor from transformers the same way we have been doing with the tokenizer, as an attribute of LLMEngine (e.g., self.multi_modal_processor = AutoProcessor(model)). This allows us to support end-to-end inference with the API server as well as the LLM object.

2. Frontend input format: Because of 1, we can keep the same format as HuggingFace, since that's how users usually discover new models and it makes end-to-end integration tests easier. Preprocessing should be hidden away from the interface and the user. For example, this preprocessing step can be done inside LLMEngine.add_request(), around the same place as vllm/vllm/engine/llm_engine.py lines 385 to 391 in a134ef6. Here's a pseudocode:
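(For illustration only, a rough sketch of this flow; the attribute and argument names below are assumptions rather than vLLM's actual interface.)

```python
from transformers import AutoProcessor, AutoTokenizer


class EngineSketch:
    def __init__(self, model: str):
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        # Proposal 1: the processor lives on the engine, like the tokenizer.
        self.multi_modal_processor = AutoProcessor.from_pretrained(model)

    def add_request(self, prompt: str, multi_modal_data=None):
        if multi_modal_data is None:
            # Text-only request: tokenize as usual.
            return {"input_ids": self.tokenizer(prompt).input_ids}
        # Multi-modal request: the HuggingFace-style prompt ("<image>\n" + text)
        # and raw image go through the processor here, hidden from the user.
        return self.multi_modal_processor(text=prompt,
                                          images=multi_modal_data,
                                          return_tensors="pt")
```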
and thus at the LLM level, only image tensors will be required.

3. MultiModalData: Now this object simply holds the multi-modal data dictionary that we need for the model_executable. At inference time, data is unpacked in the forward pass - this approach is similar to the transformers implementation of multi-modal models.

4. VisionLanguageConfig: This config is a lot simpler now. One caveat is that sometimes, when the image features can be dynamic, users may specify an optional max_feature_size to help the engine run the profiling for the worst-case scenario as well as to potentially abort certain requests.

5. image_feature as input type design: IMO LLaVA is a special case among multi-modal models since its vision encoder is detached from the language model and can be initialized separately, but in this case one could argue the same for the MultiModalProjector, and perhaps passing image_feature (the outputs of CLIP) is a design decision not generalizable to all other models. Instead, passing multi-modal embeddings (outputs of CLIP -> Projector) at inference time is more flexible and should work nicely with other models. (One follow-up question is: does it make sense to actually define a separate Llava-no-clip module, since this is so specific to LLaVA, to make our life easier?)

With the above changes, as an end-user, you should then ideally be able to do something like the following:
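(A hedged sketch of what such an end-user call might look like under this proposal; the model name, prompt format, and multi_modal_data argument shape are illustrative assumptions, not the actual vLLM API.)

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Purely illustrative: a HuggingFace-style prompt plus a raw image,
# with all preprocessing handled inside the engine.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate(
    "USER: <image>\nWhat is shown in this image? ASSISTANT:",
    SamplingParams(max_tokens=64),
    multi_modal_data=image,  # hypothetical argument: raw image in
)
print(outputs[0].outputs[0].text)
```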
Under the hood, the pipeline is
I will follow up with a series of PRs for the refactoring, but please leave any feedback, since this is a pretty significant interface change.