[Core][VLM] Support image embeddings as input #6613
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀
@Isotr0py It looks like the generation of image embeddings from pixel values and the merging with text embeddings are currently tied together under a single method.
@ywang96 OK, I will decouple them tonight. (Sorry, I don't have bandwidth during the daytime.)
No rush at all, and thank you for helping out!
@DarkLight1337 Please give this PR a first pass - I have updated all vision language models except two:
One observation is that we are able to use the same engine that's profiled with pixel-value dummy data to support image embeddings as input, since GPU memory usage should be lower when image embeddings are fed to the language model directly (no ViT activations are needed). We will add support for initializing only the language backbone in a later PR, since the profiling will be different without the ViT. I plan to add a test only for LLaVA with image embeddings as input, since it's not worth testing this feature on all models when the logic is the same. On a side note, I've standardized the code organization across all files: constants and input types are always at the top, and the inference pipeline in the model follows this pattern:

```python
image_input = self._parse_and_validate_image_input(**kwargs)

if image_input is not None:
    vision_embeddings = self._process_image_input(image_input)
    inputs_embeds = self.language_model.model.get_input_embeddings(
        input_ids)
    inputs_embeds = merge_vision_embeddings(
        input_ids, inputs_embeds, vision_embeddings,
        self.config.image_token_index)
    input_ids = None
else:
    inputs_embeds = None
```
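For readers unfamiliar with the helper, here is a minimal conceptual sketch of what a merge step like `merge_vision_embeddings` does (a standalone illustration, not vLLM's actual implementation; the real helper also handles multi-image batching and extra validation):

```python
import torch

def merge_vision_embeddings_sketch(
    input_ids: torch.Tensor,          # (num_tokens,)
    inputs_embeds: torch.Tensor,      # (num_tokens, hidden_size)
    vision_embeddings: torch.Tensor,  # (num_image_tokens, hidden_size)
    image_token_index: int,
) -> torch.Tensor:
    # Placeholder positions that the input processor reserved for the image.
    mask = input_ids == image_token_index
    assert int(mask.sum()) == vision_embeddings.shape[0], (
        "number of image placeholder tokens must match the vision embeddings")
    # Overwrite those positions so the LM consumes one mixed embedding sequence.
    inputs_embeds[mask] = vision_embeddings.to(inputs_embeds.dtype)
    return inputs_embeds
```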
LGTM. Just need to add a test involving embeddings input.
The only small change I would make is to add an
Oh, actually, we should update the input processors to allow embeddings as input. It appears that you've only updated the one for LLaVA-NeXT.
@DarkLight1337 This PR is ready for final review. I have added a test with LLaVA 1.5 and updated the documentation.
Thanks for implementing this!
@ywang96 @DarkLight1337 Thanks for the update! I checked the demo code: we still provide the two modalities separately (prompt and images), and the merge process is still controlled only by the vLLM-supported VLM model, which is not that flexible if we want to use our own merge method. Can we do the following: just treat the customized VLM as a PURE language model?
I saw you also mentioned: Again, I really appreciate your help.
Hey @Andcircle! Thanks for reaching out! Yes, as you mentioned, what this PR does is allow image embeddings as input, so that users can process images into embeddings separately and won't need to load the vision encoder and projector together with the language model on the same host (which is the TODO). We do have a PR from a community member regarding allowing embeddings as input (#6869). This touches a broader aspect of the framework that we're still evaluating, but perhaps we should prioritize it given that this is such a popular request. I assume that if we support this feature, you will only need to initialize the language model itself (versus initializing a VLM but only loading the language model). Please let me know if I missed anything!
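For concreteness, here is a rough sketch of how the image embeddings could be produced offline (on a separate host) with the Hugging Face LLaVA-1.5 implementation. The model ID, file names, the feature layer (-2), and the CLS-token dropping below are assumptions matching LLaVA-1.5's defaults; other models differ:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

MODEL_ID = "llava-hf/llava-1.5-7b-hf"

model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16).to("cuda")
processor = AutoProcessor.from_pretrained(MODEL_ID)

image = Image.open("example.jpg")  # hypothetical local image
pixel_values = processor.image_processor(
    images=image, return_tensors="pt")["pixel_values"].to("cuda", torch.float16)

with torch.inference_mode():
    vision_out = model.vision_tower(pixel_values, output_hidden_states=True)
    # LLaVA-1.5 takes the second-to-last hidden layer and drops the CLS token.
    patch_features = vision_out.hidden_states[-2][:, 1:]
    image_embeds = model.multi_modal_projector(patch_features)  # (1, 576, 4096)

# Ship only this tensor to the host running the language model.
torch.save(image_embeds.squeeze(0).cpu(), "image_embeds.pt")
```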
Yep! I totally agree with you that allowing embeddings as input to the LM is the most flexible option. Just to be transparent, there are a few reasons why we haven't supported it yet; for example, how do we make prefix caching work with embeddings as input? Today we use the prompt token ids as the hash/identifier of the prompt prefix, so this feature wouldn't work at all for embeddings as input. Anyways, perhaps we could pursue enabling it as an experimental feature and see how it goes!
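To illustrate the concern, here is a simplified sketch of token-id-based prefix hashing (not vLLM's actual implementation): each cached KV block is identified by the token ids it covers plus everything before it, so a request that supplies raw embeddings has nothing to hash.

```python
import hashlib
from typing import Optional, Sequence

BLOCK_SIZE = 16  # tokens per cached KV block (illustrative)

def block_hash(prev_hash: Optional[str], token_ids: Sequence[int]) -> str:
    # A block's identity depends on its own tokens and on all preceding blocks.
    payload = f"{prev_hash}:{','.join(map(str, token_ids))}"
    return hashlib.sha256(payload.encode()).hexdigest()

def prefix_hashes(prompt_token_ids: Sequence[int]) -> list:
    hashes, prev = [], None
    for i in range(0, len(prompt_token_ids), BLOCK_SIZE):
        prev = block_hash(prev, prompt_token_ids[i:i + BLOCK_SIZE])
        hashes.append(prev)
    return hashes

# Two requests sharing a long system prompt produce identical leading hashes,
# so their KV blocks can be reused. With embeddings-only input there are no
# token ids, so no such match is possible.
```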
@ywang96 Sorry, naive question: what do you mean by prompt prefix? The repeated wording in each prompt?
@Andcircle Yes! For example, a long system prompt from the chat template that's shared by multiple requests. Feel free to read more here: https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html
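As a quick illustration of the shared-prefix case (a minimal sketch; the model name and prompts are hypothetical):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct",
          enable_prefix_caching=True)

system_prompt = "You are a helpful assistant. " * 100  # long shared prefix
questions = ["What is vLLM?", "What is PagedAttention?"]

# The KV cache computed for the shared system prompt is reused across requests.
outputs = llm.generate(
    [system_prompt + q for q in questions],
    SamplingParams(max_tokens=64),
)
```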
@ywang96
@Isotr0py Hey, do you think it makes sense to support image embeddings for Fuyu? (Currently we cannot easily do it since the embedding creation is tied to the tokenizer.) We don't have to, but I just want to get your opinion since you added the support for it. Thanks!
@ywang96 Also, for this highly customized pipeline, I think it's OK NOT to have system prefix caching =)
@ywang96
From the vLLM side, I suggest modifying
That's exactly what I want to do.
I was about to add this feature to Qwen2-VL, but unfortunately I've run into some difficulties. The processing code hits `height, width = get_image_size(image, channel_dim=input_data_format)`, and here, if we just return image embeds, it raises an error. Might we need to pass through more parameters for Qwen2-VL? Please give me some tips.
Can you open a new issue for this? (since it's specific to Qwen2-VL) |
This PR adds support for passing image embeddings as input so that they can be directly consumed by the language model.
Example usage
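A minimal sketch of the intended usage (the prompt format and the embedding shape below assume LLaVA-1.5; `image_embeds.pt` is a hypothetical file holding precomputed image features from the vision encoder and projector):

```python
import torch
from vllm import LLM

llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# Precomputed image features; for LLaVA-1.5-7B this would be a tensor of
# shape (image_feature_size, hidden_size) = (576, 4096).
image_embeds = torch.load("image_embeds.pt")

prompt = "USER: <image>\nWhat is shown in this image?\nASSISTANT:"
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})
print(outputs[0].outputs[0].text)
```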
FIXES #6604
Follow-up TODO: Support initializing the VLM with only the language model backbone.
PR Checklist
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.).
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.).
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Please use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- The reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!