
[Core][VLM] Support image embeddings as input #6613

Merged: 30 commits, Aug 12, 2024

Conversation

@ywang96 (Member) commented Jul 21, 2024

This PR adds support for passing image embeddings as input so that they can be consumed directly by the language model.

Example usage

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Image embeddings generated by a separate vision tower component, to be
# merged directly with the text embeddings.
image_embeds: torch.Tensor = ...  # shape: (1, image_feature_size, hidden_size of the LM)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})
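For concreteness, here is a minimal runnable sketch of the flow above. The model name and tensor shape are illustrative assumptions (576 image tokens and a hidden size of 4096 roughly match LLaVA-1.5-7B); a random tensor stands in for the output of a real vision tower, so the generated text will be meaningless.

import torch
from vllm import LLM

# Assumed model; check the supported-models list for your vLLM version.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Stand-in for embeddings produced by a separately hosted vision tower +
# projector. Shape and dtype must match what the language model expects.
image_embeds = torch.randn(1, 576, 4096, dtype=torch.float16)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})
print(outputs[0].outputs[0].text)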

FIXES #6604

Follow-up TODO: Support initializing VLM with only language model backbone.


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and the Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and use the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and may not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient, and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 self-assigned this on Jul 21, 2024
@ywang96 (Member, Author) commented Jul 21, 2024

@Isotr0py It looks like the generation of image embeddings from pixel values and the merging with text embeddings are currently tied together under Phi3HDImageEmbedding. Could you take a look at decoupling them for this PR?

@Isotr0py (Collaborator) commented

@ywang96 OK, I will decouple them tonight. (Sorry, I don't have bandwidth during the day.)

@ywang96 (Member, Author) commented Jul 21, 2024


No rush at all, and thank you for helping out!

@ywang96 marked this pull request as ready for review on August 9, 2024 06:50
@ywang96 (Member, Author) commented Aug 9, 2024

@DarkLight1337 Please give this PR a first pass - I have updated all vision language models except two:

  • Chameleon (since the model itself is already input-embedding based).
  • MiniCPMV (because the current implementation supports multi-image inputs; IMO embedding support should be added after we support multi-image input generally).

One observation is that we are able to use the same engine that is profiled with pixel-value dummy data to support image embeddings as input, since GPU memory usage should be lower when image embeddings are fed to the language model directly (no ViT activations needed). We will add support for initializing only the language backbone in a later PR, as the profiling will be different without the ViT.

I plan to add a test only for LLaVA with image embeddings as input, since it's not worth testing this feature on all models when the logic is the same.

On a side note, I've standardized the code organization across all files. Constants and input types will always be at the top, and the inference pipeline in the model forward will always follow the pattern below:

image_input = self._parse_and_validate_image_input(**kwargs)

if image_input is not None:
    # Either run the vision tower on pixel values or pass through the
    # user-provided image embeddings.
    vision_embeddings = self._process_image_input(image_input)
    inputs_embeds = self.language_model.model.get_input_embeddings(
        input_ids)

    # Splice the vision embeddings into the placeholder token positions.
    inputs_embeds = merge_vision_embeddings(
        input_ids, inputs_embeds, vision_embeddings,
        self.config.image_token_index)

    # The language model consumes inputs_embeds directly from here on.
    input_ids = None
else:
    inputs_embeds = None
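For readers unfamiliar with the merging step: conceptually, merge_vision_embeddings overwrites the embeddings at the <image> placeholder positions with the projected vision features. A simplified sketch of that idea (not the exact vLLM implementation):

import torch

def merge_vision_embeddings_sketch(
        input_ids: torch.Tensor,          # (num_tokens,)
        inputs_embeds: torch.Tensor,      # (num_tokens, hidden_size)
        vision_embeddings: torch.Tensor,  # (num_image_tokens, hidden_size)
        image_token_index: int) -> torch.Tensor:
    # Find the positions of the image placeholder tokens in the prompt.
    mask = input_ids == image_token_index
    assert int(mask.sum()) == vision_embeddings.shape[0], (
        "number of image placeholder tokens must match the number of "
        "vision embedding rows")
    # Overwrite those positions with the projected image features.
    inputs_embeds[mask] = vision_embeddings.to(inputs_embeds.dtype)
    return inputs_embeds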

@DarkLight1337 (Member) left a comment

LGTM. Just need to add a test involving embeddings input.

@DarkLight1337 (Member) commented

The only small change I would make is to add an assert_never guard at the end of each _parse_and_validate_image_input function to make sure that we have handled all of the cases.
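For illustration, a minimal sketch of that exhaustive-handling pattern (the TypedDict and helper names here are placeholders, not necessarily the exact ones used in this PR):

from typing import Literal, Optional, TypedDict, Union

import torch
from typing_extensions import assert_never


class ImagePixelInputs(TypedDict):
    type: Literal["pixel_values"]
    data: torch.Tensor


class ImageEmbeddingInputs(TypedDict):
    type: Literal["image_embeds"]
    data: torch.Tensor


ImageInputs = Union[ImagePixelInputs, ImageEmbeddingInputs]


def parse_and_validate_image_input(**kwargs) -> Optional[ImageInputs]:
    pixel_values = kwargs.pop("pixel_values", None)
    image_embeds = kwargs.pop("image_embeds", None)

    if pixel_values is None and image_embeds is None:
        return None
    if pixel_values is not None:
        return ImagePixelInputs(type="pixel_values", data=pixel_values)
    if image_embeds is not None:
        return ImageEmbeddingInputs(type="image_embeds", data=image_embeds)

    # Unreachable if every supported input kind is handled above.
    raise AssertionError("This line should be unreachable.")


def process_image_input(image_input: ImageInputs) -> torch.Tensor:
    if image_input["type"] == "image_embeds":
        # Embeddings are passed through to the language model as-is.
        return image_input["data"]
    if image_input["type"] == "pixel_values":
        # The real model would run the vision tower + projector here.
        raise NotImplementedError("vision tower not shown in this sketch")
    # Exhaustiveness guard: fails loudly if a new input kind is added
    # but not handled above.
    assert_never(image_input)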

@DarkLight1337 (Member) commented

Oh, actually, we should update the input processors to allow embedding inputs. It appears that you've only updated the one for LLaVA-NeXT.

@ywang96 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 12, 2024
@ywang96 (Member, Author) commented Aug 12, 2024

@DarkLight1337 This PR is ready for final review. I have added a test with Llava 1.5 and updated the documentation.

@DarkLight1337 (Member) left a comment

Thanks for implementing this!

@DarkLight1337 merged commit e6e42e4 into vllm-project:main on Aug 12, 2024
48 checks passed
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
@Andcircle commented

Follow-up TODO: Support initializing VLM with only language model backbone.

@ywang96 @DarkLight1337 Thanks for the update!

I checked the demo code: we still provide the two modalities separately (prompt and images), and the merge process is still controlled by the VLM implementation that vLLM supports. It is not very flexible if we want to use our own merge method.

Can we do the following, so that we just treat a customized VLM as a PURE language model?

# Start from a pure language model, NOT an existing VLM
llm = LLM(model="mistral", ...)
# merge_inputs is a customized function provided by the user; this makes the process more flexible
inputs_embeds = merge_inputs(texts, images)
# The LLM only takes a batch of merged embeddings; it no longer cares whether the input is image / video / audio, it just treats it as pure language model input
outputs = llm.generate(inputs_embeds=inputs_embeds, ...)

I saw you also mentioned:
Follow-up TODO: Support initializing VLM with only language model backbone.
Is this the same as what I described above?

Again, really appreciate your help.

@ywang96 (Member, Author) commented Aug 22, 2024

Hey @Andcircle! Thanks for reaching out!

Yes, as you mentioned, what this PR does is allow image embeddings as input, so that users can process images into embeddings separately and won't need to load the vision encoder and projector together with the language model on the same host (which is the TODO).

We do have a PR from a community member regarding allowing embeddings as input: #6869. This touches a broader aspect of the framework that we're still evaluating, but perhaps we should prioritize it given that this is such a popular request. If we support this feature, I assume you will only need to initialize the language model itself (versus initializing a VLM but only loading the language model).

Please let me know if I missed anything!

@Andcircle commented

@ywang96
Thanks for your fast response!
Yes, I think #6869 should free us.

Just to clarify, #6869's use case can be much broader =) not only for embedding tuning as stated; it can be used for VLM research in a very flexible manner.

@ywang96 (Member, Author) commented Aug 22, 2024


Yep! I totally agree with you that allowing embeddings as input for the LM is the most flexible option.

Just to be transparent, there are a few considerations behind why we haven't supported it yet; for example, how do we make prefix caching work with embeddings as input? Today we use the prompt token ids as the hash/identifier of the prompt prefix, so this feature wouldn't work at all with embeddings as input.

Anyway, perhaps we could pursue enabling it as an experimental feature and see how it goes!
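To make the hashing point concrete, here is a toy illustration of token-id-based prefix hashing (simplified; this is not vLLM's actual hashing code, and the block size is arbitrary):

import hashlib
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def block_hashes(token_ids: Tuple[int, ...]) -> List[str]:
    """Hash each full block of prompt token ids together with its prefix,
    so requests sharing a token-id prefix map to the same cached blocks."""
    hashes = []
    prefix = b""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, num_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prefix += ",".join(map(str, block)).encode()
        hashes.append(hashlib.sha256(prefix).hexdigest()[:16])
    return hashes

# Two prompts that share their first two blocks of token ids reuse those blocks:
a = block_hashes(tuple(range(48)))
b = block_hashes(tuple(range(32)) + (999,) * 16)
assert a[:2] == b[:2] and a[2] != b[2]
# With embeddings as input there are no token ids to hash, so this
# identification scheme has nothing to key on.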

@Andcircle commented


@ywang96
Thanks for your explanation!

Sorry, naive question: what do you mean by prompt prefix? The repeated wording for each prompt?

@ywang96 (Member, Author) commented Aug 22, 2024


@Andcircle Yes! For example, a long system prompt from the chat template that's shared by multiple requests. Feel free to read more here https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html

@Andcircle commented

@ywang96
Will have a look, thanks =)

@ywang96 (Member, Author) commented Aug 28, 2024

@Isotr0py Hey, do you think it makes sense to support image embeddings for Fuyu? (Currently we cannot easily do it since the embedding creation is tied to the tokenizer.) We don't have to, but I just want to get your opinion since you added the support for it. Thanks!

@Andcircle commented

@ywang96
It would be great to support the Fuyu style.
But if we can somehow let users provide the embeddings themselves, like the pseudocode above, the embedding is already decoupled from the tokenizer, isn't it?

Also, for this highly customized pipeline, I think it's OK NOT to have system prefix caching =)

@whyiug (Contributor) commented Sep 24, 2024

@ywang96
This is a fantastic feature!
I've encountered a tricky problem.
I need to perform multiple VQA tasks (same image, different questions) using the same model architecture (e.g., Paligemma). In this case, I can use this feature to deploy image encoding and language model inference in different locations. This saves time by avoiding redundant image encoding for each language model inference.
However, how can I build an on-the-fly inference service (like a compatible server or something similar)? At this point, the solution I can think of is having the image encoding service write the encoded tensors to shared storage, while the language model inference service (vllm service) reads from the shared storage (which would involve modifying the vllm source code, building, and deploying).

@DarkLight1337 (Member) commented Sep 24, 2024


From the vLLM side, I suggest modifying vllm.multimodal.utils.get_and_parse_image and vllm.multimodal.utils.async_get_and_parse_image so that instead of the PIL image, they fetch the embeddings of the image corresponding to the given URL. There should be no need to modify the rest of the code.
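As a rough sketch of that approach (the function signatures and return types below are assumptions based on the suggestion above; verify them against the actual vllm.multimodal.utils in your version, and the shared-storage layout is entirely hypothetical):

import hashlib
import os

import torch

# Directory written by the separate image-encoding service (hypothetical).
EMBED_DIR = "/mnt/shared/image_embeds"


def _embed_path(image_url: str) -> str:
    # Assume the encoding service stores one tensor per image, keyed by a
    # hash of the image URL.
    key = hashlib.sha256(image_url.encode()).hexdigest()
    return os.path.join(EMBED_DIR, f"{key}.pt")


def get_and_parse_image(image_url: str):
    # Load the precomputed embedding instead of fetching and decoding a PIL
    # image. Returning a tensor under "image" makes the model treat it as
    # image-embedding input, per this PR.
    image_embeds = torch.load(_embed_path(image_url), map_location="cpu")
    return {"image": image_embeds}


async def async_get_and_parse_image(image_url: str):
    # Async wrapper; a real deployment might use an async file/object-store
    # client here instead.
    return get_and_parse_image(image_url)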

@whyiug (Contributor) commented Sep 24, 2024


That's exactly what I want to do.

@whyiug (Contributor) commented Sep 26, 2024

When I was about to add this feature to Qwen2-VL, I unfortunately ran into some difficulties.
For example, I can't just rely on the image embeddings to generate new prompt_token_ids without the original image. See here:

    height, width = get_image_size(image, channel_dim=input_data_format)

And here, if we just return image embeds, an error occurs: AssertionError: mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'.

Might we need to pass through more parameters for Qwen2-VL? Please give me some tips.

@DarkLight1337 (Member) commented


Can you open a new issue for this? (since it's specific to Qwen2-VL)

@whyiug (Contributor) commented Sep 26, 2024

I've created a new issue (#8857) and will be submitting the draft code.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues: [Feature]: MultiModal LLM with vector API