
[Core][VLM] Support image embeddings as input #6613

Merged: 30 commits, Aug 12, 2024

Conversation

@ywang96 (Member) commented Jul 21, 2024

This PR adds support for passing image embeddings as input so that they can be consumed directly by the language model.

Example usage

# Refer to the HuggingFace repo for the correct format to use
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Image embeddings generated by a separate vision tower component, to be
# merged directly with the text embeddings.
image_embeds: torch.Tensor = ...  # shape: (1, image_feature_size, hidden_size of the LM)

# Single prompt inference
outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})
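For concreteness, here is a minimal runnable sketch of the flow above. The model name and tensor shape are illustrative assumptions (576 image tokens and a hidden size of 4096 roughly match LLaVA-1.5-7B); a random tensor stands in for the output of a real vision tower, so the generated text will be meaningless.

import torch
from vllm import LLM

# Assumed model; check the supported-models list for your vLLM version.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Stand-in for embeddings produced by a separately hosted vision tower +
# projector. Shape and dtype must match what the language model expects.
image_embeds = torch.randn(1, 576, 4096, dtype=torch.float16)

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image_embeds},
})
print(outputs[0].outputs[0].text)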

FIXES #6604

Follow-up TODO: Support initializing VLM with only language model backbone.


PR Checklist

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain code quality and improves the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] For changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and the Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented so that future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies user-facing behavior of vLLM. This helps vLLM users understand and use the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and may not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient, and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, the PR will be assigned to a reviewer. Every reviewer will pick up the PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run the full CI, as it is required for merging (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337 self-assigned this on Jul 21, 2024
@ywang96 (Member, Author) commented Jul 21, 2024

@Isotr0py It looks like the generation of image embeddings from pixel values and the merging with text embeddings are currently tied together under Phi3HDImageEmbedding. Could you take a look at decoupling them for this PR?

@Isotr0py (Collaborator) commented

@ywang96 OK, I will decouple them tonight. (Sorry, I don't have bandwidth during the day.)

@ywang96 (Member, Author) commented Jul 21, 2024


No rush at all, and thank you for helping out!

@ywang96 marked this pull request as ready for review on August 9, 2024 06:50
@ywang96 (Member, Author) commented Aug 9, 2024

@DarkLight1337 Please give this PR a first pass - I have updated all vision language models except two:

  • Chameleon (since the model itself is already input-embedding based).
  • MiniCPMV (because the current implementation supports multi-image inputs; IMO embedding support should be added after we support multi-image input generally).

One observation is that we are able to use the same engine that is profiled with pixel-value dummy data to support image embeddings as input, since GPU memory usage should be lower when image embeddings are fed to the language model directly (no ViT activations needed). We will add support for initializing only the language backbone in a later PR, as the profiling will be different without the ViT.

I plan to add a test only for LLaVA with image embeddings as input, since it's not worth testing this feature on all models when the logic is the same.

On a side note, I've standardized the code organization across all files. Constants and input types will always be at the top, and the inference pipeline in the model forward will always follow the pattern below:

image_input = self._parse_and_validate_image_input(**kwargs)

if image_input is not None:
    # Either run the vision tower on pixel values or pass through the
    # user-provided image embeddings.
    vision_embeddings = self._process_image_input(image_input)
    inputs_embeds = self.language_model.model.get_input_embeddings(
        input_ids)

    # Splice the vision embeddings into the placeholder token positions.
    inputs_embeds = merge_vision_embeddings(
        input_ids, inputs_embeds, vision_embeddings,
        self.config.image_token_index)

    # The language model consumes inputs_embeds directly from here on.
    input_ids = None
else:
    inputs_embeds = None
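For readers unfamiliar with the merging step: conceptually, merge_vision_embeddings overwrites the embeddings at the <image> placeholder positions with the projected vision features. A simplified sketch of that idea (not the exact vLLM implementation):

import torch

def merge_vision_embeddings_sketch(
        input_ids: torch.Tensor,          # (num_tokens,)
        inputs_embeds: torch.Tensor,      # (num_tokens, hidden_size)
        vision_embeddings: torch.Tensor,  # (num_image_tokens, hidden_size)
        image_token_index: int) -> torch.Tensor:
    # Find the positions of the image placeholder tokens in the prompt.
    mask = input_ids == image_token_index
    assert int(mask.sum()) == vision_embeddings.shape[0], (
        "number of image placeholder tokens must match the number of "
        "vision embedding rows")
    # Overwrite those positions with the projected image features.
    inputs_embeds[mask] = vision_embeddings.to(inputs_embeds.dtype)
    return inputs_embeds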

@DarkLight1337 (Member) left a comment

LGTM. Just need to add a test involving embeddings input.

@DarkLight1337 (Member) commented

The only small change I would make is to add an assert_never guard at the end of each _parse_and_validate_image_input function to make sure that we have handled all of the cases.
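For illustration, a minimal sketch of that exhaustive-handling pattern (the TypedDict and helper names here are placeholders, not necessarily the exact ones used in this PR):

from typing import Literal, Optional, TypedDict, Union

import torch
from typing_extensions import assert_never


class ImagePixelInputs(TypedDict):
    type: Literal["pixel_values"]
    data: torch.Tensor


class ImageEmbeddingInputs(TypedDict):
    type: Literal["image_embeds"]
    data: torch.Tensor


ImageInputs = Union[ImagePixelInputs, ImageEmbeddingInputs]


def parse_and_validate_image_input(**kwargs) -> Optional[ImageInputs]:
    pixel_values = kwargs.pop("pixel_values", None)
    image_embeds = kwargs.pop("image_embeds", None)

    if pixel_values is None and image_embeds is None:
        return None
    if pixel_values is not None:
        return ImagePixelInputs(type="pixel_values", data=pixel_values)
    if image_embeds is not None:
        return ImageEmbeddingInputs(type="image_embeds", data=image_embeds)

    # Unreachable if every supported input kind is handled above.
    raise AssertionError("This line should be unreachable.")


def process_image_input(image_input: ImageInputs) -> torch.Tensor:
    if image_input["type"] == "image_embeds":
        # Embeddings are passed through to the language model as-is.
        return image_input["data"]
    if image_input["type"] == "pixel_values":
        # The real model would run the vision tower + projector here.
        raise NotImplementedError("vision tower not shown in this sketch")
    # Exhaustiveness guard: fails loudly if a new input kind is added
    # but not handled above.
    assert_never(image_input)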

@DarkLight1337 (Member) commented

Oh, actually, we should update the input processors to allow embedding inputs. It appears that you've only updated the one for LLaVA-NeXT.

@ywang96 added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Aug 12, 2024
@ywang96 (Member, Author) commented Aug 12, 2024

@DarkLight1337 This PR is ready for final review. I have added a test with Llava 1.5 and updated the documentation.

@DarkLight1337 (Member) left a comment

Thanks for implementing this!

@DarkLight1337 merged commit e6e42e4 into vllm-project:main on Aug 12, 2024
48 checks passed
sfc-gh-mkeralapura pushed a commit to sfc-gh-mkeralapura/vllm that referenced this pull request Aug 12, 2024
kylesayrs pushed a commit to neuralmagic/vllm that referenced this pull request Aug 17, 2024
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Aug 22, 2024
@Andcircle commented

Follow-up TODO: Support initializing VLM with only language model backbone.

@ywang96 @DarkLight1337 Thanks for the update!

I checked the demo code: we still provide the two modalities separately (prompt and images), and the merge process is still controlled by the VLM implementation that vLLM supports. It is not very flexible if we want to use our own merge method.

Can we do the following, so that we just treat a customized VLM as a PURE language model?

# Start from a pure language model, NOT an existing VLM
llm = LLM(model="mistral", ...)
# merge_inputs is a customized function provided by the user; this makes the process more flexible
inputs_embeds = merge_inputs(texts, images)
# The LLM only takes a batch of merged embeddings; it no longer cares whether the input is image / video / audio, it just treats it as pure language model input
outputs = llm.generate(inputs_embeds=inputs_embeds, ...)

I saw you also mentioned:
Follow-up TODO: Support initializing VLM with only language model backbone.
Is this the same as what I described above?

Again, really appreciate your help.

@ywang96 (Member, Author) commented Aug 22, 2024

Hey @Andcircle! Thanks for reaching out!

Yes, as you mentioned, what this PR does is allow image embeddings as input, so that users can process images into embeddings separately and won't need to load the vision encoder and projector together with the language model on the same host (which is the TODO).

We do have a PR from a community member regarding allowing embeddings as input: #6869. This touches a broader aspect of the framework that we're still evaluating, but perhaps we should prioritize it given that this is such a popular request. If we support this feature, I assume you will only need to initialize the language model itself (versus initializing a VLM but only loading the language model).

Please let me know if I missed anything!

@Andcircle commented

@ywang96
Thanks for your fast response!
Yes, I think #6869 should free us.

Just to clarify, #6869's use case can be much broader =) not only for embedding tuning as stated; it can be used for VLM research in a very flexible manner.

@ywang96 (Member, Author) commented Aug 22, 2024


Yep! I totally agree with you that allowing embeddings as input for the LM is the most flexible option.

Just to be transparent, there are a few considerations behind why we haven't supported it yet; for example, how do we make prefix caching work with embeddings as input? Today we use the prompt token ids as the hash/identifier of the prompt prefix, so this feature wouldn't work at all with embeddings as input.

Anyway, perhaps we could pursue enabling it as an experimental feature and see how it goes!
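To make the hashing point concrete, here is a toy illustration of token-id-based prefix hashing (simplified; this is not vLLM's actual hashing code, and the block size is arbitrary):

import hashlib
from typing import List, Tuple

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value)

def block_hashes(token_ids: Tuple[int, ...]) -> List[str]:
    """Hash each full block of prompt token ids together with its prefix,
    so requests sharing a token-id prefix map to the same cached blocks."""
    hashes = []
    prefix = b""
    num_full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, num_full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prefix += ",".join(map(str, block)).encode()
        hashes.append(hashlib.sha256(prefix).hexdigest()[:16])
    return hashes

# Two prompts that share their first two blocks of token ids reuse those blocks:
a = block_hashes(tuple(range(48)))
b = block_hashes(tuple(range(32)) + (999,) * 16)
assert a[:2] == b[:2] and a[2] != b[2]
# With embeddings as input there are no token ids to hash, so this
# identification scheme has nothing to key on.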

@Andcircle commented


@ywang96
Thanks for your explanation!

Sorry, naive question: what do you mean by prompt prefix? The repeated wording for each prompt?

@ywang96 (Member, Author) commented Aug 22, 2024


@Andcircle Yes! For example, a long system prompt from the chat template that's shared by multiple requests. Feel free to read more here https://docs.vllm.ai/en/latest/automatic_prefix_caching/apc.html

@Andcircle commented

@ywang96
Will have a look, thanks =)

@ywang96 (Member, Author) commented Aug 28, 2024

@Isotr0py Hey, do you think it makes sense to support image embeddings for Fuyu? (Currently we cannot easily do it since the embedding creation is tied to the tokenizer.) We don't have to, but I just want to get your opinion since you added the support for it. Thanks!

@Andcircle commented

@ywang96
It would be great to support the Fuyu style.
But if we can somehow let users provide the embeddings themselves, like the pseudocode above, the embedding is already decoupled from the tokenizer, isn't it?

Also, for this highly customized pipeline, I think it's OK NOT to have system prefix caching =)

@whyiug (Contributor) commented Sep 24, 2024

@ywang96
This is a fantastic feature!
I've encountered a tricky problem.
I need to perform multiple VQA tasks (same image, different questions) using the same model architecture (e.g., Paligemma). In this case, I can use this feature to deploy image encoding and language model inference in different locations. This saves time by avoiding redundant image encoding for each language model inference.
However, how can I build an on-the-fly inference service (like a compatible server or something similar)? At this point, the solution I can think of is having the image encoding service write the encoded tensors to shared storage, while the language model inference service (vllm service) reads from the shared storage (which would involve modifying the vllm source code, building, and deploying).

@DarkLight1337 (Member) commented Sep 24, 2024


From the vLLM side, I suggest modifying vllm.multimodal.utils.get_and_parse_image and vllm.multimodal.utils.async_get_and_parse_image so that instead of the PIL image, they fetch the embeddings of the image corresponding to the given URL. There should be no need to modify the rest of the code.
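As a rough sketch of that approach (the function signatures and return types below are assumptions based on the suggestion above; verify them against the actual vllm.multimodal.utils in your version, and the shared-storage layout is entirely hypothetical):

import hashlib
import os

import torch

# Directory written by the separate image-encoding service (hypothetical).
EMBED_DIR = "/mnt/shared/image_embeds"


def _embed_path(image_url: str) -> str:
    # Assume the encoding service stores one tensor per image, keyed by a
    # hash of the image URL.
    key = hashlib.sha256(image_url.encode()).hexdigest()
    return os.path.join(EMBED_DIR, f"{key}.pt")


def get_and_parse_image(image_url: str):
    # Load the precomputed embedding instead of fetching and decoding a PIL
    # image. Returning a tensor under "image" makes the model treat it as
    # image-embedding input, per this PR.
    image_embeds = torch.load(_embed_path(image_url), map_location="cpu")
    return {"image": image_embeds}


async def async_get_and_parse_image(image_url: str):
    # Async wrapper; a real deployment might use an async file/object-store
    # client here instead.
    return get_and_parse_image(image_url)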

@whyiug (Contributor) commented Sep 24, 2024


That's exactly what I want to do.

@whyiug (Contributor) commented Sep 26, 2024

When I was about to add this feature to Qwen2-VL, I unfortunately ran into some difficulties.
For example, I can't just rely on the image embeddings to generate new prompt_token_ids without the original image. See here:

    height, width = get_image_size(image, channel_dim=input_data_format)

And here, if we just return image embeds, an error occurs: AssertionError: mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'.

Might we need to pass through more parameters for Qwen2-VL? Please give me some tips.

@DarkLight1337 (Member) commented


Can you open a new issue for this? (since it's specific to Qwen2-VL)

@whyiug (Contributor) commented Sep 26, 2024

I've created a new issue (#8857) and will be submitting the draft code.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
KuntaiDu pushed a commit to KuntaiDu/vllm that referenced this pull request Nov 20, 2024
Labels: ready (ONLY add when PR is ready to merge/full CI is needed)

Successfully merging this pull request may close these issues: [Feature]: MultiModal LLM with vector API