[Feature] Add vision language model support. #3042
Conversation
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
# encoding.
# Each request should have at least `image_feature_size` tokens.
if self.vision_language_config:
    max_num_seqs = min(
I currently don't understand this part, can you make the comments above clearer? (and also move them into the "if" condition) :)
rephrased. ptal.
Perhaps add a warning here since this will be "overriding" user configurations.
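A minimal sketch of what such a warning could look like, assuming the clamp is based on `max_num_batched_tokens` and `image_feature_size` as in the snippet above; the helper name and message are illustrative, not the PR's actual code:

import logging

logger = logging.getLogger(__name__)


def clamp_max_num_seqs(max_num_seqs: int, max_num_batched_tokens: int,
                       image_feature_size: int) -> int:
    """Ensure every scheduled sequence can hold one image's worth of tokens."""
    clamped = min(max_num_seqs, max_num_batched_tokens // image_feature_size)
    if clamped < max_num_seqs:
        # Warn so users know their max_num_seqs setting is being overridden.
        logger.warning(
            "Reducing max_num_seqs from %d to %d so that each sequence can "
            "fit at least %d image tokens.", max_num_seqs, clamped,
            image_feature_size)
    return clamped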
@pcmoritz Please let me know when you finish the first pass on this PR and when I can start reviewing!
This PR looks very promising! I think it would be a good idea to implement the vision tower using the vLLM primitives, so that it can:
Additionally, the other note I had is that it is somewhat hard to follow what the datatype of the image inputs should be, since they are passed around as raw torch tensors. It might be nice to make a datatype (even if it is just an alias of torch.Tensor) that makes it explicit whether the value holds pixel values or embedding values. This would make the code more readable, since this was confusing to me at first. Note that we are working on encoder-decoder (to enable Whisper) and will use a similar structure for the Whisper multimodality as you have here for Llava.
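As a purely hypothetical illustration of the kind of wrapper the reviewer is suggesting (the names below are invented, not the PR's API), thin types around torch.Tensor could make the payload explicit:

from typing import Union

import torch


class ImagePixelData:
    """Raw pixel values, e.g. shaped (batch, 3, 336, 336) for Llava-1.5."""

    def __init__(self, pixel_values: torch.Tensor) -> None:
        self.pixel_values = pixel_values


class ImageFeatureData:
    """Pre-computed image features produced by a vision encoder."""

    def __init__(self, image_features: torch.Tensor) -> None:
        self.image_features = image_features


# Code that passes images around can then state which variant it expects.
ImageInput = Union[ImagePixelData, ImageFeatureData]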
It looks good overall. I have a few suggestions:
@xwjiang2010 I executed this code
and encountered an error message at: return F.linear(input, self.weight, self.bias)
@zhuohan123 @pcmoritz
@robertgshaw2-neuralmagic I think this is great feedback, also echoed by @Pernekhan. We should do that!
Yes, we want to first get a very simple implementation of the vision tower in before we do something more advanced. We can implement the vision model with vLLM primitives as a follow-up later if it is worth the complexity (but we should do benchmarks first before we do that to ensure it will be worth the additional complexity). If there are contributions towards this effort, that would certainly speed things up (either implementations or benchmarking).
+1, let's first merge a simple version where we don't maintain the vision model code ourselves. We can optimize the performance later.
@Pernekhan
Since this is an API discussion, I think we should align ASAP.
@junior-zsy It's likely the
I guess it should be like this:
Yes, you are exactly right. Did the snippet work for you?
Yes, it worked, thank you!
tests/conftest.py
Outdated
from vllm.transformers_utils.tokenizer import get_tokenizer

_TEST_DIR = os.path.dirname(__file__)
_TEST_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "example.txt")]
_LONG_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "summary.txt")]

_PIXEL_VALUES_FILES = [
    "images/stop_sign_pixel_values.pt", "images/cherry_blossom_pixel_values.pt"
Can we generate these programmatically from the .jpg files and not check them in?
Your comment makes me think I need to document which lines of code the pixel_values and image_features outputs correspond to.
As for generating these programmatically, pixel_values is easy to do, but image_features is not.
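For reference, a hedged sketch of how pixel_values could be produced from a .jpg using the Hugging Face image processor that ships with the Llava-1.5 checkpoint (file paths are placeholders; image_features would additionally require running the vision tower):

import torch
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("images/stop_sign.jpg")
# The processor resizes and normalizes to the model's expected (1, 3, 336, 336).
pixel_values = processor(image, return_tensors="pt")["pixel_values"]
torch.save(pixel_values, "images/stop_sign_pixel_values.pt")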
Added more comments under llava.py.
vllm/model_executor/models/llava.py
Outdated
hf_vision_config = config.vision_config
self.vision_language_config = vision_language_config

assert self.vision_language_config
This can't fail if the type signature above is correct :)
I would rather not rely on type hinting. I added some useful user-facing information that will show up when someone does
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
)
instead of
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)
I will be working on a PR for Llava 1.6 - ideally by the end of this week
@ywang96 Amazing!
@ywang96 I am doing a bit of a POC on Llava 1.6. There should be no major blocker other than dynamically figuring out the number of
Thank you for your great work!
Hi @xwjiang2010 Can I use my fine-tuned LLaVA on vLLM? I'm first downloading my fine-tuned model from HF, then in the LLM class I'm doing
@ywang96 thank you for the work you're doing, I can't wait to see the results!
@xwjiang2010 I am working on developing an OpenAI-compatible server for LLaVA (#3873) and have encountered a couple of points where I seek your guidance and wish to offer some suggestions.
To enhance user convenience, I propose adding a feature that automates the conversion of raw image files into image features, eliminating the need for users to manually prepare .pt files to use LLaVA with vLLM. This enhancement would align the process more closely with the OpenAI API's format, which accepts images via URL or base64-encoded local files in formats such as PNG, JPEG, WEBP, and GIF.

Thank you for considering these points. I am eager to hear your thoughts and look forward to continuing to leverage the impressive capabilities of your work.

Best regards,
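A rough sketch of what such a helper could look like (not vLLM's API; the function name is illustrative), accepting either an http(s) URL or a base64 data URL as the OpenAI format does and returning a PIL image that can then be converted to pixel values:

import base64
import io

import requests
from PIL import Image


def load_image(image_url: str) -> Image.Image:
    if image_url.startswith("data:"):
        # e.g. "data:image/jpeg;base64,<payload>"
        _, b64_payload = image_url.split(",", 1)
        data = base64.b64decode(b64_payload)
    else:
        data = requests.get(image_url, timeout=10).content
    return Image.open(io.BytesIO(data)).convert("RGB")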
You have to set the
I would like to add to your point. The current example script requires the use of S3, which is not convenient to set up. While developing support for the OpenAI image input API, I personally passed URLs to online images for testing. Perhaps the example should be modified later so that S3 is no longer required.
@alsichcan I personally agree with your point. That's why I've been taking time to think about the best way to put such a helper module in vLLM and integrate it with the current vision language model framework; this could also be the module that bridges the engine and the API server if we eventually build an image API into it as well.
@WoosukKwon I think you should close #1286 and #1751 as well, since they have been resolved by this PR.
@DarkLight1337 @alsichcan FYI - while working on adding support for Llava-Next, I realized the current design for vision models is too specific to Llava 1.5 and probably not generalizable to other multi-modal models, and there are also things missing for end-to-end inference with the API server that have been addressed in #3978. I'm working on an RFC to share some thoughts on refactoring and will send it out tomorrow.
Hey @ywang96 Can I use my fine-tuned PEFT LLaVA model with vLLM? I'm writing a notebook for Brev that I want to share with the world, but I'm stuck on this problem. Can you please help me out? Here is the fine-tuned model on Hugging Face: marksuccsmfewercoc/llava-1.5-7b-hf-ft-mix-vsft
I tried using the llava_example.py from https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html but am encountering an error. I pip installed vllm version 0.4.3. Does anyone know what the issue is?
You are using the docs for
This is very interesting work, but I have two questions that I hope the author can answer:
Vision Language Support
This PR adds vision language support to vLLM.
Mainly API changes. The core logic of vLLM is kept untouched.
The design goal is to enable all vision language models, although the POC is done using Llava-7b.
Usage
The usage looks like this:
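A minimal sketch, based on the configuration quoted earlier in the thread (the exact import paths are assumptions):

from vllm import LLM
from vllm.config import VisionLanguageConfig

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)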
Feature list
Reviewability
The PR should work end to end. I have tested it locally through test_llava.py, which is a correctness test I added that compares transformers' result and vLLM's result.
Depending on the vLLM team's preference, we can either use this PR, in which case I need some more work to fix CI failures, or I can break it down into smaller PRs to facilitate review.
Future work