[Feature] Add vision language model support. #3042
Conversation
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
# encoding.
# Each request should have at least `image_feature_size` tokens.
if self.vision_language_config:
    max_num_seqs = min(
I currently don't understand this part, can you make the comments above clearer? (and also move them into the "if" condition) :)
rephrased. ptal.
Perhaps add a warning here since this will be "overriding" user configurations.
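A minimal sketch of what such a warning could look like, assuming the clamp is based on `max_num_batched_tokens` and `image_feature_size` as in the snippet above; the helper name and message are illustrative, not the PR's actual code:

import logging

logger = logging.getLogger(__name__)


def clamp_max_num_seqs(max_num_seqs: int, max_num_batched_tokens: int,
                       image_feature_size: int) -> int:
    """Ensure every scheduled sequence can hold one image's worth of tokens."""
    clamped = min(max_num_seqs, max_num_batched_tokens // image_feature_size)
    if clamped < max_num_seqs:
        # Warn so users know their max_num_seqs setting is being overridden.
        logger.warning(
            "Reducing max_num_seqs from %d to %d so that each sequence can "
            "fit at least %d image tokens.", max_num_seqs, clamped,
            image_feature_size)
    return clamped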
@pcmoritz Please let me know when you finish the first pass on this PR and when I can start reviewing!
This PR looks very promising! I think it would be a good idea to implement the vision tower using the vLLM primitives, so that it can:
Additionally, the other note I had is that it is somewhat hard to follow what the datatype of the image inputs should be, since they are passed around as raw torch tensors. It might be nice to make a datatype (even if it is just an alias of torch.Tensor) that makes it explicit whether the value holds pixel values or embedding values. This would make the code more readable, since this was confusing to me at first. Note that we are working on encoder-decoder (to enable Whisper) and will use a similar structure for the Whisper multimodality as you have here for Llava.
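As a purely hypothetical illustration of the kind of wrapper the reviewer is suggesting (the names below are invented, not the PR's API), thin types around torch.Tensor could make the payload explicit:

from typing import Union

import torch


class ImagePixelData:
    """Raw pixel values, e.g. shaped (batch, 3, 336, 336) for Llava-1.5."""

    def __init__(self, pixel_values: torch.Tensor) -> None:
        self.pixel_values = pixel_values


class ImageFeatureData:
    """Pre-computed image features produced by a vision encoder."""

    def __init__(self, image_features: torch.Tensor) -> None:
        self.image_features = image_features


# Code that passes images around can then state which variant it expects.
ImageInput = Union[ImagePixelData, ImageFeatureData]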
It looks good overall. I have a few suggestions:
@xwjiang2010 I executed this code
and encountered an error message at: return F.linear(input, self.weight, self.bias)
@zhuohan123 @pcmoritz
@robertgshaw2-neuralmagic I think this is great feedback, also echoed by @Pernekhan. We should do that!
Yes, we want to first get a very simple implementation of the vision tower in before we do something more advanced. We can implement the vision model with vLLM primitives as a follow-up later if it is worth the complexity (but we should do benchmarks first before we do that to ensure it will be worth the additional complexity). If there are contributions towards this effort, that would certainly speed things up (either implementations or benchmarking).
+1, let's first merge a simple version where we don't maintain the vision model code ourselves. We can optimize the performance later.
@Pernekhan
Since this is an API discussion, I think we should align ASAP.
@junior-zsy It's likely the
I guess it should be like this:
Yes, you are exactly right. Did the snippet work for you?
Yes, it worked, thank you!
tests/conftest.py
Outdated
from vllm.transformers_utils.tokenizer import get_tokenizer

_TEST_DIR = os.path.dirname(__file__)
_TEST_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "example.txt")]
_LONG_PROMPTS = [os.path.join(_TEST_DIR, "prompts", "summary.txt")]

_PIXEL_VALUES_FILES = [
    "images/stop_sign_pixel_values.pt", "images/cherry_blossom_pixel_values.pt"
Can we generate these programmatically from the .jpg files and not check them in?
Your comment makes me think I need to document which lines of code the pixel_values and image_features outputs correspond to.
As for generating these programmatically, pixel_values is easy to do, but image_features is not.
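For reference, a hedged sketch of how pixel_values could be produced from a .jpg using the Hugging Face image processor that ships with the Llava-1.5 checkpoint (file paths are placeholders; image_features would additionally require running the vision tower):

import torch
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("images/stop_sign.jpg")
# The processor resizes and normalizes to the model's expected (1, 3, 336, 336).
pixel_values = processor(image, return_tensors="pt")["pixel_values"]
torch.save(pixel_values, "images/stop_sign_pixel_values.pt")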
Added more comments under llava.py.
vllm/model_executor/models/llava.py
Outdated
hf_vision_config = config.vision_config
self.vision_language_config = vision_language_config

assert self.vision_language_config
This can't fail if the type signature above is correct :)
I would rather not rely on type hinting. I added some useful user-facing information that will show up when someone does
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
)
instead of
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)
I will be working on a PR for Llava 1.6 - ideally by the end of this week
@ywang96 Amazing!
@ywang96 I am doing a bit of a POC on Llava 1.6. There should be no major blocker other than dynamically figuring out the number of
Thank you for your great work!
Hi @xwjiang2010 Can I use my fine-tuned LLaVA on vLLM? I'm first downloading my fine-tuned model from HF, then in the LLM class I'm doing
@ywang96 thank you for the work you're doing, I can't wait to see the results!
@xwjiang2010 I am working on developing an OpenAI-compatible server for LLaVA (#3873) and have encountered a couple of points where I seek your guidance and wish to offer some suggestions.
To enhance user convenience, I propose adding a feature that automates the conversion of raw image files into image features, eliminating the need for users to manually prepare .pt files to use LLaVA with vLLM. This enhancement would align the process more closely with the OpenAI API's format, which accepts images via URL or base64-encoded local files in formats such as PNG, JPEG, WEBP, and GIF.

Thank you for considering these points. I am eager to hear your thoughts and look forward to continuing to leverage the impressive capabilities of your work.

Best regards,
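A rough sketch of what such a helper could look like (not vLLM's API; the function name is illustrative), accepting either an http(s) URL or a base64 data URL as the OpenAI format does and returning a PIL image that can then be converted to pixel values:

import base64
import io

import requests
from PIL import Image


def load_image(image_url: str) -> Image.Image:
    if image_url.startswith("data:"):
        # e.g. "data:image/jpeg;base64,<payload>"
        _, b64_payload = image_url.split(",", 1)
        data = base64.b64decode(b64_payload)
    else:
        data = requests.get(image_url, timeout=10).content
    return Image.open(io.BytesIO(data)).convert("RGB")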
You have to set the
I would like to add to your point. The current example script requires the use of S3, which is not convenient to set up. While developing support for the OpenAI image input API, I personally passed URLs to online images for testing. Perhaps the example should be modified later so that S3 is no longer required.
@alsichcan I personally agree with your point. That's why I've been taking time to think about the best way to put such a helper module in vLLM and integrate it with the current vision language model framework; this could also be the module that bridges the engine and the API server if we eventually build an image API into it as well.
@WoosukKwon I think you should close #1286 and #1751 as well, since they have been resolved by this PR.
@DarkLight1337 @alsichcan FYI - while working on adding support for Llava-Next, I realized the current design for vision models is too specific to Llava 1.5 and probably not generalizable to other multi-modal models, and there are also things missing for end-to-end inference with the API server that have been addressed in #3978. I'm working on an RFC to share some thoughts on refactoring and will send it out tomorrow.
Hey @ywang96 Can I use my fine-tuned PEFT LLaVA model with vLLM? I'm writing a notebook for Brev that I want to share with the world, but I'm stuck on this problem. Can you please help me out? Here is the fine-tuned model on Hugging Face: marksuccsmfewercoc/llava-1.5-7b-hf-ft-mix-vsft
I tried using the llava_example.py from https://docs.vllm.ai/en/latest/getting_started/examples/llava_example.html but am encountering an error. I pip installed vllm version 0.4.3. Does anyone know what the issue is?
You are using the docs for
This is very interesting work, but I have two questions that I hope the author can answer:
Vision Language Support
This PR adds vision language support to vLLM.
Mainly API changes. The core logic of vLLM is kept untouched.
The design goal is to enable all vision language models, although the POC is done using Llava-7b.
Usage
The usage looks like this:
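A minimal sketch, based on the configuration quoted earlier in the thread (the exact import paths are assumptions):

from vllm import LLM
from vllm.config import VisionLanguageConfig

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
    image_token_id=32000,
    image_input_shape=(1, 3, 336, 336),
    image_feature_size=576,
)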
Feature list
Reviewability
The PR should work end to end. I have tested it locally through test_llava.py, which is a correctness test I added that compares transformers' result and vLLM's result.
Depending on the vLLM team's preference, we can either use this PR, in which case I need some more work to fix CI failures, or I can break it down into smaller PRs to facilitate review.
Future work