[Frontend] Support GPT-4V Chat Completions API #4200

Closed
wants to merge 25 commits

Changes from 18 commits

Commits (25)
ce770f4
Use discriminated union in prompt parsing
DarkLight1337 Apr 12, 2024
6b016bc
Fix some type errors along the way
DarkLight1337 Apr 12, 2024
7620354
Some more fixes
DarkLight1337 Apr 12, 2024
7c3e6d9
Apply formatter
DarkLight1337 Apr 12, 2024
7bdc84e
Refactor prompt parsing so that it can be shared between Chat Complet…
DarkLight1337 Apr 12, 2024
a7d1098
Make code more readable
DarkLight1337 Apr 12, 2024
8b9d636
Move assertion to a more appropriate place
DarkLight1337 Apr 12, 2024
c48c13a
Add code documentation
DarkLight1337 Apr 12, 2024
3530362
Decompose `_validate_prompt_and_tokenize`
DarkLight1337 Apr 12, 2024
b8feec9
Fix missing import due to renaming
DarkLight1337 Apr 12, 2024
89d9086
Merge branch 'upstream' into openai-typing
DarkLight1337 Apr 13, 2024
cc1a5b3
Fix bug when parsing array of tokens
DarkLight1337 Apr 13, 2024
f9c1135
Add token array to batch completions testing
DarkLight1337 Apr 13, 2024
f2e8180
Replace legacy `conint` with `Annotated` field
DarkLight1337 Apr 14, 2024
797326b
Merge branch 'upstream' into openai-typing
DarkLight1337 Apr 19, 2024
a26badd
Support image processor
DarkLight1337 Apr 19, 2024
8f991a3
Merge branch 'mm-data-processor' into openai-gpt4v
DarkLight1337 Apr 19, 2024
32aa3c7
Support GPT-4V Chat Completions API - Update VLM docs accordingly
DarkLight1337 Apr 19, 2024
5e099be
Chat template usage is already documented so no need to mention it again
DarkLight1337 May 8, 2024
6883061
Merge branch 'upstream' into openai-gpt4v
DarkLight1337 Jun 4, 2024
3d376bf
Fix some merge issues
DarkLight1337 Jun 4, 2024
81676b4
Update doc
DarkLight1337 Jun 4, 2024
a8d4875
Code cleanup and fix wrong inputs
DarkLight1337 Jun 4, 2024
57d65eb
Fix tests w.r.t. #5026
DarkLight1337 Jun 4, 2024
ddf3f06
Fix wrong number of expected tokens
DarkLight1337 Jun 4, 2024
3 changes: 3 additions & 0 deletions .buildkite/test-pipeline.yaml
@@ -82,6 +82,9 @@ steps:
- label: LogitsProcessor Test
command: pytest -v -s test_logits_processor.py

- label: Utils Test
command: pytest -v -s test_utils.py

- label: Worker Test
command: pytest -v -s worker

1 change: 1 addition & 0 deletions README.md
@@ -70,6 +70,7 @@ vLLM seamlessly supports many Hugging Face models, including the following archi
- InternLM2 (`internlm/internlm2-7b`, `internlm/internlm2-chat-7b`, etc.)
- Jais (`core42/jais-13b`, `core42/jais-13b-chat`, `core42/jais-30b-v3`, `core42/jais-30b-chat-v3`, etc.)
- LLaMA, Llama 2, and Meta Llama 3 (`meta-llama/Meta-Llama-3-8B-Instruct`, `meta-llama/Meta-Llama-3-70B-Instruct`, `meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- LLavA-1.5 (`llava-hf/llava-1.5-7b-hf`, `llava-hf/llava-1.5-13b-hf`, etc.)
- MiniCPM (`openbmb/MiniCPM-2B-sft-bf16`, `openbmb/MiniCPM-2B-dpo-bf16`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- Mixtral (`mistralai/Mixtral-8x7B-v0.1`, `mistralai/Mixtral-8x7B-Instruct-v0.1`, `mistral-community/Mixtral-8x22B-v0.1`, etc.)
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -85,6 +85,7 @@ Documentation
models/adding_model
models/engine_args
models/lora
models/vlm

.. toctree::
:maxdepth: 1
18 changes: 18 additions & 0 deletions docs/source/models/supported_models.rst
@@ -83,6 +83,24 @@ Alongside each architecture, we include some popular models that use it.
- LLaMA, Llama 2, Meta Llama 3, Vicuna, Alpaca, Yi
- :code:`meta-llama/Meta-Llama-3-8B-Instruct`, :code:`meta-llama/Meta-Llama-3-70B-Instruct`, :code:`meta-llama/Llama-2-13b-hf`, :code:`meta-llama/Llama-2-70b-hf`, :code:`openlm-research/open_llama_13b`, :code:`lmsys/vicuna-13b-v1.3`, :code:`01-ai/Yi-6B`, :code:`01-ai/Yi-34B`, etc.
- ✅︎
* - :code:`LlavaForConditionalGeneration`
- LLaVA-1.5
- :code:`llava-hf/llava-1.5-7b-hf`\*, :code:`llava-hf/llava-1.5-13b-hf`\*, etc.
-

.. note::

Models with an asterisk (\*) are missing a :code:`chat_template` in their HuggingFace :code:`tokenizer_config.json`. A predefined template can be found in our repo (:code:`examples/template_llava.jinja`). To host the OpenAI-compatible server, pass the chat template explicitly via the :code:`--chat-template` argument. You also need to provide the :code:`VisionLanguageConfig` arguments to initialize the model. See the following example:

.. code-block:: shell

$ python -m vllm.entrypoints.openai.api_server \
--model llava-hf/llava-1.5-7b-hf \
--chat-template examples/template_llava.jinja \
--image-input-type pixel_values \
--image-token-id 32000 \
--image-input-shape 1,3,336,336 \
--image-feature-size 576
* - :code:`MiniCPMForCausalLM`
- MiniCPM
- :code:`openbmb/MiniCPM-2B-sft-bf16`, :code:`openbmb/MiniCPM-2B-dpo-bf16`, etc.
118 changes: 118 additions & 0 deletions docs/source/models/vlm.rst
@@ -0,0 +1,118 @@
.. _vlm:

Using VLMs
==========

This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.

Additional Engine Arguments
---------------------------

Apart from the :ref:`basic engine arguments <engine_args>`, VLMs require the following additional engine arguments.

.. option:: --image-input-type {pixel_values,image_features}

The image input type passed into vLLM. Should be one of "pixel_values" or "image_features".

.. option:: --image-token-id <id>

Input ID for image token.

.. option:: --image-input-shape <tuple>

The largest image input shape (the worst case for memory footprint) for the given input type. This is only used for vLLM's :code:`profile_run`.

For example, if the image tensor has shape :code:`(1, 3, 336, 336)`, then you should pass :code:`--image-input-shape 1,3,336,336`. See the sketch after this list for one way to derive suitable values.

.. option:: --image-feature-size <size>

The image feature size along the context dimension.

.. option:: --image-processor <name or path>

Name or path of the HuggingFace image processor to use.

.. option:: --image-processor-revision <revision>

The specific image processor version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.

.. option:: --no-image-processor

Disables the use of the image processor, even if one is defined for the model on HuggingFace.
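
The values for :code:`--image-input-shape` and :code:`--image-feature-size` can usually be derived from the model's HuggingFace configuration. The snippet below is a minimal sketch (not part of vLLM itself) that assumes a CLIP-style ViT vision tower, as used by LLaVA-1.5:

.. code-block:: python

from transformers import AutoConfig, AutoImageProcessor

model_id = "llava-hf/llava-1.5-7b-hf"

# The image processor's crop size gives the spatial dimensions of the input tensor.
processor = AutoImageProcessor.from_pretrained(model_id)
height, width = processor.crop_size["height"], processor.crop_size["width"]
print(f"--image-input-shape 1,3,{height},{width}")  # 1,3,336,336

# The number of image feature tokens equals the number of ViT patches.
config = AutoConfig.from_pretrained(model_id)
patches_per_side = config.vision_config.image_size // config.vision_config.patch_size
print(f"--image-feature-size {patches_per_side ** 2}")  # 576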

Offline Batched Inference
-------------------------

To initialize a VLM, the above engine arguments must be passed to the ``LLM`` class when instantiating the engine.

.. code-block:: python

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)

For now, we only support a single image per text prompt when calling ``llm.generate``. To pass an image to the model, note the following parameters:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``multi_modal_datas``: This should be an instance of ``ImagePixelData``.

.. code-block:: python

prompt = "<image>" * 576 + (
    "\nUSER: What is the content of this image?\nASSISTANT:")

# Load the image using PIL.Image
image = ...

outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
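
Putting the pieces together, a self-contained sketch might look like the following (the image path is illustrative):

.. code-block:: python

from PIL import Image

from vllm import LLM
from vllm.sequence import ImagePixelData

llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    image_input_type="pixel_values",
    image_token_id=32000,
    image_input_shape="1,3,336,336",
    image_feature_size=576,
)

# The prompt must contain `image_feature_size` copies of the image token.
prompt = "<image>" * 576 + (
    "\nUSER: What is the content of this image?\nASSISTANT:")

image = Image.open("stop_sign.jpg")  # illustrative local image path

outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))
for o in outputs:
    print(o.outputs[0].text)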

OpenAI-Compatible Server
------------------------

We support image inputs to the OpenAI Chat API, as described in `GPT-4 with Vision <https://platform.openai.com/docs/guides/vision>`_.

Here is a simple example using the :code:`openai` package:

.. code-block:: python

from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Note that this model expects the image to come before the main text
chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
print("Chat response:", chat_response)
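
The response object follows the standard OpenAI chat completion schema, so the reply text itself can be read with the usual accessor:

.. code-block:: python

# Print only the generated reply rather than the full response object.
print(chat_response.choices[0].message.content)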

.. note::

For now, we only support a single image per API call. Also, the ``detail`` parameter is ignored since it may not be applicable to other models.
15 changes: 6 additions & 9 deletions examples/llava_example.py
@@ -3,9 +3,10 @@
import subprocess

import torch
from PIL import Image

from vllm import LLM
from vllm.sequence import MultiModalData
from vllm.sequence import ImageFeatureData, ImagePixelData

# The assets are located at `s3://air-example-data-2/vllm_opensource_llava/`.

@@ -23,11 +24,9 @@ def run_llava_pixel_values():
"\nUSER: What is the content of this image?\nASSISTANT:")

# This should be provided by another online or offline component.
images = torch.load("images/stop_sign_pixel_values.pt")
image = Image.open("images/stop_sign.jpg")

outputs = llm.generate(prompt,
multi_modal_data=MultiModalData(
type=MultiModalData.Type.IMAGE, data=images))
outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
@@ -46,11 +45,9 @@ def run_llava_image_features():
"\nUSER: What is the content of this image?\nASSISTANT:")

# This should be provided by another online or offline component.
images = torch.load("images/stop_sign_image_features.pt")
image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")

outputs = llm.generate(prompt,
multi_modal_data=MultiModalData(
type=MultiModalData.Type.IMAGE, data=images))
outputs = llm.generate(prompt, multi_modal_datas=ImageFeatureData(image))
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
11 changes: 11 additions & 0 deletions examples/template_llava.jinja
@@ -0,0 +1,11 @@
{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}
@jamt9000 commented on May 20, 2024:

I think there should actually be no '\n'.

Going by the vicuna_v1 conversation style used for llava-v1.5-13b (e.g. here), which has sep2="</s>":

https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py#L242-L252

The initial prompt will look like this:

In [3]: from llava import conversation

In [4]: conv = conversation.conv_vicuna_v1.copy()

In [5]: conv.append_message(conv.roles[0], "Hi")

In [6]: conv.append_message(conv.roles[1], None)

In [7]: conv.get_prompt()
Out[7]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT:"

And then continues like this with a </s> after each assistant response.

In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # The model's reply starts with a space, which looks like it gets stripped, e.g. here
In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"
In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")
In [12]: conv.append_message(conv.roles[1], None)
In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"

@DarkLight1337 (Member, Author) replied:

I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.

@jamt9000 replied:

Interesting, llama.cpp also has the same newline (before ASSISTANT and before the first USER after the system prompt). I guess it is mostly compatible with the original LLaVA style if the user messages could end with a newline during training, although the Jinja template also isn't handling the system prompt (which seems to be added with no SYSTEM: prefix in llama.cpp and the LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the ASSISTANT answer? I guess it will have been output from the model since it's the EOS token, but I'm not sure whether it gets stripped at some point before making it to Jinja.

@DarkLight1337 (Member, Author) replied:

In the HuggingFace code (above), you can see that the EOS token is included in the output. However, this is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly manner.

{%- endif -%}
{%- endfor -%}


{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
{{- 'ASSISTANT:' -}}
{% endif %}
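
To make the rendering behaviour discussed in the thread above concrete, the template can be rendered directly with jinja2. This is an illustrative sketch, not part of this PR, and the message content is made up:

# Render examples/template_llava.jinja to inspect the resulting prompt string.
# vLLM's server applies chat templates through the tokenizer, so this is only a
# stand-alone illustration of the formatting.
from jinja2 import Template

with open("examples/template_llava.jinja") as f:
    template = Template(f.read())

messages = [{"role": "user", "content": "What's in this image?"}]
prompt = template.render(messages=messages, add_generation_prompt=True)

print(repr(prompt))
# "USER: What's in this image?\nASSISTANT:"
# i.e. a single newline separates the last user turn from the generation prompt.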
6 changes: 5 additions & 1 deletion requirements-common.txt
@@ -9,10 +9,14 @@ transformers >= 4.40.0 # Required for StarCoder2 & Llava, Llama 3.
tokenizers >= 0.19.1 # Required for Llama 3.
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
prometheus_client >= 0.18.0
tiktoken == 0.6.0 # Required for DBRX tokenizer
lm-format-enforcer == 0.9.3
outlines == 0.0.34 # Requires torch >= 2.1.0
typing_extensions
filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4

# OpenAI server
openai
pydantic >= 2.0
pillow
4 changes: 0 additions & 4 deletions requirements-dev.txt
@@ -21,7 +21,6 @@ pytest-rerunfailures
pytest-shard
httpx
einops # required for MPT
openai
requests
ray
peft
@@ -30,6 +29,3 @@ ai2-olmo # required for OLMo

# Benchmarking
aiohttp

# Multimodal
pillow