[Frontend] Support GPT-4V Chat Completions API #4200
Closed
Changes from 18 commits

Commits (25):

ce770f4: Use discriminated union in prompt parsing (DarkLight1337)
6b016bc: Fix some type errors along the way (DarkLight1337)
7620354: Some more fixes (DarkLight1337)
7c3e6d9: Apply formatter (DarkLight1337)
7bdc84e: Refactor prompt parsing so that it can be shared between Chat Complet… (DarkLight1337)
a7d1098: Make code more readable (DarkLight1337)
8b9d636: Move assertion to a more appropriate place (DarkLight1337)
c48c13a: Add code documentation (DarkLight1337)
3530362: Decompose `_validate_prompt_and_tokenize` (DarkLight1337)
b8feec9: Fix missing import due to renaming (DarkLight1337)
89d9086: Merge branch 'upstream' into openai-typing (DarkLight1337)
cc1a5b3: Fix bug when parsing array of tokens (DarkLight1337)
f9c1135: Add token array to batch completions testing (DarkLight1337)
f2e8180: Replace legacy `conint` with `Annotated` field (DarkLight1337)
797326b: Merge branch 'upstream' into openai-typing (DarkLight1337)
a26badd: Support image processor (DarkLight1337)
8f991a3: Merge branch 'mm-data-processor' into openai-gpt4v (DarkLight1337)
32aa3c7: Support GPT-4V Chat Completions API - Update VLM docs accordingly (DarkLight1337)
5e099be: Chat template usage is already documented so no need to mention it again (DarkLight1337)
6883061: Merge branch 'upstream' into openai-gpt4v (DarkLight1337)
3d376bf: Fix some merge issues (DarkLight1337)
81676b4: Update doc (DarkLight1337)
a8d4875: Code cleanup and fix wrong inputs (DarkLight1337)
57d65eb: Fix tests w.r.t. #5026 (DarkLight1337)
ddf3f06: Fix wrong number of expected tokens (DarkLight1337)
@@ -0,0 +1,118 @@

.. _vlm:

Using VLMs
==========

This document shows you how to run and serve Vision Language Models (VLMs) using vLLM.

Additional Engine Arguments
---------------------------

Apart from the :ref:`basic engine arguments <engine_args>`, VLMs additionally require the following engine arguments.

.. option:: --image-input-type {pixel_values,image_features}

    The image input type passed into vLLM. Should be one of "pixel_values" or "image_features".

.. option:: --image-token-id <id>

    Input ID for the image token.

.. option:: --image-input-shape <tuple>

    The biggest image input shape (worst for memory footprint) given an input type. Only used for vLLM's profile_run.

    For example, if the image tensor has shape :code:`(1, 3, 336, 336)`, then you should pass :code:`--image-input-shape 1,3,336,336`.

.. option:: --image-feature-size <size>

    The image feature size along the context dimension.

.. option:: --image-processor <name or path>

    Name or path of the Hugging Face image processor to use.

.. option:: --image-processor-revision <revision>

    The specific image processor version to use. It can be a branch name, a tag name, or a commit id. If unspecified, the default version is used.

.. option:: --no-image-processor

    Disables the use of the image processor, even if one is defined for the model on Hugging Face.
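For reference, the example values used below for llava-1.5 follow from its CLIP ViT-L/14-336 vision tower: a 336x336 input split into 14x14 patches gives 24 patches per side, i.e. 24 * 24 = 576 image feature tokens. A minimal sketch of that arithmetic (a hypothetical helper, not part of vLLM):

.. code-block:: python

    def llava_feature_size(image_size: int = 336, patch_size: int = 14) -> int:
        """Number of image feature tokens for a square ViT input (hypothetical helper)."""
        assert image_size % patch_size == 0
        return (image_size // patch_size) ** 2

    # Matches --image-feature-size 576 and --image-input-shape 1,3,336,336 below.
    print(llava_feature_size())  # 576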
Offline Batched Inference
-------------------------

To initialize a VLM, the aforementioned arguments must be passed to the ``LLM`` class for instantiating the engine.

.. code-block:: python

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

For now, we only support a single image per text prompt when calling ``llm.generate``. To pass an image to the model, note the following parameters:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``multi_modal_datas``: This should be an instance of ``ImagePixelData``.

.. code-block:: python

    prompt = "<image>" * 576 + (
        "\nUSER: What is the content of this image?\nASSISTANT:")

    # Load the image using PIL.Image
    image = ...

    outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))

    for o in outputs:
        generated_text = o.outputs[0].text
        print(generated_text)

A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
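To make the image-loading placeholder above concrete, here is a short continuation sketch. The file name is hypothetical, and the ``generate`` call simply repeats the keyword and ``ImagePixelData`` usage shown in the snippet above rather than an independently verified signature:

.. code-block:: python

    from PIL import Image

    # Hypothetical local file; any RGB image works.
    image = Image.open("example.jpg").convert("RGB")

    # Continues from the snippet above (reuses `llm`, `prompt` and `ImagePixelData`
    # exactly as defined there).
    outputs = llm.generate(prompt, multi_modal_datas=ImagePixelData(image))
    for o in outputs:
        print(o.outputs[0].text)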
OpenAI-Compatible Server
------------------------

We support image inputs to the OpenAI Chat API, as described in `GPT-4 with Vision <https://platform.openai.com/docs/guides/vision>`_.

Here is a simple example using the :code:`openai` package:

.. code-block:: python

    from openai import OpenAI

    openai_api_key = "EMPTY"
    openai_api_base = "http://localhost:8000/v1"

    client = OpenAI(
        api_key=openai_api_key,
        base_url=openai_api_base,
    )

    # Note that this model expects the image to come before the main text
    chat_response = client.chat.completions.create(
        model="llava-hf/llava-1.5-7b-hf",
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                    },
                },
                {"type": "text", "text": "What's in this image?"},
            ],
        }],
    )
    print("Chat response:", chat_response)

.. note::

    For now, we only support a single image per API call. Also, the ``detail`` parameter is ignored since it may not be applicable to other models.
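The example above prints the entire response object; the generated reply itself is available through the standard fields of the ``openai`` client response:

.. code-block:: python

    # Extract just the assistant's reply from the response created above.
    # (Standard openai-python response fields; nothing vLLM-specific is assumed.)
    message = chat_response.choices[0].message
    print("Assistant reply:", message.content)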
@@ -0,0 +1,11 @@

{%- for message in messages -%}
    {{ message['role'].upper() + ': ' + message['content'] }}
    {%- if (loop.last and add_generation_prompt) or not loop.last -%}
        {{- '\n' -}}
    {%- endif -%}
{%- endfor -%}


{%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
    {{- 'ASSISTANT:' -}}
{% endif %}
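The template above renders an OpenAI-style ``messages`` list into LLaVA's ``USER:``/``ASSISTANT:`` turn format. As a rough illustration (not part of the diff, and only an approximation of how vLLM applies chat templates through the tokenizer), here is a minimal sketch that renders it with the ``jinja2`` package directly:

.. code-block:: python

    from jinja2 import Template

    # Same template text as the file above, embedded as a Python string.
    CHAT_TEMPLATE = """\
    {%- for message in messages -%}
        {{ message['role'].upper() + ': ' + message['content'] }}
        {%- if (loop.last and add_generation_prompt) or not loop.last -%}
            {{- '\\n' -}}
        {%- endif -%}
    {%- endfor -%}


    {%- if add_generation_prompt and messages[-1]['role'] != 'assistant' -%}
        {{- 'ASSISTANT:' -}}
    {% endif %}"""

    messages = [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello, I am LLaVA."},
        {"role": "user", "content": "What is the capital of France?"},
    ]

    prompt = Template(CHAT_TEMPLATE).render(
        messages=messages, add_generation_prompt=True)
    print(repr(prompt))
    # 'USER: Hi\nASSISTANT: Hello, I am LLaVA.\nUSER: What is the capital of France?\nASSISTANT:'

Note that each turn, including the assistant turns discussed in the review below, is separated by a single newline rather than </s>.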
Review comment: I think there should actually be no '\n', going by the vicuna_v1 conversation style used for llava-v1.5-13b, which has sep2="</s>":
https://github.com/haotian-liu/LLaVA/blob/c121f0432da27facab705978f83c4ada465e46fd/llava/conversation.py#L242-L252

The initial prompt will look like this, and then it continues with a </s> after each assistant response, e.g.:

In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # The model will add with space; looks like it gets stripped

In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"

In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")

In [12]: conv.append_message(conv.roles[1], None)

In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"
Reply: I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.
Reply: Interesting, llama.cpp also has the same (before ASSISTANT and before the first USER after the system prompt) - I guess it is mostly compatible with the original llava style if the user messages could end with a newline during training - although the jinja template also isn't handling the system prompt (which, it seems, is added with no SYSTEM: prefix in llama.cpp and in LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the ASSISTANT answer? I guess it will have been output from the model since it's the eos token, but I'm not sure if it gets stripped at some point before making it to Jinja.
Reply: In the HuggingFace code (above), you can see that the EOS token is included in the output. However, this is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly manner.