[Frontend] Support GPT-4V Chat Completions API #4200
Conversation
…ions API and legacy Completions API
- Also add docs for basic VLM usage
(force-pushed from f0ee7b6 to 32aa3c7)
Hello, nice work adding multimodal support! I'm trying out your PRs but running into some issues. I'm starting the OpenAI server with this (using the Dockerfile): … and then sending this request:

```python
chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
```

But I'm getting these errors:

With #3978: …

With #4200: …

So it seems like it isn't adding the image tokens correctly. Am I doing something wrong? Would appreciate it if you could point me in the right direction!

Edit: OK, it seems to be working when run outside of Docker, so it seems the problem is …
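A possible workaround sketch for the case where the container cannot fetch the image over the network (untested against this PR; it assumes the server accepts base64 data URLs the way the official GPT-4V API does, and the local file name is made up):

```python
import base64

# Hypothetical local copy of the image; encode it as a data URL instead of
# asking the server to download it over the network.
with open("boardwalk.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# `client` is the same OpenAI client used in the snippet above.
chat_response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
            },
            {"type": "text", "text": "What's in this image?"},
        ],
    }],
)
```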
Since #4335 was merged, I've noticed that the definition of ServerRunner in the tests is the same as in the test for the OpenAI API. I have moved the class to the test utilities to avoid code duplication. (Although it has only been repeated twice so far, I will add another similar test suite in #4200, which would duplicate the code a third time.) Also, I have moved the test utilities file (test_utils.py) under the tests directory (tests/utils.py), since none of its code is actually used in the main package. Note that I have added __init__.py to each test subpackage and updated the ray.init() call in the test utilities file to support the relative import of tests/utils.py.
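For illustration, a rough sketch of the resulting layout and the relative import it enables (the file names here are assumptions, not taken from the PR):

```python
# Hypothetical layout after the refactor:
#
#   tests/
#     __init__.py
#     utils.py                 <- ServerRunner now lives here
#     entrypoints/
#       __init__.py
#       test_openai_server.py  <- reuses the shared helper instead of redefining it
#
# Inside test_openai_server.py, the duplicated class definition becomes a one-line
# relative import (this is why each test subpackage needs an __init__.py):
from ..utils import ServerRunner  # noqa: F401  -- usage of the class is unchanged
```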
Hi @DarkLight1337, I found that the LLaVA model served through the GPT-4V chat completion API will always output a whitespace at the beginning, while pure-text models do not have this issue. Do you have any idea why this might happen?

```python
ChatCompletion(id='cmpl-564bda62019242aa9ce6cf3d3e955ae7', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content=' The image captures a lush green field under a blue cloudy sky. A boardwalk or wooden trail is positioned in the middle of the field, leading towards a wooded area. The trail appears to be a pathway designed for walking or recreational purposes.\n\nIn the surrounding landscape, various trees can be seen in the distance, adding to the serene atmosphere of the scene. The combination of the vibrant green grass, the blue sky, and the trail creates a picturesque outdoor setting.', role='assistant', function_call=None, tool_calls=None), stop_reason=None)], created=1715818283, model='llava-hf/llava-1.5-7b-hf', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=107, prompt_tokens=595, total_tokens=702))
```
The inputs to …
I have narrowed down the issue to this line (vllm/vllm/transformers_utils/detokenizer.py, line 115 at 52f8107). For some reason, …
After some digging, it seems that this is expected behaviour, because there is no extra whitespace after …

vLLM code:

```python
>>> from vllm import LLM
>>> llm = LLM(
...     model="llava-hf/llava-1.5-7b-hf",
...     image_input_type="pixel_values",
...     image_token_id=32000,
...     image_input_shape="1,3,336,336",
...     image_feature_size=576,
... )
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT:")]
[' 1+1 is equal to 2.']
>>> [o.outputs[0].text for o in llm.generate("USER: What is 1+1?\nASSISTANT: ")]
['1+1 is equal to 2.']
```

HuggingFace code:

```python
>>> from transformers import AutoProcessor, LlavaForConditionalGeneration
>>> llm = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT:"), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']
>>> processor.batch_decode(llm.generate(**processor(text="USER: What is 1+1?\nASSISTANT: "), max_new_tokens=15))
['<s> USER: What is 1+1?\nASSISTANT: 1+1 is equal to 2.</s>']
```

I have tried adding a space after …
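For reference, a small tokenizer-level illustration (not from the PR) of why the space ends up on the completion side when the prompt stops right after the colon:

```python
from transformers import AutoTokenizer

# With the LLaMA SentencePiece tokenizer that LLaVA uses, a space is encoded as
# part of the token that follows it (the "▁" marker) rather than as part of
# "ASSISTANT:". So when the prompt ends at the colon, the model has to emit that
# space itself as (part of) its first output token, and the incremental
# detokenizer keeps it in the completion text.
tok = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")
print(tok.tokenize("ASSISTANT:"))    # ends at the colon -- no trailing-space piece
print(tok.tokenize("ASSISTANT: 2"))  # the space travels with the answer, e.g. [..., ':', '▁', '2']
```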
```jinja
{%- for message in messages -%}
{{ message['role'].upper() + ': ' + message['content'] }}
{%- if (loop.last and add_generation_prompt) or not loop.last -%}
{{- '\n' -}}
```
I think there should actually be no '\n'. Going by the vicuna_v1 conversation style used for llava-v1.5-13b (e.g. here), which has sep2="</s>", the initial prompt will look like this:
```python
In [3]: from llava import conversation

In [4]: conv = conversation.conv_vicuna_v1.copy()

In [5]: conv.append_message(conv.roles[0], "Hi")

In [6]: conv.append_message(conv.roles[1], None)

In [7]: conv.get_prompt()
Out[7]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT:"
```
And then it continues like this, with a </s> after each assistant response:

```python
In [9]: conv.messages[-1][-1] = " Hello I am LLaVA".strip()  # The model will answer with a leading space; looks like it gets stripped (e.g. here)

In [10]: conv.get_prompt()
Out[10]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>"

In [11]: conv.append_message(conv.roles[0], "What is the capital of France?")

In [12]: conv.append_message(conv.roles[1], None)

In [13]: conv.get_prompt()
Out[13]: "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Hi ASSISTANT: Hello I am LLaVA</s>USER: What is the capital of France? ASSISTANT:"
```
I modeled the chat template according to their HF repo. Their example used a newline right before ASSISTANT.
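For reference, rendering the excerpt above makes the effect concrete (the endif/endfor and the trailing ASSISTANT: generation prompt here are assumed continuations for illustration, not the PR's actual template file):

```python
from jinja2 import Template

# Rebuild the template excerpt, completing it with an assumed endif/endfor and
# generation prompt so that it can be rendered standalone.
template = Template(
    "{%- for message in messages -%}"
    "{{ message['role'].upper() + ': ' + message['content'] }}"
    "{%- if (loop.last and add_generation_prompt) or not loop.last -%}"
    "{{- '\\n' -}}"
    "{%- endif -%}"
    "{%- endfor -%}"
    "{%- if add_generation_prompt -%}ASSISTANT:{%- endif -%}"
)

prompt = template.render(
    messages=[{"role": "user", "content": "What's in this image?"}],
    add_generation_prompt=True,
)
print(repr(prompt))  # "USER: What's in this image?\nASSISTANT:" -- the newline lands right before ASSISTANT
```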
Interesting, llama.cpp also has the same (before ASSISTANT, and before the first USER after the system prompt). I guess it is mostly compatible with the original LLaVA style if the user messages could end with a newline during training, although the jinja template also isn't handling the system prompt (which, it seems, is added with no SYSTEM: prefix in llama.cpp and the LLaVA repo's conv_vicuna_v1).

Will (and should?) </s> also get added after the assistant's answer? I guess it will have been output from the model since it's the EOS token, but I'm not sure if it gets stripped at some point before making it to jinja.
In the HuggingFace code (above), you can see that the EOS token is included in the output. However, this is removed in vLLM, presumably in favor of returning the generated text in a more user-friendly manner.
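For reference, a rough illustration (not the PR's code) of that stripping step; vLLM exposes it as the skip_special_tokens option on SamplingParams, which defaults to True:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Simulate a generation that ends with the EOS token.
ids = tok("1+1 is equal to 2.").input_ids + [tok.eos_token_id]

print(tok.decode(ids))                            # special tokens kept, e.g. '<s> 1+1 is equal to 2.</s>'
print(tok.decode(ids, skip_special_tokens=True))  # '1+1 is equal to 2.' -- what the vLLM output looks like
```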
After some offline discussion, I have handed over this feature to @ywang96, who has opened #5237. This PR is thus no longer considered a candidate for being merged and only exists as a reference. Still, I have updated this PR to no longer depend on #4028 and resolved conflicts with the current …
I have refactored the OpenAI-compatible server to support image input in OpenAI Chat Completions API, as specified by GPT-4 with Vision (GPT-4V). This enables easy access to VLMs in vLLM via the OpenAI client.
FIX #2058
FIX #3873
FIX #4205
FIX #4826
Features
- The detail parameter is not applicable to most models, so it is ignored.
- A chat template (examples/template_llava.jinja) has been added to the repository so that you do not have to come up with your own to run llava-hf/llava-1.5-7b-hf etc. (those models fail to provide the template in config.json).
- llava-hf/llava-v1.6-34b-hf: the model seems to work, but I haven't checked the input prompt specifically.

Related Contributions
This PR directly resolves issues #2058 and #3873. It also supersedes PR #3467.
This PR completes part of #3978.
This PR depends on #4197, so it is set to Draft status until that PR is merged. Afterwards, you should be able to see the real diffs made exclusively by this PR.
To avoid unnecessary resource usage, this branch is frozen (except for critical fixes) until its dependencies have all been merged.
Future Work
- Develop a more extensible framework for representing images from the OpenAI API in the text prompt.
- LlavaForConditionalGeneration models in vLLM require that the <image> token be repeated image_feature_size times. This differs from the documentation on HuggingFace, which specifies that only one <image> token needs to be provided. I think that the implementation details of duplicating <image> tokens in vLLM should be hidden from the user. … an image_openai attribute to VisionLanguageConfig to facilitate this.
- (… an images attribute alongside role and content). This transfers the burden of implementation to the maintainers of the model on HuggingFace, making it more likely that vLLM users have to implement their own template. I have created a ConversationMessage class to represent the dictionary for each message (see the sketch after this list).
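For illustration, a minimal sketch of the pieces mentioned above (the field layout of ConversationMessage and the exact prompt format are assumptions here, not the PR's actual definitions):

```python
from typing import TypedDict


class ConversationMessage(TypedDict):
    """Sketch of the per-message dictionary parsed from the OpenAI request."""
    role: str
    content: str


# The prompt-side behaviour described above: for LlavaForConditionalGeneration,
# the single <image> placeholder has to be repeated image_feature_size times
# (576 is the value used for llava-1.5-7b elsewhere in this thread).
image_feature_size = 576
message = ConversationMessage(role="user", content="What's in this image?")
prompt = (
    f"USER: {'<image>' * image_feature_size}\n"
    f"{message['content']}\n"
    "ASSISTANT:"
)
```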