This repository was archived by the owner on Oct 11, 2024. It is now read-only.

Commit 255f3ed

xwjiang2010 and ywang96 authored and committed
[vlm] Remove vision language config. (vllm-project#6089)
Signed-off-by: Xiaowei Jiang <xwjiang2010@gmail.com>
Co-authored-by: Roger Wang <ywang@roblox.com>
1 parent add900f commit 255f3ed

Note: this is a large commit, so some of the changed files are hidden by default and not shown below.

43 files changed: +372, -466 lines
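The gist of the change, distilled from the example diffs below into a before/after sketch (the LLaVA checkpoint is the one used throughout this commit):

```python
from vllm import LLM

# Before the 0.5.1 release (arguments removed by this commit):
# llm = LLM(
#     model="llava-hf/llava-1.5-7b-hf",
#     image_token_id=32000,
#     image_input_shape="1,3,336,336",
#     image_feature_size=576,
# )

# After this commit, only the model name is needed:
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
```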

docs/source/dev/multimodal/multimodal_index.rst

Lines changed: 5 additions & 0 deletions
```diff
@@ -10,8 +10,13 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
 :class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
 which allows you to pass in multi-modal input alongside text and token prompts.
 
+.. note::
+    ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
+    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
+
 By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model. <adding_a_new_multimodal_model>`.
+
 # TODO: Add more instructions on how to do that once embeddings is in.
 
 Guides
```
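For reference, the behaviour described by the new note (passing ``multi_modal_data`` alongside a text prompt) looks like the minimal sketch below. It assumes the LLaVA checkpoint used elsewhere in this commit and an illustrative local image path; keys beyond the builtin ``image`` would additionally need a plugin registered through ``vllm.multimodal.MULTIMODAL_REGISTRY``.

```python
from PIL import Image
from vllm import LLM

# Minimal sketch based on the documented usage in this commit.
# "example.jpg" is an illustrative path, not part of the diff.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")
image = Image.open("example.jpg")

outputs = llm.generate({
    "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
    # "image" is a builtin key; other keys require a registered plugin.
    "multi_modal_data": {"image": image},
})
for o in outputs:
    print(o.outputs[0].text)
```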

docs/source/models/vlm.rst

Lines changed: 39 additions & 39 deletions
```diff
@@ -8,18 +8,6 @@ vLLM provides experimental support for Vision Language Models (VLMs). This docum
 .. important::
     We are actively iterating on VLM support. Expect breaking changes to VLM usage and development in upcoming releases without prior deprecation.
 
-Engine Arguments
-----------------
-
-The following :ref:`engine arguments <engine_args>` are specific to VLMs:
-
-.. argparse::
-    :module: vllm.engine.arg_utils
-    :func: _vlm_engine_args_parser
-    :prog: -m vllm.entrypoints.openai.api_server
-    :nodefaultconst:
-
-.. important::
     Currently, the support for vision language models on vLLM has the following limitations:
 
     * Only single image input is supported per text prompt.
@@ -33,40 +21,33 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``
 
 .. code-block:: python
 
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
 
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
-    The calculation of feature size is specific to the model. For more details, please refer to
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
+    every model to perform profiling with.
 
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
+    with a more accurate profiling strategy in the future.
 
 
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:
 
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
 * ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 
-.. note::
-
-    ``multi_modal_data`` can accept keys and values beyond the builtin ones, as long as a customized plugin is registered through
-    :class:`vllm.multimodal.MULTIMODAL_REGISTRY`.
-
 .. code-block:: python
 
     # Refer to the HuggingFace repo for the correct format to use
     prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
 
     # Load the image using PIL.Image
-    image = ...
-
+    image = PIL.Image.open(...)
+
+    # Single prompt inference
     outputs = llm.generate({
         "prompt": prompt,
         "multi_modal_data": {"image": image},
@@ -75,6 +56,26 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
+
+    # Batch inference
+    image_1 = PIL.Image.open(...)
+    image_2 = PIL.Image.open(...)
+    outputs = llm.generate(
+        [
+            {
+                "prompt": "USER: <image>\nWhat is the content of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_1},
+            },
+            {
+                "prompt": "USER: <image>\nWhat's the color of this image?\nASSISTANT:",
+                "multi_modal_data": {"image": image_2},
+            }
+        ]
+    )
+
+    for o in outputs:
+        generated_text = o.outputs[0].text
+        print(generated_text)
 
 A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
 
@@ -99,18 +100,17 @@ Below is an example on how to launch the same ``llava-hf/llava-1.5-7b-hf`` with
 
     python -m vllm.entrypoints.openai.api_server \
         --model llava-hf/llava-1.5-7b-hf \
-        --image-token-id 32000 \
-        --image-input-shape 1,3,336,336 \
-        --image-feature-size 576 \
         --chat-template template_llava.jinja
 
 .. important::
-    Currently, you have to specify ``image_feature_size`` to support memory profiling.
-    To avoid OOM during runtime, you should set this to the maximum value supported by the model.
-    The calculation of feature size is specific to the model. For more details, please refer to
-    the function :code:`get_<model_name>_image_feature_size` inside the corresponding model file.
-
-    We will remove most of the vision-specific arguments in a future release as they can be inferred from the HuggingFace configuration.
+    We have removed all vision language related CLI args in the ``0.5.1`` release. **This is a breaking change**, so please update your code to follow
+    the above snippet. Specifically, ``image_feature_size`` is no longer required to be specified, and internally we will construct data structures for
+    every model to perform profiling with.
+
+    This work is still ongoing. In the meantime, we internally hardcode ``image_feature_size = 3000`` through
+    :meth:`MULTIMODAL_REGISTRY.get_num_input_tokens <vllm.multimodal.MultiModalRegistry.get_num_input_tokens>`
+    for every model to be conservative in terms of GPU memory consumption. This hardcoded value will be replaced
+    with a more accurate profiling strategy in the future.
 
 To consume the server, you can use the OpenAI client like in the example below:
 
```
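The OpenAI client example referenced on the last context line sits outside this hunk. For orientation, here is a minimal sketch of such a request, assuming the server launched as shown above is listening on the default port 8000 and using the standard OpenAI Python client; the image URL is purely illustrative.

```python
from openai import OpenAI

# Sketch only: assumes the vLLM OpenAI-compatible server started above
# is reachable at localhost:8000; the image URL is illustrative.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/some_image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```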
examples/llava_example.py

Lines changed: 1 addition & 6 deletions
```diff
@@ -10,12 +10,7 @@
 
 
 def run_llava():
-    llm = LLM(
-        model="llava-hf/llava-1.5-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        image_feature_size=576,
-    )
+    llm = LLM(model="llava-hf/llava-1.5-7b-hf")
 
     prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
 
```

examples/llava_next_example.py

Lines changed: 1 addition & 7 deletions
```diff
@@ -7,13 +7,7 @@
 
 
 def run_llava_next():
-    llm = LLM(
-        model="llava-hf/llava-v1.6-mistral-7b-hf",
-        image_token_id=32000,
-        image_input_shape="1,3,336,336",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2928,
-    )
+    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
 
     prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
     url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
```

examples/openai_vision_api_client.py

Lines changed: 0 additions & 3 deletions
```diff
@@ -3,9 +3,6 @@
 Launch the vLLM server with the following command:
 python -m vllm.entrypoints.openai.api_server \
     --model llava-hf/llava-1.5-7b-hf \
-    --image-token-id 32000 \
-    --image-input-shape 1,3,336,336 \
-    --image-feature-size 576 \
     --chat-template template_llava.jinja
 """
 import base64
```

examples/phi3v_example.py

Lines changed: 2 additions & 4 deletions
```diff
@@ -14,15 +14,13 @@ def run_phi3v():
 
     # Note: The default setting of max_num_seqs (256) and
     # max_model_len (128k) for this model may cause OOM.
+    # You may lower either to run this example on lower-end GPUs.
+
     # In this example, we override max_num_seqs to 5 while
     # keeping the original context length of 128k.
     llm = LLM(
         model=model_path,
         trust_remote_code=True,
-        image_token_id=32044,
-        image_input_shape="1,3,1008,1344",
-        # Use the maximum possible value for memory profiling
-        image_feature_size=2653,
         max_num_seqs=5,
     )
 
```
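The new comment suggests lowering ``max_num_seqs`` or ``max_model_len`` on smaller GPUs. A hedged sketch of what that could look like; the concrete numbers and the model id are illustrative assumptions, not taken from the diff:

```python
from vllm import LLM

# Illustrative values only: the example itself keeps max_num_seqs=5 and the
# full 128k context; shrink either (or both) if you hit OOM on smaller GPUs.
llm = LLM(
    model="microsoft/Phi-3-vision-128k-instruct",  # assumed checkpoint
    trust_remote_code=True,
    max_num_seqs=2,
    max_model_len=8192,
)
```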
tests/distributed/test_multimodal_broadcast.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -25,9 +25,9 @@
 model = os.environ["TEST_DIST_MODEL"]
 
 if model.startswith("llava-hf/llava"):
-    from ..models.test_llava import model_and_vl_config, run_test
+    from ..models.test_llava import models, run_test
 elif model.startswith("microsoft/Phi-3-vision"):
-    from ..models.test_phi3v import model_and_vl_config, run_test
+    from ..models.test_phi3v import models, run_test
 else:
     raise NotImplementedError(f"Unsupported model: {model}")
 
@@ -49,7 +49,7 @@ def test_models(hf_runner, vllm_runner, image_assets,
         hf_runner,
         vllm_runner,
         image_assets,
-        model_and_config=model_and_vl_config[0],
+        model=models[0],
         size_factors=[1.0],
         dtype=dtype,
         max_tokens=max_tokens,
```

tests/entrypoints/openai/test_vision.py

Lines changed: 0 additions & 6 deletions
```diff
@@ -44,12 +44,6 @@ def server(ray_ctx):
         "--max-model-len",
         "4096",
         "--enforce-eager",
-        "--image-token-id",
-        "32000",
-        "--image-input-shape",
-        "1,3,336,336",
-        "--image-feature-size",
-        "576",
         "--chat-template",
         str(LLAVA_CHAT_TEMPLATE),
     ])
```

tests/models/test_llava.py

Lines changed: 17 additions & 43 deletions
```diff
@@ -4,7 +4,6 @@
 from transformers import AutoTokenizer
 
 from tests.nm_utils.utils_skip import should_skip_test_group
-from vllm.config import VisionLanguageConfig
 from vllm.multimodal.utils import rescale_image_size
 from vllm.sequence import SampleLogprobs
 
@@ -26,49 +25,27 @@
     "USER: <image>\nWhat's in this image?\nASSISTANT:",
 })
 
+IMAGE_TOKEN_ID = 32000
 
-def iter_llava_configs(model_name: str):
-    image_hw_to_feature_size = {
-        (336, 336): 576,
-    }
-
-    for (h, w), f in image_hw_to_feature_size.items():
-        input_shape = (1, 3, h, w)
-        yield (model_name,
-               VisionLanguageConfig(image_feature_size=f,
-                                    image_token_id=32000,
-                                    image_input_shape=input_shape))
-
-
-model_and_vl_config = [
-    *iter_llava_configs("llava-hf/llava-1.5-7b-hf"),
-]
+models = ["llava-hf/llava-1.5-7b-hf"]
 
 
 def vllm_to_hf_output(vllm_output: Tuple[List[int], str,
                                          Optional[SampleLogprobs]],
-                      vlm_config: VisionLanguageConfig, model_id: str):
-    """Sanitize vllm output to be comparable with hf output.
-    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
-    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
-    It also reduces `output_str` from "<image><image>bla" to "bla".
-    """
+                      model: str):
+    """Sanitize vllm output to be comparable with hf output."""
     output_ids, output_str, out_logprobs = vllm_output
-    image_token_id = vlm_config.image_token_id
 
-    tokenizer = AutoTokenizer.from_pretrained(model_id)
-    image_token_str = tokenizer.decode(image_token_id)
+    tokenizer = AutoTokenizer.from_pretrained(model)
     eos_token_id = tokenizer.eos_token_id
 
     hf_output_ids = [
         token_id for idx, token_id in enumerate(output_ids)
-        if token_id != image_token_id or output_ids[idx - 1] != image_token_id
+        if token_id != IMAGE_TOKEN_ID or output_ids[idx - 1] != IMAGE_TOKEN_ID
     ]
 
-    hf_output_str = output_str \
-        .replace(image_token_str * vlm_config.image_feature_size, "")
-    assert hf_output_str[0] == " "
-    hf_output_str = hf_output_str[1:]
+    assert output_str[0] == " "
+    hf_output_str = output_str[1:]
     if hf_output_ids[-1] == eos_token_id:
         hf_output_str = hf_output_str + tokenizer.decode(eos_token_id)
 
@@ -79,7 +56,7 @@ def run_test(
     hf_runner: Type[HfRunner],
     vllm_runner: Type[VllmRunner],
    image_assets: _ImageAssets,
-    model_and_config: Tuple[str, VisionLanguageConfig],
+    model: str,
    *,
    size_factors: List[float],
    dtype: str,
@@ -97,7 +74,6 @@ def run_test(
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
-    model_id, vlm_config = model_and_config
    images = [asset.pil_image for asset in image_assets]

    inputs_per_image = [(
@@ -111,12 +87,11 @@
    # will hurt multiprocessing backend with fork method (the default method).

    # max_model_len should be greater than image_feature_size
-    with vllm_runner(model_id,
+    with vllm_runner(model,
                     dtype=dtype,
                     tensor_parallel_size=tensor_parallel_size,
                     distributed_executor_backend=distributed_executor_backend,
-                     enforce_eager=True,
-                     **vlm_config.as_cli_args_dict()) as vllm_model:
+                     enforce_eager=True) as vllm_model:
        vllm_outputs_per_image = [
            vllm_model.generate_greedy_logprobs(prompts,
                                                max_tokens,
@@ -125,7 +100,7 @@
            for prompts, images in inputs_per_image
        ]

-    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
+    with hf_runner(model, dtype=dtype, is_vision_model=True) as hf_model:
        hf_outputs_per_image = [
            hf_model.generate_greedy_logprobs_limit(prompts,
                                                    max_tokens,
@@ -141,15 +116,15 @@ def run_test(
        check_logprobs_close(
            outputs_0_lst=hf_outputs,
            outputs_1_lst=[
-                vllm_to_hf_output(vllm_output, vlm_config, model_id)
+                vllm_to_hf_output(vllm_output, model)
                for vllm_output in vllm_outputs
            ],
            name_0="hf",
            name_1="vllm",
        )


-@pytest.mark.parametrize("model_and_config", model_and_vl_config)
+@pytest.mark.parametrize("model", models)
 @pytest.mark.parametrize(
     "size_factors",
     [
@@ -166,14 +141,13 @@ def run_test(
 @pytest.mark.parametrize("dtype", ["half"])
 @pytest.mark.parametrize("max_tokens", [128])
 @pytest.mark.parametrize("num_logprobs", [5])
-def test_models(hf_runner, vllm_runner, image_assets, model_and_config,
-                size_factors, dtype: str, max_tokens: int,
-                num_logprobs: int) -> None:
+def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
+                dtype: str, max_tokens: int, num_logprobs: int) -> None:
     run_test(
         hf_runner,
         vllm_runner,
         image_assets,
-        model_and_config,
+        model,
         size_factors=size_factors,
         dtype=dtype,
         max_tokens=max_tokens,
```
