
Commit 2f56cad

DarkLight1337 authored and LeiWang1999 committed
[VLM][Core] Support profiling with multiple multi-modal inputs per prompt (vllm-project#7126)
Signed-off-by: LeiWang1999 <leiwang1999@outlook.com>
1 parent e69185d commit 2f56cad

38 files changed: +573 -217 lines

docs/source/dev/input_processing/input_processing_pipeline.rst

Lines changed: 1 addition & 1 deletion
@@ -17,4 +17,4 @@ Input Processing Pipeline

 6. If the data contains multi-modal data, convert it into keyword arguments using :meth:`MULTIMODAL_REGISTRY.map_input <vllm.multimodal.MultiModalRegistry.map_input>`.

-   - For example, convert a :class:`PIL.Image.Image` input to its pixel values for a vision language model.
+   - For example, convert a :class:`PIL.Image.Image` input to its pixel values for a vision model.
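For context on that step: an input mapper turns raw multi-modal data into the keyword arguments a model forward pass consumes. A deliberately simplified sketch of the idea, using a hypothetical map_image_input helper (vLLM's real mappers delegate to the model's HF image processor; the 336-pixel resolution is illustrative):

import numpy as np
from PIL import Image

def map_image_input(image: Image.Image, size: int = 336) -> dict:
    """Resize, scale to [0, 1], and reorder to CHW with a batch axis."""
    image = image.convert("RGB").resize((size, size))
    pixels = np.asarray(image, dtype=np.float32) / 255.0  # HWC in [0, 1]
    pixels = pixels.transpose(2, 0, 1)                    # HWC -> CHW
    return {"pixel_values": pixels[None]}                 # add batch dim

kwargs = map_image_input(Image.new("RGB", (640, 480)))
print(kwargs["pixel_values"].shape)  # (1, 3, 336, 336)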

docs/source/dev/multimodal/multimodal_index.rst

Lines changed: 3 additions & 0 deletions
@@ -15,6 +15,9 @@ by following :ref:`this guide <adding_multimodal_plugin>`.

 Looking to add your own multi-modal model? Please follow the instructions listed :ref:`here <enabling_multimodal_inputs>`.

+..
+  TODO: Add usage of --limit-mm-per-prompt when multi-image input is officially supported
+
 Guides
 ++++++
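Until that documentation lands, the shape of the eventual usage can be sketched. The limit_mm_per_prompt engine argument comes from this commit; the model id below is a placeholder:

from vllm import LLM

# Allow up to two images in a single prompt, instead of the assumed
# default of one item per modality. Placeholder model id.
llm = LLM(
    model="llava-hf/llava-1.5-7b-hf",
    limit_mm_per_prompt={"image": 2},
)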

docs/source/models/enabling_multimodal_inputs.rst

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ A default mapper is available for each modality in the core vLLM library. This i
 3. Register maximum number of multi-modal tokens
 ------------------------------------------------

-For each modality type that the model accepts as input, calculate the maximum possible number of tokens
+For each modality type that the model accepts as input, calculate the maximum possible number of tokens per data instance
 and register it via :meth:`MULTIMODAL_REGISTRY.register_max_multimodal_tokens <vllm.multimodal.MultiModalRegistry.register_max_multimodal_tokens>`.

 .. code-block:: diff
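The code-block itself is not reproduced in this hunk. As a rough sketch of what such a registration might look like for a hypothetical image model (the decorator names and signatures are assumptions based on the registries referenced above, and 576 stands in for a model-specific count, e.g. a 24x24 patch grid):

import torch.nn as nn

from vllm.model_executor.models.interfaces import SupportsMultiModal
from vllm.multimodal import MULTIMODAL_REGISTRY


@MULTIMODAL_REGISTRY.register_image_input_mapper()
@MULTIMODAL_REGISTRY.register_max_multimodal_tokens("image", 576)
class YourModelForImage2Seq(nn.Module, SupportsMultiModal):
    ...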

tests/engine/test_arg_utils.py

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+import pytest
+
+from vllm.engine.arg_utils import EngineArgs
+from vllm.utils import FlexibleArgumentParser
+
+
+@pytest.mark.parametrize(("arg", "expected"), [
+    (None, None),
+    ("image=16", {
+        "image": 16
+    }),
+    ("image=16,video=2", {
+        "image": 16,
+        "video": 2
+    }),
+])
+def test_limit_mm_per_prompt_parser(arg, expected):
+    parser = EngineArgs.add_cli_args(FlexibleArgumentParser())
+    if arg is None:
+        args = parser.parse_args([])
+    else:
+        args = parser.parse_args(["--limit-mm-per-prompt", arg])
+
+    assert args.limit_mm_per_prompt == expected
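The parametrized cases pin down the flag's grammar: comma-separated modality=count pairs. A standalone sketch of that parsing rule, for readers who want the behavior without digging into vLLM's CLI plumbing (illustrative, not the engine's actual code path):

def parse_limit_mm_per_prompt(value: str) -> dict:
    """Parse 'image=16,video=2' into {'image': 16, 'video': 2}."""
    limits = {}
    for pair in value.split(","):
        modality, _, count = pair.partition("=")
        limits[modality.strip()] = int(count)
    return limits


assert parse_limit_mm_per_prompt("image=16") == {"image": 16}
assert parse_limit_mm_per_prompt("image=16,video=2") == {"image": 16, "video": 2}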

tests/models/test_blip2.py

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalData objects and corresponding
-    vision language config as input.
+    MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """

tests/models/test_fuyu.py

Lines changed: 1 addition & 1 deletion
@@ -49,7 +49,7 @@ def run_test(
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """

tests/models/test_internvl.py

Lines changed: 1 addition & 1 deletion
@@ -117,7 +117,7 @@ def run_test(
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """

tests/models/test_llava.py

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ def run_test(
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """

tests/models/test_llava_next.py

Lines changed: 1 addition & 1 deletion
@@ -177,7 +177,7 @@ def test_models(hf_runner, vllm_runner, image_assets, model, size_factors,
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """

tests/models/test_minicpmv.py

Lines changed: 3 additions & 2 deletions
@@ -61,7 +61,7 @@ def run_test(
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """
@@ -176,7 +176,7 @@ def run_multi_image_test(
     All the image fixtures for the test is under tests/images.
     For huggingface runner, we provide the PIL images as input.
     For vllm runner, we provide MultiModalDataDict objects
-    and corresponding vision language config as input.
+    and corresponding MultiModalConfig as input.
     Note, the text input is also adjusted to abide by vllm contract.
     The text output is sanitized to be able to compare with hf.
     """
@@ -197,6 +197,7 @@ def run_multi_image_test(
     with vllm_runner(model,
                      max_model_len=4096,
                      max_num_seqs=1,
+                     limit_mm_per_prompt={"image": len(images)},
                      dtype=dtype,
                      tensor_parallel_size=tensor_parallel_size,
                      distributed_executor_backend=distributed_executor_backend,
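The last hunk is the heart of the change for multi-image tests: the runner must declare how many images a single prompt may carry. Outside the test harness, the equivalent offline call might look like this sketch (model id, image paths, and prompt text are placeholders; multi-image input was still experimental at this commit):

from PIL import Image

from vllm import LLM

# A prompt carrying two images must be paired with
# limit_mm_per_prompt={"image": 2}; otherwise the engine enforces its
# assumed default of one item per modality per prompt.
llm = LLM(
    model="openbmb/MiniCPM-Llama3-V-2_5",
    max_model_len=4096,
    max_num_seqs=1,
    limit_mm_per_prompt={"image": 2},
)

images = [Image.open("first.jpg"), Image.open("second.jpg")]
outputs = llm.generate({
    # Real prompts need the model's own image placeholder tokens.
    "prompt": "Compare the two images.",
    "multi_modal_data": {"image": images},
})
print(outputs[0].outputs[0].text)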
