
Commit 3824039

DarkLight1337, WoosukKwon, and ywang96 authored
[VLM] Support pan-and-scan for Gemma3 multi-modal processor (#14672)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <ywang@roblox.com>
1 parent a73122d · commit 3824039

File tree: 9 files changed, +315 −81 lines changed

docs/source/models/supported_models.md (25 additions, 30 deletions)

@@ -763,7 +763,7 @@ See [this page](#generative-models) for more information on how to use generative models.
   * `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.
   * ✅︎
   * ✅︎
-  * ✅︎\*
+  * ⚠️
 - * `GLM4VForCausalLM`<sup>^</sup>
   * GLM-4V
   * T + I
@@ -856,12 +856,12 @@ See [this page](#generative-models) for more information on how to use generative models.
   * ✅︎
   * ✅︎
 - * `PaliGemmaForConditionalGeneration`
-  * PaliGemma ⚠️, PaliGemma 2 ⚠️
+  * PaliGemma, PaliGemma 2
   * T + I<sup>E</sup>
   * `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
   *
   * ✅︎
-  * ✅︎
+  * ⚠️
 - * `Phi3VForCausalLM`
   * Phi-3-Vision, Phi-3.5-Vision
   * T + I<sup>E+</sup>
@@ -926,34 +926,15 @@ See [this page](#generative-models) for more information on how to use generative models.
 <sup>E</sup> Pre-computed embeddings can be inputted for this modality.
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.

-:::{warning}
-vLLM does not currently support PrefixLM attention mask, so our PaliGemma implementation uses regular causal attention, which causes the model output to be unstable.
-
-We may deprecate this model series in a future release.
-:::
-
-:::{note}
-`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
-:::
-
-:::{note}
-To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
-:::
-
-:::{note}
-The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
-For more details, please see: <gh-pr:4087#issuecomment-2250397630>
-:::
-
-:::{note}
-To use Qwen2.5-VL series models, you have to install Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
-:::
-
-:::{note}
+:::{important}
 To use Gemma3 series models, you have to install Hugging Face Transformers library from source via
 `pip install git+https://github.com/huggingface/transformers`.
-The earliest commit that supports this is [`50d3530aa04e7a7d003e6b255a98f79fd0447357`](https://github.com/huggingface/transformers/commit/50d3530aa04e7a7d003e6b255a98f79fd0447357).

+Pan-and-scan image pre-processing is currently supported on V0 (but not V1).
+You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": True}'`.
+:::
+
+:::{warning}
 Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
 However, there are differences in how they handle text + image inputs:

@@ -969,9 +950,23 @@ V1 currently uses a simplified attention pattern:
 - Will be updated in the future to support the correct behavior

 This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
+:::
+
+:::{note}
+`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
+:::

-Additionally, vLLM's current Gemma 3 implementation does not support the pan-and-scan image pre-processing algorithm, which helps handle images with skewed aspect ratios by intelligently cropping them into multiple views.
-Without this feature, model performance may degrade when processing images that deviate significantly from square dimensions.
+:::{note}
+To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
+:::
+
+:::{note}
+The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
+For more details, please see: <gh-pr:4087#issuecomment-2250397630>
+:::
+
+:::{warning}
+Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
 :::

 ### Pooling Models
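The documentation change above boils down to a single processor override. As a quick reference, here is a minimal sketch of enabling it through the offline Python API (V0 engine assumed, since the note states V1 does not support it yet); when serving, the equivalent is the `--mm-processor-kwargs '{"do_pan_and_scan": True}'` flag quoted above.

from vllm import LLM

# Minimal sketch (V0 engine assumed): enable Gemma 3's pan-and-scan
# image pre-processing via the multi-modal processor kwargs.
llm = LLM(
    model="google/gemma-3-4b-it",
    mm_processor_kwargs={"do_pan_and_scan": True},
)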

examples/offline_inference/vision_language.py (8 additions, 4 deletions)

@@ -123,10 +123,14 @@ def run_gemma3(questions: list[str], modality: str):
     assert modality == "image"
     model_name = "google/gemma-3-4b-it"

-    llm = LLM(model=model_name,
-              max_model_len=2048,
-              max_num_seqs=2,
-              disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
+    llm = LLM(
+        model=model_name,
+        max_model_len=2048,
+        max_num_seqs=2,
+        # Default is False; setting it to True is not supported in V1 yet
+        mm_processor_kwargs={"do_pan_and_scan": True},
+        disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
+    )

     prompts = [("<bos><start_of_turn>user\n"
                 f"<start_of_image>{question}<end_of_turn>\n"

examples/offline_inference/vision_language_multi_image.py (8 additions, 4 deletions)

@@ -83,10 +83,14 @@ def load_deepseek_vl2(question: str, image_urls: list[str]):
 def load_gemma3(question, image_urls: list[str]) -> ModelRequestData:
     model_name = "google/gemma-3-4b-it"

-    llm = LLM(model=model_name,
-              max_model_len=8192,
-              max_num_seqs=2,
-              limit_mm_per_prompt={"image": len(image_urls)})
+    llm = LLM(
+        model=model_name,
+        max_model_len=8192,
+        max_num_seqs=2,
+        # Default is False; setting it to True is not supported in V1 yet
+        mm_processor_kwargs={"do_pan_and_scan": True},
+        limit_mm_per_prompt={"image": len(image_urls)},
+    )

     placeholders = [{"type": "image", "image": url} for url in image_urls]
     messages = [{
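The multi-image loader differs from the single-image example mainly in `limit_mm_per_prompt`, which must cover the number of images attached to one request. A brief, illustrative sketch of driving it (asset names and sampling settings are assumptions; the prompt mirrors the multi-image prompt used in the tests below):

from vllm import SamplingParams
from vllm.assets.image import ImageAsset

images = [ImageAsset("stop_sign").pil_image,
          ImageAsset("cherry_blossom").pil_image]
prompt = ("<bos><start_of_turn>user\n"
          "<start_of_image><start_of_image>Describe the two images in detail."
          "<end_of_turn>\n<start_of_turn>model\n")

# limit_mm_per_prompt={"image": 2} (set above from len(image_urls)) allows
# both images to be attached to a single request.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)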

tests/models/decoder_only/vision_language/test_models.py (18 additions, 1 deletion)

@@ -9,7 +9,7 @@

 import pytest
 from packaging.version import Version
-from transformers import AutoModelForVision2Seq
+from transformers import AutoModelForPreTraining, AutoModelForVision2Seq
 from transformers import __version__ as TRANSFORMERS_VERSION

 from vllm.platforms import current_platform
@@ -234,6 +234,23 @@
         num_logprobs=10,
         image_size_factors=[(), (0.25,), (0.25, 0.25, 0.25), (0.25, 0.2, 0.15)],
     ),
+    "gemma3": VLMTestInfo(
+        models=["google/gemma-3-4b-it"],
+        test_type=(VLMTestType.IMAGE, VLMTestType.MULTI_IMAGE),
+        prompt_formatter=lambda img_prompt: f"<bos><start_of_turn>user\n{img_prompt}<end_of_turn>\n<start_of_turn>model\n",  # noqa: E501
+        single_image_prompts=IMAGE_ASSETS.prompts({
+            "stop_sign": "<start_of_image>What's the content in the center of the image?",  # noqa: E501
+            "cherry_blossom": "<start_of_image>What is the season?",  # noqa: E501
+        }),
+        multi_image_prompt="<start_of_image><start_of_image>Describe the two images in detail.",  # noqa: E501
+        max_model_len=4096,
+        max_num_seqs=2,
+        # TODO: Use AutoModelForVision2Seq once transformers supports this
+        auto_cls=AutoModelForPreTraining,
+        dtype="bfloat16",
+        vllm_runner_kwargs={"mm_processor_kwargs": {"do_pan_and_scan": True}},
+        patch_hf_runner=model_utils.gemma3_patch_hf_runner,
+    ),
     "glm4v": VLMTestInfo(
         models=["THUDM/glm-4v-9b"],
         test_type=VLMTestType.IMAGE,

tests/models/decoder_only/vision_language/vlm_utils/model_utils.py (12 additions, 0 deletions)

@@ -304,6 +304,18 @@ def processor(*args, text="", images=None, **kwargs):
     return hf_model


+def gemma3_patch_hf_runner(hf_model: HfRunner) -> HfRunner:
+    """Patches and returns an instance of the HfRunner to use for Gemma 3."""
+    hf_processor = hf_model.processor
+
+    def processor(*args, **kwargs):
+        return hf_processor(*args, do_pan_and_scan=True, **kwargs)
+
+    hf_model.processor = processor
+
+    return hf_model
+
+
 def glm_patch_hf_runner(hf_model: HfRunner) -> HfRunner:
     """Patches and returns an instance of the HfRunner to use for GLM4."""
     hf_processor = hf_model.processor

vllm/inputs/registry.py (8 additions, 1 deletion)

@@ -348,7 +348,11 @@ def dummy_data_for_profiling(
         dummy_factory = self._get_dummy_data_factory(model_cls)
         mm_counts = mm_registry.get_mm_limits_per_prompt(model_config)
         mm_processor_kwargs = get_allowed_kwarg_only_overrides(
-            dummy_factory, overrides=model_config.mm_processor_kwargs)
+            dummy_factory,
+            overrides=model_config.mm_processor_kwargs,
+            requires_kw_only=False,
+            allow_var_kwargs=True,
+        )

         dummy_data = dummy_factory(InputContext(model_config), seq_len,
                                    _MultiModalCounts(mm_counts),
@@ -381,6 +385,7 @@ def _default_input_processor(
         self,
         ctx: InputContext,
         inputs: ProcessorInputs,
+        **kwargs: object,
     ) -> ProcessorInputs:
         """The default input processor is a no-op."""
         return inputs
@@ -447,6 +452,8 @@ def process_input(self, model_config: "ModelConfig",
             model_config.mm_processor_kwargs,
             inputs.get("mm_processor_kwargs", {}),  # type: ignore
             processor,
+            requires_kw_only=False,
+            allow_var_kwargs=True,
         )

         processed_inputs = processor(
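For intuition about the two new arguments: they relax how `mm_processor_kwargs` overrides are matched against a callable's signature, so keyword-or-positional parameters count and a `**kwargs` catch-all can accept any override; this is what lets `do_pan_and_scan` flow through to the Gemma 3 processor. The sketch below is an illustrative approximation of that filtering behavior, not the actual `get_allowed_kwarg_only_overrides` implementation, and the helper name is hypothetical.

import inspect
from typing import Any, Callable, Mapping


def filter_allowed_overrides(
    fn: Callable[..., Any],
    overrides: Mapping[str, Any],
    *,
    requires_kw_only: bool = True,
    allow_var_kwargs: bool = False,
) -> dict[str, Any]:
    """Illustrative only: keep the overrides that `fn` can actually accept."""
    params = inspect.signature(fn).parameters
    has_var_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in params.values())

    allowed_kinds = {inspect.Parameter.KEYWORD_ONLY}
    if not requires_kw_only:
        # Also accept plain `def f(..., do_pan_and_scan=False)`-style params.
        allowed_kinds.add(inspect.Parameter.POSITIONAL_OR_KEYWORD)

    return {
        name: value
        for name, value in overrides.items()
        if (name in params and params[name].kind in allowed_kinds)
        or (allow_var_kwargs and has_var_kwargs)
    }

With `requires_kw_only=False` and `allow_var_kwargs=True`, an override such as `do_pan_and_scan` is kept even when the processor declares it as an ordinary keyword argument or only absorbs it via `**kwargs`.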
