[VLM] Support pan-and-scan for Gemma3 multi-modal processor (#14672)
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Roger Wang <ywang@roblox.com>
Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
Co-authored-by: Roger Wang <ywang@roblox.com>
docs/source/models/supported_models.md (+25 −30)
@@ -763,7 +763,7 @@ See [this page](#generative-models) for more information on how to use generative models.
   * `google/gemma-3-4b-it`, `google/gemma-3-27b-it`, etc.
   * ✅︎
   * ✅︎
-  * ✅︎\*
+  * ⚠️
 - * `GLM4VForCausalLM`<sup>^</sup>
   * GLM-4V
   * T + I
@@ -856,12 +856,12 @@ See [this page](#generative-models) for more information on how to use generative models.
   * ✅︎
   * ✅︎
 - * `PaliGemmaForConditionalGeneration`
-  * PaliGemma ⚠️, PaliGemma 2 ⚠️
+  * PaliGemma, PaliGemma 2
   * T + I<sup>E</sup>
   * `google/paligemma-3b-pt-224`, `google/paligemma-3b-mix-224`, `google/paligemma2-3b-ft-docci-448`, etc.
   *
   * ✅︎
-  * ✅︎
+  * ⚠️
 - * `Phi3VForCausalLM`
   * Phi-3-Vision, Phi-3.5-Vision
   * T + I<sup>E+</sup>
@@ -926,34 +926,15 @@ See [this page](#generative-models) for more information on how to use generative models.
 <sup>E</sup> Pre-computed embeddings can be inputted for this modality.
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 
-:::{warning}
-vLLM does not currently support PrefixLM attention mask, so our PaliGemma implementation uses regular causal attention, which causes the model output to be unstable.
-
-We may deprecate this model series in a future release.
-:::
-
-:::{note}
-`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
-:::
-
-:::{note}
-To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
-:::
-
-:::{note}
-The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
-For more details, please see: <gh-pr:4087#issuecomment-2250397630>
-:::
-
-:::{note}
-To use Qwen2.5-VL series models, you have to install Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
-:::
-
-:::{note}
+:::{important}
 To use Gemma3 series models, you have to install Hugging Face Transformers library from source via `pip install git+https://github.com/huggingface/transformers`.
 The earliest commit that supports this is [`50d3530aa04e7a7d003e6b255a98f79fd0447357`](https://github.com/huggingface/transformers/commit/50d3530aa04e7a7d003e6b255a98f79fd0447357).
 
+Pan-and-scan image pre-processing is currently supported on V0 (but not V1).
+You can enable it by passing `--mm-processor-kwargs '{"do_pan_and_scan": True}'`.
+:::
+
+:::{warning}
 Both V0 and V1 support `Gemma3ForConditionalGeneration` for text-only inputs.
 However, there are differences in how they handle text + image inputs:
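Reviewer aid (not part of the diff): the `--mm-processor-kwargs` option added above also has an offline-inference equivalent via the `mm_processor_kwargs` engine argument. A minimal sketch, assuming a V0 engine; the prompt string and image path are illustrative, so check the model's chat template for the exact image-placeholder format:

```python
# Hedged sketch: enabling pan-and-scan for Gemma 3 in offline inference.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-3-4b-it",
    # Programmatic equivalent of --mm-processor-kwargs '{"do_pan_and_scan": True}'
    mm_processor_kwargs={"do_pan_and_scan": True},
)

# A skewed-aspect-ratio image is where pan-and-scan should help most.
image = Image.open("wide_panorama.jpg")  # illustrative path
outputs = llm.generate(
    {
        # Placeholder token is an assumption; consult the chat template.
        "prompt": "<start_of_image>Describe this image.",
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```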
@@ -969,9 +950,23 @@ See [this page](#generative-models) for more information on how to use generative models.
 - Will be updated in the future to support the correct behavior
 
 This limitation exists because the model's mixed attention pattern (bidirectional for images, causal otherwise) is not yet supported by vLLM's attention backends.
+:::
+
+:::{note}
+`h2oai/h2ovl-mississippi-2b` will be available in V1 once we support backends other than FlashAttention.
+:::
 
-Additionally, vLLM's current Gemma 3 implementation does not support the pan-and-scan image pre-processing algorithm, which helps handle images with skewed aspect ratios by intelligently cropping them into multiple views.
-Without this feature, model performance may degrade when processing images that deviate significantly from square dimensions.
+:::{note}
+To use `TIGER-Lab/Mantis-8B-siglip-llama3`, you have to pass `--hf_overrides '{"architectures": ["MantisForConditionalGeneration"]}'` when running vLLM.
+:::
+
+:::{note}
+The official `openbmb/MiniCPM-V-2` doesn't work yet, so we need to use a fork (`HwwwH/MiniCPM-V-2`) for now.
+For more details, please see: <gh-pr:4087#issuecomment-2250397630>
+:::
+
+:::{warning}
+Our PaliGemma implementations have the same problem as Gemma 3 (see above) for both V0 and V1.
+:::
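For context on the mixed attention pattern referenced in the warning above (bidirectional for images, causal otherwise), here is a toy sketch, not vLLM code, of the mask the model expects. It simplifies by treating all image tokens as one image; the real model scopes bidirectional attention to tokens of the same image:

```python
# Toy illustration of Gemma 3's mixed attention mask: causal everywhere,
# but bidirectional among image tokens.
import torch


def mixed_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """is_image: [seq_len] bool, True marks image tokens.

    Returns a [seq_len, seq_len] bool mask where True means
    "query position i may attend to key position j".
    """
    n = is_image.numel()
    # Standard causal mask: position i attends to j <= i.
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    # Image tokens additionally attend to each other in both directions.
    bidir = is_image.unsqueeze(0) & is_image.unsqueeze(1)
    return causal | bidir


# Text token, a 3-token image, then another text token:
print(mixed_attention_mask(torch.tensor([False, True, True, True, False])).int())
```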
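Similarly, the Mantis note's `--hf_overrides` flag has a programmatic form. A sketch assuming the `hf_overrides` engine argument mirrors the CLI flag:

```python
# Hedged sketch: overriding the reported architecture so vLLM loads the
# Mantis implementation instead of the checkpoint's default.
from vllm import LLM

llm = LLM(
    model="TIGER-Lab/Mantis-8B-siglip-llama3",
    hf_overrides={"architectures": ["MantisForConditionalGeneration"]},
)
```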