[Model] Initial support for LLaVA-NeXT #4199

Merged on Jun 10, 2024 (178 commits)

Commits
a26badd
Support image processor
DarkLight1337 Apr 19, 2024
adf2b94
Support image processor
DarkLight1337 Apr 19, 2024
ea4f8ed
Add LLaVA-NeXT architecture
DarkLight1337 Apr 19, 2024
1a0ecca
Convert dtype in multi modal processing
DarkLight1337 Apr 19, 2024
45b6756
Move `MultiModalData` to new subpackage `multimodal`
DarkLight1337 Apr 22, 2024
6ed8397
Add multi-modal processor registry
DarkLight1337 Apr 22, 2024
8c48208
Initialize the processor only once
DarkLight1337 Apr 22, 2024
613ec1b
Merge branch 'upstream' into mm-data-processor
DarkLight1337 Apr 22, 2024
c48a7d4
Move processor to model runner
DarkLight1337 Apr 22, 2024
3232231
Refactor registry to plugin pattern in order to support specifying du…
DarkLight1337 Apr 23, 2024
92a0283
Merge branch 'upstream' into mm-data-processor
DarkLight1337 Apr 23, 2024
5d42800
Combine prompt inputs
DarkLight1337 Apr 24, 2024
5db2c5e
Fix a bunch of tests
DarkLight1337 Apr 25, 2024
74c5905
Fix LLaVA test
DarkLight1337 Apr 25, 2024
cd8917b
Merge branch 'upstream' into llm-inputs
DarkLight1337 Apr 25, 2024
b49aba7
Fix `benchmark_latency` test
DarkLight1337 Apr 25, 2024
bfd7295
Merge branch 'upstream' into llm-inputs
DarkLight1337 Apr 25, 2024
45c7f23
Merge branch 'upstream' into llm-inputs
DarkLight1337 Apr 27, 2024
493e6ed
Merge branch 'upstream' into llm-inputs
DarkLight1337 Apr 28, 2024
df1b20b
Merge branch 'upstream' into mm-data-processor
DarkLight1337 Apr 28, 2024
20aeceb
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 1, 2024
0f46653
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 3, 2024
c4f3540
Clarify tokenizer usage
DarkLight1337 May 3, 2024
ab8182c
Rename `encode_request -> process_model_inputs`
DarkLight1337 May 3, 2024
eac33e1
Support old API in `LLM.generate`
DarkLight1337 May 3, 2024
0ff8189
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 3, 2024
9663b50
Fix import error
DarkLight1337 May 3, 2024
703d318
Add tests to ensure old API still works
DarkLight1337 May 3, 2024
19d85f9
Let all entrypoints tests be run at the same time
DarkLight1337 May 3, 2024
0cf2dbe
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 4, 2024
554e8c5
Apply formatter
DarkLight1337 May 4, 2024
0921bad
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 7, 2024
baebd99
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 7, 2024
2cc5498
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 8, 2024
dc9816f
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 8, 2024
1c50600
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 14, 2024
5759dfa
Add tests for LLM.encode and fix corresponding bugs
DarkLight1337 May 14, 2024
cc4bfb5
Apply formatter
DarkLight1337 May 14, 2024
6085b08
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 14, 2024
d5c9731
Rename `_add_requests` to `_validate_and_add_requests` to be more sim…
DarkLight1337 May 14, 2024
4f218a5
Separate `entrypoints` tests into two groups
DarkLight1337 May 14, 2024
428df48
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 14, 2024
f153450
Remove duplicate comment
DarkLight1337 May 14, 2024
a9201d0
Fix memory profiling error
DarkLight1337 May 14, 2024
ceebfa6
Fix memory usage for embedding server
DarkLight1337 May 15, 2024
7d991cd
Update embeddings API to use new inputs
DarkLight1337 May 15, 2024
0e79dfb
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 15, 2024
b867b5e
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 15, 2024
2c0d58f
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 15, 2024
26f7253
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 15, 2024
d553693
Apply formatter
DarkLight1337 May 15, 2024
48e7a4a
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 16, 2024
595654c
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 20, 2024
b6c0e29
Merge branch 'upstream' into llm-inputs
DarkLight1337 May 20, 2024
e055472
Avoid duplicate `Tensor.to` calls
DarkLight1337 May 20, 2024
3097582
Merge `llm` groups back into one by enabling gc
DarkLight1337 May 20, 2024
9fe9bed
Add test for image pixel processor
DarkLight1337 May 20, 2024
222cb90
Improve CLI args
DarkLight1337 May 20, 2024
33294d5
Rename `multi_modal_datas` parameter
DarkLight1337 May 20, 2024
31cedac
Rename `input_processor` to be more explicit
DarkLight1337 May 20, 2024
21a0218
Rename `multi_modal_data` to be more explicit
DarkLight1337 May 20, 2024
32ae773
Remove patch for LLaVA-NeXT
DarkLight1337 May 20, 2024
78450eb
Apply formatter
DarkLight1337 May 20, 2024
f4defe6
Apply multi-modal refactor to `CPUModelRunner`
DarkLight1337 May 20, 2024
c43173b
Fix multi-modal handling in `EmbeddingModelRunner`
DarkLight1337 May 20, 2024
4c8e64e
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 20, 2024
ce58b25
Move dummy image data generation to model-agnostic file
DarkLight1337 May 20, 2024
d81f9f1
Add multimodal docs
DarkLight1337 May 20, 2024
7bbd123
Improve documentation for LLM/engine
DarkLight1337 May 20, 2024
056eb61
Direct readers to the `PromptInputs` class
DarkLight1337 May 22, 2024
b3b990a
Separate `_run_engine` from `_validate_and_add_requests`
DarkLight1337 May 22, 2024
2169def
Add flag for deprecating legacy API
DarkLight1337 May 22, 2024
3dbded1
Add tests for `deprecate_kwargs`
DarkLight1337 May 22, 2024
8e20317
Apply formatter
DarkLight1337 May 22, 2024
fdccaa2
Rename attribute to be less misleading
DarkLight1337 May 22, 2024
77ee1c8
Re-enable using `'fork'` start method and improve speed by using `torch…
DarkLight1337 May 23, 2024
b1bcdd1
Simplify logic of casting request output
DarkLight1337 May 23, 2024
44b4681
Improve code readability
DarkLight1337 May 23, 2024
50343cb
Fix `multi_modal_data` being a required key
DarkLight1337 May 23, 2024
45aa420
Fix index out of range error
DarkLight1337 May 23, 2024
d4e2589
Use a flag to control whether to check output types
DarkLight1337 May 23, 2024
c07b579
Simplify flags
DarkLight1337 May 23, 2024
9d56eb0
Move output validation to a more appropriate location
DarkLight1337 May 23, 2024
bc05031
Add message to deprecation notice
DarkLight1337 May 23, 2024
95d4130
Apply formatter
DarkLight1337 May 23, 2024
cc84f65
Remove unused parameter in `_validate_and_add_requests` and fix test
DarkLight1337 May 24, 2024
6c5d4a6
Simplify code
DarkLight1337 May 25, 2024
fd2da12
Move attribute assignment outside `_init_tokenizer`
DarkLight1337 May 25, 2024
d78de94
Only emit warning once
DarkLight1337 May 25, 2024
8a86829
Simplify assignment expression
DarkLight1337 May 25, 2024
731ac0e
Place special case at the start
DarkLight1337 May 25, 2024
2d1a0bc
move API reference to under developer doc
ywang96 May 25, 2024
7b8ce2c
Fix links in docs
DarkLight1337 May 26, 2024
fff21a1
Remove unnecessary code to avoid repeated warning
DarkLight1337 May 26, 2024
82233ec
Merge branch 'llm-inputs' into mm-data-processor
DarkLight1337 May 27, 2024
797e8a5
Simplify code and fix type annotations
DarkLight1337 May 17, 2024
e10b3fc
Update docs
DarkLight1337 May 27, 2024
c6a9fcf
Use intersphinx and avoid long default values
DarkLight1337 May 27, 2024
a26e1e3
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 27, 2024
883bea4
Apply formatter
DarkLight1337 May 27, 2024
46bc1ea
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 29, 2024
d350bb3
Fix bad merge
DarkLight1337 May 29, 2024
2a166a7
Do not support multiple multimodal data in legacy API
DarkLight1337 May 29, 2024
db12c29
Reinstate whitespace
DarkLight1337 May 29, 2024
4a0a85c
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 29, 2024
6529280
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 30, 2024
dc6c5fd
Fix bad config dict
DarkLight1337 May 30, 2024
2ed2fdc
Fix tests
DarkLight1337 May 30, 2024
8d09112
Apply formatter
DarkLight1337 May 30, 2024
3fe1f61
Remove `multi_modal_data` support in legacy API
DarkLight1337 May 30, 2024
46af1ac
Add NOTE and TODO
DarkLight1337 May 30, 2024
f620a1b
Add missing type annotations
DarkLight1337 May 30, 2024
70b4165
Rename functions
DarkLight1337 May 30, 2024
87c2da4
Add NOTE
DarkLight1337 May 30, 2024
7fc620c
Fix multimodal inputs being on wrong device
DarkLight1337 May 30, 2024
cd63022
Rename `MM_REGISTRY` to be more explicit
DarkLight1337 May 30, 2024
19fea82
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 30, 2024
43f2660
fix upstream merge
ywang96 May 30, 2024
5d3a063
Merge branch 'upstream' into mm-data-processor
DarkLight1337 May 31, 2024
b6754a4
Enable passing tensor directly as image
DarkLight1337 May 31, 2024
01b0512
Add pillow to intersphinx and fix quote format
DarkLight1337 May 31, 2024
a996b34
Fix mock imports
DarkLight1337 May 31, 2024
52ed274
Trigger pipeline
DarkLight1337 May 31, 2024
559bd46
Automatically convert dtype
DarkLight1337 May 31, 2024
69c4ff6
Comment out failing test for now
DarkLight1337 May 31, 2024
960e5eb
Fix blank pages in docs
DarkLight1337 May 31, 2024
a3c6fdb
Use the module name, not package name
DarkLight1337 May 31, 2024
d78d456
Trigger pipeline
DarkLight1337 May 31, 2024
243eb90
Trigger pipeline 2
DarkLight1337 May 31, 2024
501b11c
Fix formatting [skip ci]
DarkLight1337 May 31, 2024
3d20f6d
Merge branch 'upstream' into mm-data-processor
DarkLight1337 Jun 3, 2024
680cee9
Merge branch 'upstream' into mm-data-processor
DarkLight1337 Jun 3, 2024
2f0178b
Merge branch 'mm-data-processor' into llava-next
DarkLight1337 Jun 3, 2024
dd461f3
Fix bad merge
DarkLight1337 Jun 3, 2024
91dc8a9
Fix bad merge
DarkLight1337 Jun 3, 2024
6ae4fc1
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 3, 2024
89930a4
Run LLaVA-NeXT tests in CI
DarkLight1337 Jun 4, 2024
95c0469
Simplify test specification
DarkLight1337 Jun 4, 2024
456c180
Fix unable to initialize LLaVA-NeXT model
DarkLight1337 Jun 4, 2024
411eeb3
Fix OOM when loading LLaVA-NeXT on HuggingFace
DarkLight1337 Jun 4, 2024
93384b9
Fix LLaVA-NeXT not using multimodal registry
DarkLight1337 Jun 4, 2024
3a5bf29
Improve error message
DarkLight1337 Jun 4, 2024
193daa8
Fix `image_sizes` being missing when tensor is passed directly
DarkLight1337 Jun 4, 2024
3f3eccf
Fix incorrect dummy data
DarkLight1337 Jun 4, 2024
4ca713e
Add validation for `image_sizes`
DarkLight1337 Jun 4, 2024
6b8b850
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 4, 2024
abd76a0
Fix model not being able to be split across GPUs
DarkLight1337 Jun 4, 2024
7b8a3df
Fix wrong shape
DarkLight1337 Jun 4, 2024
930aa4b
Test LLaVA-NeXT processor
DarkLight1337 Jun 4, 2024
cd60af8
Remove unnecessary `worker_use_ray`
DarkLight1337 Jun 4, 2024
d843b0b
Fix incorrect template for LLaVA(-NeXT) tests
DarkLight1337 Jun 4, 2024
cdb0699
Clean up model loading
DarkLight1337 Jun 4, 2024
7ea733a
Use a smaller LLaVA-NeXT model for testing
DarkLight1337 Jun 4, 2024
b5fbe46
Improve repr for easier debugging
DarkLight1337 Jun 4, 2024
246bf1b
Revert `device_map="auto"` since the model can fit in one GPU now
DarkLight1337 Jun 4, 2024
02f3ef5
Fix insufficient `max_model_len`
DarkLight1337 Jun 4, 2024
bc03534
Apply formatter
DarkLight1337 Jun 4, 2024
0586af9
Resize image to match the required number of tokens
DarkLight1337 Jun 4, 2024
7adcc79
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 4, 2024
0cd4e25
Remove unnecessary gc
DarkLight1337 Jun 4, 2024
52e12cb
Remove `tp>1` test as it caused ray workers to hang at the end
DarkLight1337 Jun 4, 2024
ac3162f
Add xfail
DarkLight1337 Jun 4, 2024
8032ba9
Fix broken CI template
DarkLight1337 Jun 5, 2024
556e3fd
Also xfail LLaVA-NeXT processor
DarkLight1337 Jun 5, 2024
ec24033
Disallow image features in LLaVA-NeXT
DarkLight1337 Jun 5, 2024
4d1ce23
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 6, 2024
c122409
Add reference
DarkLight1337 Jun 8, 2024
4d40449
Move input type check to initialization time
DarkLight1337 Jun 8, 2024
c748dd9
Add warning when image is resized
DarkLight1337 Jun 8, 2024
f235732
Avoid model inheritance
DarkLight1337 Jun 8, 2024
935a7f9
Apply formatter
DarkLight1337 Jun 8, 2024
f1dd1e3
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 8, 2024
e586b81
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 8, 2024
9afe5b4
Also use context manager in LLaVA-NeXT test
DarkLight1337 Jun 8, 2024
fa89a22
update supported models
ywang96 Jun 10, 2024
2df8398
Remove asterisk
DarkLight1337 Jun 10, 2024
23cb8fa
Use proper capitalization
DarkLight1337 Jun 10, 2024
1ed7bf2
Merge branch 'upstream' into llava-next
DarkLight1337 Jun 10, 2024
6 changes: 5 additions & 1 deletion docs/source/models/supported_models.rst
@@ -89,7 +89,11 @@ Alongside each architecture, we include some popular models that use it.
     - ✅︎
   * - :code:`LlavaForConditionalGeneration`
     - LLaVA-1.5
-    - :code:`llava-hf/llava-1.5-7b-hf`\*, :code:`llava-hf/llava-1.5-13b-hf`\*, etc.
+    - :code:`llava-hf/llava-1.5-7b-hf`, :code:`llava-hf/llava-1.5-13b-hf`, etc.
     -
+  * - :code:`LlavaNextForConditionalGeneration`
+    - LLaVA-NeXT
+    - :code:`llava-hf/llava-v1.6-mistral-7b-hf`, :code:`llava-hf/llava-v1.6-vicuna-7b-hf`, etc.
+    -
   * - :code:`MiniCPMForCausalLM`
     - MiniCPM
2 changes: 0 additions & 2 deletions tests/models/test_llava.py
@@ -39,8 +39,6 @@ def iter_llava_configs(model_name: str):

 model_and_vl_config = [
     *iter_llava_configs("llava-hf/llava-1.5-7b-hf"),
-    # Not enough memory
-    # *iter_llava_configs("llava-hf/llava-1.5-13b-hf"),
 ]


123 changes: 123 additions & 0 deletions tests/models/test_llava_next.py
@@ -0,0 +1,123 @@
from typing import List, Tuple

import pytest
from transformers import AutoTokenizer

from vllm.config import VisionLanguageConfig

from ..conftest import IMAGE_FILES

pytestmark = pytest.mark.llava

_PREFACE = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's "
    "questions.")

# The image token is placed before "user" on purpose so that the test can pass
HF_IMAGE_PROMPTS = [
    f"{_PREFACE} <image>\nUSER: What's the content of the image? ASSISTANT:",
    f"{_PREFACE} <image>\nUSER: What is the season? ASSISTANT:",
]

assert len(HF_IMAGE_PROMPTS) == len(IMAGE_FILES)


def iter_llava_next_configs(model_name: str):
    image_hw_to_feature_size = {
        (336, 336): 1176,
        (672, 672): 2928,
        (1344, 336): 1944,
        (336, 1344): 1890,
    }

    for (h, w), f in image_hw_to_feature_size.items():
        for input_type, input_shape in [
            (VisionLanguageConfig.ImageInputType.PIXEL_VALUES, (1, 3, h, w)),
        ]:
            yield (model_name,
                   VisionLanguageConfig(image_input_type=input_type,
                                        image_feature_size=f,
                                        image_token_id=32000,
                                        image_input_shape=input_shape,
                                        image_processor=model_name,
                                        image_processor_revision=None))


model_and_vl_config = [
    *iter_llava_next_configs("llava-hf/llava-v1.6-vicuna-7b-hf"),
]
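
The feature sizes hard-coded in `image_hw_to_feature_size` can be checked by hand. The sketch below is a plain-Python approximation of how the image token count could be derived for a given `(height, width)`. It assumes a 336-pixel vision tile with 14-pixel patches and the candidate grid resolutions listed in `GRID_PINPOINTS`; those constants and both helper names are assumptions made for illustration and are not taken from this PR.

```python
# Assumed constants (not taken from this PR): a 336px vision tile split into
# 14px patches, and the candidate (height, width) grid resolutions.
TILE = 336
PATCH = 14
GRID_PINPOINTS = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]


def select_best_resolution(image_hw, candidates):
    """Pick the candidate that keeps the most image pixels, then wastes least."""
    orig_h, orig_w = image_hw
    best, best_effective, best_wasted = None, -1, float("inf")
    for cand_h, cand_w in candidates:
        scale = min(cand_w / orig_w, cand_h / orig_h)
        down_w, down_h = int(orig_w * scale), int(orig_h * scale)
        effective = min(down_w * down_h, orig_w * orig_h)
        wasted = cand_h * cand_w - effective
        if effective > best_effective or (effective == best_effective
                                          and wasted < best_wasted):
            best = (cand_h, cand_w)
            best_effective, best_wasted = effective, wasted
    return best


def llava_next_feature_size(image_hw):
    """Hypothetical helper: count image tokens for one (height, width) image."""
    orig_h, orig_w = image_hw
    per_side = TILE // PATCH             # 24 patches along each tile side
    base_features = per_side * per_side  # 576 tokens for the base tile

    best_h, best_w = select_best_resolution(image_hw, GRID_PINPOINTS)
    cur_h = best_h // TILE * per_side    # patch rows in the tiled grid
    cur_w = best_w // TILE * per_side    # patch columns in the tiled grid

    # Drop the rows or columns that would only cover padding.
    if orig_w / orig_h > cur_w / cur_h:
        new_h = orig_h * cur_w // orig_w
        cur_h -= (cur_h - new_h) // 2 * 2
    else:
        new_w = orig_w * cur_h // orig_h
        cur_w -= (cur_w - new_w) // 2 * 2

    # One extra "newline" token is appended per remaining row.
    return base_features + cur_h * cur_w + cur_h
```

Under these assumptions the helper returns 1176, 2928, 1944 and 1890 for the four shapes in the table, which matches the values the test passes as `image_feature_size`.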


def vllm_to_hf_output(vllm_output: Tuple[List[int], str],
                      vlm_config: VisionLanguageConfig, model_id: str):
    """Sanitize vllm output to be comparable with hf output.
    The function reduces `input_ids` from 1, 32000, 32000, ..., 32000,
    x1, x2, x3 ... to 1, 32000, x1, x2, x3 ...
    It also reduces `output_str` from "<image><image>bla" to "bla".
    """
    input_ids, output_str = vllm_output
    image_token_id = vlm_config.image_token_id

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    image_token_str = tokenizer.decode(image_token_id)

    hf_input_ids = [
        input_id for idx, input_id in enumerate(input_ids)
        if input_id != image_token_id or input_ids[idx - 1] != image_token_id
    ]
    hf_output_str = output_str \
        .replace(image_token_str * vlm_config.image_feature_size, " ")

    return hf_input_ids, hf_output_str


@pytest.mark.xfail(
    reason="Inconsistent image processor being used due to lack "
    "of support for dynamic image token replacement")
@pytest.mark.parametrize("model_and_config", model_and_vl_config)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [128])
def test_models(hf_runner, vllm_runner, hf_images, vllm_images,
                model_and_config, dtype: str, max_tokens: int) -> None:
    """Inference result should be the same between hf and vllm.
    All the image fixtures for the test are under tests/images.
    For huggingface runner, we provide the PIL images as input.
    For vllm runner, we provide MultiModalData objects and corresponding
    vision language config as input.
    Note, the text input is also adjusted to abide by vllm contract.
    The text output is sanitized to be able to compare with hf.
    """
    model_id, vlm_config = model_and_config

    with hf_runner(model_id, dtype=dtype, is_vision_model=True) as hf_model:
        hf_outputs = hf_model.generate_greedy(HF_IMAGE_PROMPTS,
                                              max_tokens,
                                              images=hf_images)

    vllm_image_prompts = [
        p.replace("<image>", "<image>" * vlm_config.image_feature_size)
        for p in HF_IMAGE_PROMPTS
    ]

    with vllm_runner(
            model_id,
            dtype=dtype,
            # should be greater than image_feature_size
            max_model_len=4096,
            enforce_eager=True,
            **vlm_config.as_cli_args_dict(),
    ) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(vllm_image_prompts,
                                                  max_tokens,
                                                  images=vllm_images)

    for i in range(len(HF_IMAGE_PROMPTS)):
        hf_output_ids, hf_output_str = hf_outputs[i]
        vllm_output_ids, vllm_output_str = vllm_to_hf_output(
            vllm_outputs[i], vlm_config, model_id)
        assert hf_output_str == vllm_output_str, (
            f"Test{i}:\nHF: {hf_output_str!r}\nvLLM: {vllm_output_str!r}")
        assert hf_output_ids == vllm_output_ids, (
            f"Test{i}:\nHF: {hf_output_ids}\nvLLM: {vllm_output_ids}")
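
The test expands each `<image>` placeholder to one copy per image feature before calling vLLM, then collapses the repeated image token ids again so they can be compared with the HuggingFace output. A minimal sketch of that round trip, without a tokenizer, is shown below. The function names, the tiny `FEATURE_SIZE`, and the token ids `11, 12, 13` are made up for illustration. Unlike the test helper, this version adds an explicit `idx == 0` guard, because `input_ids[idx - 1]` refers to the last element when `idx` is 0.

```python
IMAGE_TOKEN_ID = 32000  # same id the test config uses
FEATURE_SIZE = 4        # tiny stand-in; the test uses values such as 1176


def expand_prompt(prompt: str, feature_size: int) -> str:
    """Repeat the "<image>" placeholder once per image feature."""
    return prompt.replace("<image>", "<image>" * feature_size)


def collapse_image_tokens(input_ids, image_token_id):
    """Keep only the first token of every run of image tokens."""
    return [
        tok for idx, tok in enumerate(input_ids)
        if tok != image_token_id or idx == 0
        or input_ids[idx - 1] != image_token_id
    ]


expanded = expand_prompt("<image>\nUSER: What is the season?", FEATURE_SIZE)

# Hypothetical token ids: BOS, the repeated image token, then three text tokens.
vllm_ids = [1] + [IMAGE_TOKEN_ID] * FEATURE_SIZE + [11, 12, 13]
hf_ids = collapse_image_tokens(vllm_ids, IMAGE_TOKEN_ID)
```

Here `hf_ids` ends up with a single image token between the BOS token and the text tokens, which is the shape the docstring of `vllm_to_hf_output` describes.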
62 changes: 55 additions & 7 deletions tests/multimodal/test_processor.py
@@ -1,6 +1,6 @@
 import numpy as np
 import pytest
-from transformers import CLIPImageProcessor
+from transformers import CLIPImageProcessor, LlavaNextImageProcessor

 from vllm.config import ModelConfig, VisionLanguageConfig
 from vllm.multimodal import MULTIMODAL_REGISTRY
@@ -12,7 +12,7 @@
 @pytest.mark.parametrize("dtype", ["half", "float"])
 def test_clip_image_processor(hf_images, dtype):
     MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
-    IMAGE_HEIGHT = IMAGE_WIDTH = 33
+    IMAGE_HEIGHT = IMAGE_WIDTH = 560

Review comment (Member):
Is there a specific reason why we changed this and why we changed to 560?

Reply (Member Author):
I think I originally made a typo (meant to be 336 instead of 33). But I should have made the image larger than that anyway to test whether HF does resizing in the same way.


     hf_processor = CLIPImageProcessor.from_pretrained(MODEL_NAME)
     assert isinstance(hf_processor, CLIPImageProcessor)
@@ -55,10 +55,61 @@ def test_clip_image_processor(hf_images, dtype):
             assert np.allclose(hf_arr, vllm_arr), f"Failed for key={key}"


+@pytest.mark.xfail(
+    reason="Inconsistent image processor being used due to lack "
+    "of support for dynamic image token replacement")
+@pytest.mark.parametrize("dtype", ["half", "float"])
+def test_llava_next_image_processor(hf_images, dtype):
+    MODEL_NAME = "llava-hf/llava-v1.6-34b-hf"
+    IMAGE_HEIGHT = IMAGE_WIDTH = 560
+
+    hf_processor = LlavaNextImageProcessor.from_pretrained(MODEL_NAME)
+    assert isinstance(hf_processor, LlavaNextImageProcessor)
+
+    model_config = ModelConfig(
+        model=MODEL_NAME,
+        tokenizer=MODEL_NAME,
+        tokenizer_mode="auto",
+        trust_remote_code=False,
+        seed=0,
+        dtype=dtype,
+        revision=None,
+    )
+    vlm_config = VisionLanguageConfig(
+        image_input_type=VisionLanguageConfig.ImageInputType.PIXEL_VALUES,
+        image_token_id=64000,
+        image_input_shape=(1, 3, IMAGE_HEIGHT, IMAGE_WIDTH),
+        image_feature_size=2928,
+        image_processor=MODEL_NAME,
+        image_processor_revision=None,
+    )
+
+    for image in hf_images:
+        hf_result = hf_processor.preprocess(
+            image,
+            return_tensors="pt",
+        ).to(dtype=_STR_DTYPE_TO_TORCH_DTYPE[dtype])
+        vllm_result = MULTIMODAL_REGISTRY.process_input(
+            ImagePixelData(image),
+            model_config=model_config,
+            vlm_config=vlm_config,
+        )
+
+        assert hf_result.keys() == vllm_result.keys()
+        for key, hf_tensor in hf_result.items():
+            hf_arr: np.ndarray = hf_tensor.numpy()
+            vllm_arr: np.ndarray = vllm_result[key].numpy()
+
+            assert hf_arr.shape == vllm_arr.shape, f"Failed for key={key}"
+            assert np.allclose(hf_arr, vllm_arr), f"Failed for key={key}"
+
+
+@pytest.mark.xfail(
+    reason="Example image pixels were not processed using HuggingFace")
 @pytest.mark.parametrize("dtype", ["float"])
 def test_image_pixel_types(hf_images, vllm_image_tensors, dtype):
     MODEL_NAME = "llava-hf/llava-1.5-7b-hf"
-    IMAGE_HEIGHT = IMAGE_WIDTH = 33
+    IMAGE_HEIGHT = IMAGE_WIDTH = 560

     model_config = ModelConfig(
         model=MODEL_NAME,
@@ -95,7 +146,4 @@ def test_image_pixel_types(hf_images, vllm_image_tensors, dtype):
             tensor_arr: np.ndarray = tensor_result[key].numpy()

             assert image_arr.shape == tensor_arr.shape, f"Failed for key={key}"
-
-            # The examples in PR#3042 have slightly different preprocessing from
-            # HuggingFace's LlavaProcessor, causing the test to fail.
-            # assert np.allclose(image_arr, tensor_arr), f"Failed for key={key}"
+            assert np.allclose(image_arr, tensor_arr), f"Failed for key={key}"
2 changes: 2 additions & 0 deletions vllm/model_executor/models/__init__.py
@@ -33,6 +33,8 @@
     "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
     "LlavaForConditionalGeneration":
     ("llava", "LlavaForConditionalGeneration"),
+    "LlavaNextForConditionalGeneration":
+    ("llava_next", "LlavaNextForConditionalGeneration"),
     # For decapoda-research/llama-*
     "LLaMAForCausalLM": ("llama", "LlamaForCausalLM"),
     "MistralForCausalLM": ("llama", "LlamaForCausalLM"),
18 changes: 10 additions & 8 deletions vllm/model_executor/models/llava.py
@@ -1,7 +1,7 @@
 from typing import Iterable, List, Literal, Optional, Tuple, TypedDict, Union

 import torch
-from torch import nn
+import torch.nn as nn
 # TODO(xwjiang): We should port CLIPVisionModel's code over to not depend on
 # transformers' impl.
 from transformers import CLIPVisionModel, LlavaConfig
@@ -51,10 +51,10 @@ def forward(self, image_features: torch.Tensor) -> torch.Tensor:
         return hidden_states


-def _merge_vision_embeddings(input_ids: torch.Tensor,
-                             inputs_embeds: torch.Tensor,
-                             vision_embeddings: torch.Tensor,
-                             image_token_id: int) -> torch.Tensor:
+def merge_vision_embeddings(input_ids: torch.Tensor,
+                            inputs_embeds: torch.Tensor,
+                            vision_embeddings: torch.Tensor,
+                            image_token_id: int) -> torch.Tensor:
     """In place merges in vision_embeddings with inputs_embeds."""
     mask = (input_ids == image_token_id)

@@ -151,7 +151,8 @@ def _parse_and_validate_image_input(
                 return None

             if not isinstance(pixel_values, torch.Tensor):
-                raise ValueError("Incorrect type of pixel values")
+                raise ValueError("Incorrect type of pixel values. "
+                                 f"Got type: {type(pixel_values)}")

             return LlavaImagePixelInputs(
                 type="pixel_values",
@@ -166,7 +167,8 @@ def _parse_and_validate_image_input(
                 return None

             if not isinstance(image_features, torch.Tensor):
-                raise ValueError("Incorrect type of image features")
+                raise ValueError("Incorrect type of image features. "
+                                 f"Got type: {type(image_features)}")

             return LlavaImageFeatureInputs(
                 type="image_features",
@@ -268,7 +270,7 @@ def forward(
             vision_embeddings = self._process_image_input(image_input)
             inputs_embeds = self.language_model.get_input_embeddings(input_ids)

-            inputs_embeds = _merge_vision_embeddings(
+            inputs_embeds = merge_vision_embeddings(
                 input_ids, inputs_embeds, vision_embeddings,
                 self.vision_language_config.image_token_id)
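
The visible part of `merge_vision_embeddings` builds a mask of image-token positions and, per its docstring, merges the vision embeddings into `inputs_embeds` in place; the rest of the body is not shown in this diff. The sketch below illustrates that idea with plain Python lists instead of tensors. The function name, the list-based representation, and the count check are assumptions for illustration, not the implementation from this PR.

```python
from typing import List

Vector = List[float]


def merge_vision_embeddings_sketch(input_ids: List[int],
                                   inputs_embeds: List[Vector],
                                   vision_embeddings: List[Vector],
                                   image_token_id: int) -> List[Vector]:
    """List-based stand-in: overwrite rows at image-token positions in place."""
    positions = [i for i, tok in enumerate(input_ids) if tok == image_token_id]
    # Assumed sanity check: one vision embedding per image-token position.
    if len(positions) != len(vision_embeddings):
        raise ValueError(f"Expected {len(positions)} vision embeddings, "
                         f"got {len(vision_embeddings)}")
    for pos, vec in zip(positions, vision_embeddings):
        inputs_embeds[pos] = vec
    return inputs_embeds


ids = [1, 32000, 32000, 7]
text = [[0.0, 0.0], [9.9, 9.9], [9.9, 9.9], [0.7, 0.7]]
vision = [[1.0, 1.5], [2.0, 2.5]]
merged = merge_vision_embeddings_sketch(ids, text, vision, image_token_id=32000)
```

Because the merge happens in place, `merged` and `text` are the same list afterwards, and only the two rows at the image-token positions have changed.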
