[Bug]: Error loading microsoft/Phi-3.5-vision-instruct #7718

Closed · BabyChouSr opened this issue Aug 21, 2024 · 14 comments · Fixed by #7710
Labels: bug

@BabyChouSr

Your current environment

vLLM version: 0.5.4

🐛 Describe the bug

Repro command:

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096

Error:

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096
INFO 08-21 04:43:37 api_server.py:339] vLLM API server version 0.5.4
INFO 08-21 04:43:37 api_server.py:340] args: Namespace(model_tag='microsoft/Phi-3.5-vision-instruct', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='microsoft/Phi-3.5-vision-instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=4096, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None, dispatch_function=<function serve at 0x7206f7951750>)
WARNING 08-21 04:43:37 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 08-21 04:43:38 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='microsoft/Phi-3.5-vision-instruct', speculative_config=None, tokenizer='microsoft/Phi-3.5-vision-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=microsoft/Phi-3.5-vision-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-21 04:43:38 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 04:43:38 selector.py:54] Using XFormers backend.
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/usr/local/lib/python3.10/dist-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-21 04:43:39 model_runner.py:720] Starting to load model microsoft/Phi-3.5-vision-instruct...
INFO 08-21 04:43:39 selector.py:170] Cannot use FlashAttention-2 backend due to sliding window.
INFO 08-21 04:43:39 selector.py:54] Using XFormers backend.
INFO 08-21 04:43:40 weight_utils.py:225] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  1.84it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00,  1.35it/s]

INFO 08-21 04:43:42 model_runner.py:732] Loading model weights took 7.7498 GB
/usr/local/lib/python3.10/dist-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
Process Process-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 217, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, port)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 25, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(async_engine_args,
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 471, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 381, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 552, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 263, in __init__
    self._initialize_kv_caches()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 362, in _initialize_kv_caches
    self.model_executor.determine_num_available_blocks())
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 94, in determine_num_available_blocks
    return self.driver_worker.determine_num_available_blocks()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 179, in determine_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 940, in profile_run
    self.execute_model(model_input, kv_caches, intermediate_tensors)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 1363, in execute_model
    hidden_or_intermediate_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/phi3v.py", line 532, in forward
    inputs_embeds = merge_vision_embeddings(input_ids, inputs_embeds,
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/utils.py", line 29, in merge_vision_embeddings
    raise ValueError(
ValueError: Attempted to assign 1 x 781 = 781 image tokens to 2653 placeholders
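
For context, the check that raises here lives in merge_vision_embeddings (vllm/model_executor/models/utils.py): the number of image placeholder tokens in input_ids must equal the number of image-feature embeddings produced by the vision encoder. Below is a simplified sketch of that check, not vLLM's exact code; the function name, placeholder token id argument, and tensor shapes in the comments are assumptions for illustration.

import torch

def merge_vision_embeddings_sketch(
    input_ids: torch.Tensor,         # (num_tokens,) prompt token ids
    inputs_embeds: torch.Tensor,     # (num_tokens, hidden_size) text embeddings
    vision_embeddings: torch.Tensor, # (num_images, tokens_per_image, hidden_size)
    image_token_id: int,             # id of the image placeholder token
) -> torch.Tensor:
    mask = input_ids == image_token_id          # placeholder positions in the prompt
    num_placeholders = int(mask.sum().item())
    flat = vision_embeddings.reshape(-1, inputs_embeds.shape[-1])
    if flat.shape[0] != num_placeholders:
        # This mismatch is exactly what the traceback above reports.
        raise ValueError(
            f"Attempted to assign {flat.shape[0]} image tokens "
            f"to {num_placeholders} placeholders")
    inputs_embeds[mask] = flat                  # scatter image features into the prompt
    return inputs_embeds

In the log above, the prompt side produced 2653 placeholder positions while the vision encoder produced 1 x 781 = 781 embeddings, hence the mismatch.
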
@DarkLight1337 (Member)

Can you check out #7710 and see if it fixes your issue?

@berkecanrizai

@DarkLight1337 is this currently fixed?
I am still getting the same error with the Dockerfile.cpu in this tutorial.

@DarkLight1337 (Member)

> @DarkLight1337 is this currently fixed? I am still getting the same error with the Dockerfile.cpu in this tutorial.

Which version of vLLM are you using?
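
(To check, either of these should report the installed version:)

python -c "import vllm; print(vllm.__version__)"
pip show vllm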

@berkecanrizai commented Aug 27, 2024

> Which version of vLLM are you using?

0.5.5. I pulled from source yesterday, so I assume that is the latest available version. I also tried adding a separate RUN pip install vllm==0.5.5 to the Dockerfile to make sure it also happens in the latest release.

Text-only inference works fine for me (just text messages without any image), but I still get the following error with image inputs:

ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 64, in _log_task_completion

This also happens with microsoft/Phi-3-vision-128k-instruct, not only microsoft/Phi-3.5-vision-instruct.

@DarkLight1337 (Member)

> ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders

You may have to increase the max_model_len as multimodal tokens count towards the limit. Any excess tokens will be truncated.
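
For example, the limit can be raised on the command line (the value here is illustrative; pick one that fits the model's context window and your memory budget):

vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 8192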

@berkecanrizai

> You may have to increase the max_model_len as multimodal tokens count towards the limit. Any excess tokens will be truncated.

I tried with a larger max_model_len (80,000) as well as without limiting it, and I still get the same error.
I get this error on a CPU-only machine; it had been running without any errors on another machine with a GPU.

@berkecanrizai commented Aug 27, 2024

(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226]   File "/home/{USER_NAME}/miniforge3/envs/vllm2/lib/python3.10/site-packages/vllm-0.5.5+cpu-py3.10-linux-x86_64.egg/vllm/model_executor/models/utils.py", line 88, in merge_multimodal_embeddings
(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226]     raise ValueError(
(VllmWorkerProcess pid=234352) ERROR 08-27 10:30:28 multiproc_worker_utils.py:226] ValueError: Attempted to assign 1921 = 1921 multimodal tokens to 0 placeholders

@DarkLight1337 this is the exact error I get, both inside the Docker container and outside of it.

@DarkLight1337 (Member) commented Aug 27, 2024

@Isotr0py since you have a CPU-only environment (and also implemented this model), can you help investigate this? Thanks!

@Isotr0py (Collaborator)

Ok, I will investigate this tonight.

@berkecanrizai

A small addition, @DarkLight1337 @Isotr0py:

from vllm import LLM

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
)

Image inputs work without any issues when I use the LLM as above with llm.generate(...); however, the OpenAI-compatible server (python -m vllm.entrypoints.openai.api_server --model microsoft/Phi-3.5-vision-instruct --trust-remote-code) still fails with the error above.
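
For reference, a minimal offline-inference sketch along those lines, modeled on vLLM's bundled Phi-3-vision example; the prompt string, image asset name, and sampling settings are illustrative:

from vllm import LLM, SamplingParams
from vllm.assets.image import ImageAsset

llm = LLM(
    model="microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    max_model_len=4096,  # illustrative; adjust to your memory budget
)

# Phi-3.5-vision chat format with a single image placeholder.
prompt = "<|user|>\n<|image_1|>\nWhat is shown in this image?<|end|>\n<|assistant|>\n"
image = ImageAsset("cherry_blossom").pil_image  # test image bundled with vLLM

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)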

@DarkLight1337 (Member)

Please note that multi-image input is not yet supported by the OpenAI-compatible server. Can you provide a minimal reproducible example?

@berkecanrizai

> Can you provide a minimal reproducible example?

Sure. After starting the server with the command above, run the following:

from openai import OpenAI
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8001/v1"  # make sure this port is correct; I changed it to 8001 in the server
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            },
        ],
    }],
)
print("Chat response:", chat_response)

@Isotr0py (Collaborator)

@berkecanrizai I have created #7916 to fix this. Please take a look at this :)

@berkecanrizai

> I have created #7916 to fix this. Please take a look at this :)

Thanks, that was fast :D
