
[Bug]: Phi-4-Multimodal Attribute error LoRALRUCache #16569

@lhcavalcanti

Description

🐛 Describe the bug

vLLM Engine Version: 0.8.3

Parameters:

{"tensor_parallel_size": 1, "limit_mm_per_prompt": {"image": 2}, "max_seq_len_to_capture": 131072, "enable_lora": true, "max_loras":1, "max_lora_rank": 320, "lora_modules": {"vision": "vision-lora"}}
Attribute error in LoRALRUCache:

```
INFO 04-14 05:43:09 [__init__.py:239] Automatically detected platform cuda.
INFO 04-14 05:43:10 [config.py:209] Replacing legacy 'type' key with 'rope_type'
INFO 04-14 05:43:17 [config.py:600] This model supports multiple tasks: {'classify', 'generate', 'score', 'reward', 'embed'}. Defaulting to 'generate'.
WARNING 04-14 05:43:17 [arg_utils.py:1708] ['Phi4MMForCausalLM'] is not supported by the V1 Engine. Falling back to V0.
WARNING 04-14 05:43:17 [arg_utils.py:1581] The model has a long context length (131072). This may cause OOM during the initial memory profiling phase, or result in low performance due to small KV cache size. Consider setting --max-model-len to a smaller value.
INFO 04-14 05:43:17 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='/models/Phi-4-multimodal-instruct', speculative_config=None, tokenizer='/models/Phi-4-multimodal-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=default-model, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
INFO 04-14 05:43:18 [cuda.py:292] Using Flash Attention backend.
INFO 04-14 05:43:19 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-14 05:43:19 [model_runner.py:1110] Starting to load model /models/Phi-4-multimodal-instruct...
INFO 04-14 05:43:19 [cuda.py:266] Cannot use FlashAttention-2 backend for head size 72.
INFO 04-14 05:43:19 [cuda.py:289] Using XFormers backend.
INFO 04-14 05:43:22 [model_runner.py:1146] Model loading took 9.7031 GiB and 2.520181 seconds
[rank0]: Traceback (most recent call last):
[rank0]:   File "/code/score.py", line 260, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 680, in from_engine_args
[rank0]:     return async_engine_cls.from_vllm_config(
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 653, in from_vllm_config
[rank0]:     return cls(
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 608, in __init__
[rank0]:     self.engine = self._engine_class(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 284, in __init__
[rank0]:     self._initialize_kv_caches()
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 433, in _initialize_kv_caches
[rank0]:     self.model_executor.determine_num_available_blocks())
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/executor/executor_base.py", line 103, in determine_num_available_blocks
[rank0]:     results = self.collective_rpc("determine_num_available_blocks")
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]:     answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/utils.py", line 2347, in run_method
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/worker/worker.py", line 229, in determine_num_available_blocks
[rank0]:     self.model_runner.profile_run()
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1243, in profile_run
[rank0]:     self._dummy_run(max_num_batched_tokens, max_num_seqs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1369, in _dummy_run
[rank0]:     self.execute_model(model_input, kv_caches, intermediate_tensors)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1697, in execute_model
[rank0]:     self.set_active_loras(model_input.lora_requests,
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 1385, in set_active_loras
[rank0]:     self.lora_manager.set_active_adapters(lora_requests, lora_mapping)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/lora/worker_manager.py", line 167, in set_active_adapters
[rank0]:     set_active_adapters_worker(requests, mapping, self._apply_adapters,
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/adapter_commons/utils.py", line 54, in set_active_adapters_worker
[rank0]:     apply_adapters_func(requests)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/lora/worker_manager.py", line 227, in _apply_adapters
[rank0]:     self.add_adapter(lora)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/lora/worker_manager.py", line 250, in add_adapter
[rank0]:     self._adapter_manager.activate_adapter(lora_request.lora_int_id)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/lora/models.py", line 752, in activate_adapter
[rank0]:     self._active_adapters.touch(lora_id)
[rank0]:   File "/opt/miniconda/envs/python39/lib/python3.9/site-packages/vllm/utils.py", line 275, in touch
[rank0]:     self._LRUCache__update(key)  # type: ignore
[rank0]: AttributeError: 'LoRALRUCache' object has no attribute '_LRUCache__update'
INFO 04-14 05:58:52 [multiproc_worker_utils.py:124] Killing local vLLM worker processes
[rank0]:[W414 05:58:52.627573283 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
/opt/miniconda/envs/python39/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
```
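
The failing frame is `touch()` in `vllm/utils.py`, which reaches for the private base-class method through its mangled name, `self._LRUCache__update(key)`. That lookup only succeeds if some class named `LRUCache` in the MRO actually defines a `__update` method; in vLLM 0.8.x the base appears to be `cachetools.LRUCache`, so an incompatible cachetools version in the environment is one plausible cause (an assumption, not confirmed here). Below is a standalone sketch of the mechanism, not vLLM's actual source, that reproduces the same AttributeError:

```python
# Standalone illustration of the name-mangling lookup from the traceback;
# the class names echo the error message, but this is NOT vLLM's source.
class LRUCache:
    # If this base class defined a double-underscore method __update, Python
    # would store it under the mangled name _LRUCache__update and the call
    # below would resolve. This stub deliberately omits it.
    pass


class LoRALRUCache(LRUCache):
    def touch(self, key):
        # Explicit mangled-name access, like vllm/utils.py's
        # `self._LRUCache__update(key)  # type: ignore`.
        self._LRUCache__update(key)


LoRALRUCache().touch("adapter-1")
# AttributeError: 'LoRALRUCache' object has no attribute '_LRUCache__update'
```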

Your current environment

```
INFO 04-14 05:48:25 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
/opt/miniconda/envs/python39/lib/python3.9/site-packages/_distutils_hack/__init__.py:33: UserWarning: Setuptools is replacing distutils.
  warnings.warn("Setuptools is replacing distutils.")
PyTorch version: 2.6.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-6.8.0-1026-azure-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.66
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe

Nvidia driver version: 550.120
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.8.1
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.8.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 1
Core(s) per socket: 48
Socket(s): 2
NUMA node(s): 4
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7V13 64-Core Processor
Stepping: 1
CPU MHz: 2445.441
BogoMIPS: 4890.88
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 3 MiB
L1i cache: 3 MiB
L2 cache: 48 MiB
L3 cache: 384 MiB
NUMA node0 CPU(s): 0-23
NUMA node1 CPU(s): 24-47
NUMA node2 CPU(s): 48-71
NUMA node3 CPU(s): 72-95
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves user_shstk clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm

Versions of relevant libraries:
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==26.4.0
[pip3] torch==2.6.0
[pip3] torchaudio==2.6.0
[pip3] torchvision==0.21.0
[pip3] transformers==4.51.2
[pip3] triton==3.2.0
[conda] No relevant packages
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.8.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV12 SYS SYS NODE 0-23 0 N/A
GPU1 NV12 X SYS SYS SYS 24-47 1 N/A
GPU2 SYS SYS X NV12 SYS 48-71 2 N/A
GPU3 SYS SYS NV12 X SYS 72-95 3 N/A
NIC0 NODE SYS SYS SYS X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

NVIDIA_VISIBLE_DEVICES=all
CUBLAS_VERSION=12.1.0.26
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
NCCL_VERSION=2.17.1
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NVIDIA_PRODUCT_NAME=Triton Server
CUDA_VERSION=12.1.0.023
CUDNN_VERSION=8.8.1.3+cuda12.0
NVIDIA_TRITON_SERVER_VERSION=23.03
LD_LIBRARY_PATH=/opt/tritonserver/backends/onnxruntime:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-11/lib64
NVIDIA_BUILD_ID=56086596
CUDA_DRIVER_VERSION=530.30.02
NVIDIA_REQUIRE_JETPACK_HOST_MOUNTS=
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

```





