
Conversation

gyou2021 commented Jan 16, 2025

  1. Enabled the multimodal model GLM-4v-9b on Gaudi.
  2. Optimized the model: removed graph recompilation caused by image inputs and by varying batch sizes (a sketch of the bucketing idea follows the example below).

Example:
python examples/offline_inference/vision_language.py -m glm4v
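
The recompilation fix itself is not shown in this thread; below is a minimal, hypothetical sketch of the general idea of padding inputs to a small set of fixed bucket shapes so the compiled HPU graph sees a constant shape across requests. The names `pad_batch_to_bucket` and `BATCH_BUCKETS` are assumptions for illustration, not code from this PR.

```python
# Hypothetical sketch of shape bucketing, not code from this PR.
# Padding the batch dimension to a few fixed sizes keeps the compiled
# graph shape constant, so different batch sizes do not trigger recompiles.
import torch

BATCH_BUCKETS = [1, 2, 4, 8, 16]  # assumed bucket sizes

def pad_batch_to_bucket(features: torch.Tensor) -> torch.Tensor:
    """Pad dim 0 of `features` up to the next bucket size."""
    bs = features.shape[0]
    target = next((b for b in BATCH_BUCKETS if b >= bs), bs)
    if target == bs:
        return features
    pad = features.new_zeros((target - bs, *features.shape[1:]))
    return torch.cat([features, pad], dim=0)

# Example: 3 images worth of vision features are padded to a bucket of 4,
# so one compiled graph serves any batch size up to 4.
vision_features = torch.randn(3, 1601, 4096)
print(pad_batch_to_bucket(vision_features).shape)  # torch.Size([4, 1601, 4096])
```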

PatrykWo added the question (Further information is requested) label Feb 4, 2025
PatrykWo added the New Model (Issue or PR to enable a new model) label and removed the question (Further information is requested) label Feb 13, 2025
yma11 commented Feb 24, 2025

@gyou2021 Recently vLLM has made some updates to multi-modal support; in particular, it introduced an InputProcessor for each model, as in [VLM] Implement merged multimodal processor for Mllama. Will this affect your implementation, and will you update it accordingly?

gyou2021 (Author) commented

> Recently vLLM has made some updates to multi-modal support; in particular, it introduced an InputProcessor for each model, as in [VLM] Implement merged multimodal processor for Mllama. Will this affect your implementation, and will you update it accordingly?

No, it won't affect the implementation, so no update is needed.


jikunshang left a comment


LGTM.

Signed-off-by: gyou2021 <ganmei.you@intel.com>
@michalkuligowski

/run-gaudi-tests

@michalkuligowski

/run-gaudi-tests

michalkuligowski merged commit c0e696b into HabanaAI:habana_main Apr 1, 2025
41 checks passed
imangohari1 added a commit to imangohari1/vllm-fork that referenced this pull request Apr 8, 2025
michalkuligowski added a commit that referenced this pull request Apr 16, 2025
This PR reverts #691, which leads to `AttributeError: 'tuple' object has no attribute 'reshape'` for Qwen2.5-VL.
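
For context, a minimal sketch of why the error occurs (the helper below is hypothetical, not the actual code in this fork): some paths return multimodal embeddings as a tuple of per-image tensors rather than a single tensor, and calling `.reshape` directly on a tuple raises exactly this AttributeError; concatenating first avoids it.

```python
# Hypothetical illustration only; flatten_mm_embeddings is not a helper in this fork.
import torch

def flatten_mm_embeddings(mm_embeds, hidden_size: int) -> torch.Tensor:
    # A tuple/list of per-image tensors has no .reshape; concatenate first.
    if isinstance(mm_embeds, (tuple, list)):
        mm_embeds = torch.cat(mm_embeds, dim=0)
    return mm_embeds.reshape(-1, hidden_size)

# Two images with different numbers of vision tokens, hidden size 8:
per_image = (torch.randn(4, 8), torch.randn(6, 8))
print(flatten_mm_embeddings(per_image, 8).shape)  # torch.Size([10, 8])
```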

## Test 
Server:
```
python -m vllm.entrypoints.openai.api_server --port 8080 --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1 --max-num-seqs 128 --dtype bfloat16 --gpu-memory-util 0.9 --max-num-batched-tokens 32768 --max-model-len 32768 --block-size 128
```
Client: 
```
python benchmark_serving.py --backend openai-chat --model Qwen/Qwen2.5-VL-3B-Instruct --trust-remote-code --port 8080 --endpoint /v1/chat/completions --dataset-path lmarena-ai/vision-arena-bench-v0.1 --dataset-name hf --hf-split train --num-prompts 40 --request-rate inf --seed 0 --ignore_eos
```

### as is 
```
ERROR 04-14 16:51:14 engine.py:139] AttributeError("'tuple' object has no attribute 'reshape'")
ERROR 04-14 16:51:14 engine.py:139] Traceback (most recent call last):
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 137, in start
ERROR 04-14 16:51:14 engine.py:139]     self.run_engine_loop()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 200, in run_engine_loop
ERROR 04-14 16:51:14 engine.py:139]     request_outputs = self.engine_step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 218, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     raise e
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 209, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     return self.engine.step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/llm_engine.py", line 1380, in step
ERROR 04-14 16:51:14 engine.py:139]     outputs = self.model_executor.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/executor_base.py", line 138, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.collective_rpc("execute_model",
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 04-14 16:51:14 engine.py:139]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/utils.py", line 2323, in run_method
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_worker.py", line 294, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = LocalOrDistributedWorkerBase.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/worker_base.py", line 418, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.model_runner.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 2697, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model.forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 745, in forward
ERROR 04-14 16:51:14 engine.py:139]     return wrapped_hpugraph_forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 610, in wrapped_hpugraph_forward
ERROR 04-14 16:51:14 engine.py:139]     outputs = orig_fwd(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 423, in forward
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1742, in _wrapped_call_impl
ERROR 04-14 16:51:14 engine.py:139]     return self._call_impl(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
ERROR 04-14 16:51:14 engine.py:139]     return inner()
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1796, in inner
ERROR 04-14 16:51:14 engine.py:139]     result = forward_call(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1104, in forward
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = self.get_input_embeddings_v0(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1037, in get_input_embeddings_v0
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 448, in merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     return _hpu_merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 674, in _hpu_merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     multimodal_embeddings = multimodal_embeddings.reshape(-1, hidden_size)
```
### with this PR
```
100%|██████████| 1/1 [00:01<00:00,  1.14s/it]
============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  1.14      
Total input tokens:                      52        
Total generated tokens:                  128       
Request throughput (req/s):              0.88      
Output token throughput (tok/s):         112.62    
Total Token throughput (tok/s):          158.37    
---------------Time to First Token----------------
Mean TTFT (ms):                          169.75    
Median TTFT (ms):                        169.75    
P99 TTFT (ms):                           169.75    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.61      
Median TPOT (ms):                        7.61      
P99 TPOT (ms):                           7.61      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.55      
Median ITL (ms):                         7.59      
P99 ITL (ms):                            8.10      
==================================================
```

Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
