Enabled and optimized GLM-4v-9b on Gaudi #691
Conversation
@gyou2021 Recently vLLM has made some updates to multi-modal support; in particular, an InputProcessor was introduced for each model, as in [VLM] Implement merged multimodal processor for Mllama. Will this affect your implementation, and will you update it accordingly?
No, it won't affect the implementation, so no update is needed.
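For background, the upstream change referenced above moves per-model multimodal preprocessing into a processor class registered on the model. A rough sketch of that registration pattern follows; the GLM4V* names are hypothetical, and the exact base-class hooks and registration arguments differ across vLLM versions:

```python
# Rough sketch of vLLM's merged multimodal processor pattern.
# The GLM4V* names here are hypothetical; the required hooks
# (multimodal field configs, prompt updates, dummy-input builders, ...)
# and the register_processor() arguments differ across vLLM versions.
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.processing import BaseMultiModalProcessor


class GLM4VMultiModalProcessor(BaseMultiModalProcessor):
    """Would own tokenization and image preprocessing for GLM-4v."""
    # Abstract hooks elided; see the merged Mllama processor in
    # upstream vLLM for a complete implementation.


# Registration attaches the processor to the model class, replacing
# the older standalone input-processor callbacks:
#
#     @MULTIMODAL_REGISTRY.register_processor(GLM4VMultiModalProcessor, ...)
#     class GLM4VForConditionalGeneration(nn.Module): ...
```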
jikunshang
left a comment
LGTM.
Signed-off-by: gyou2021 <ganmei.you@intel.com>
/run-gaudi-tests
This reverts commit c0e696b.
This PR reverts #691, which leads to `AttributeError: 'tuple' object has no attribute 'reshape'` for Qwen2.5-VL.

## Test

Server:

```
python -m vllm.entrypoints.openai.api_server --port 8080 --model Qwen/Qwen2.5-VL-3B-Instruct --tensor-parallel-size 1 --max-num-seqs 128 --dtype bfloat16 --gpu-memory-util 0.9 --max-num-batched-tokens 32768 --max-model-len 32768 --block-size 128
```

Client:

```
python benchmark_serving.py --backend openai-chat --model Qwen/Qwen2.5-VL-3B-Instruct --trust-remote-code --port 8080 --endpoint /v1/chat/completions --dataset-path lmarena-ai/vision-arena-bench-v0.1 --dataset-name hf --hf-split train --num-prompts 40 --request-rate inf --seed 0 --ignore_eos
```

### as is

```
ERROR 04-14 16:51:14 engine.py:139] AttributeError("'tuple' object has no attribute 'reshape'")
ERROR 04-14 16:51:14 engine.py:139] Traceback (most recent call last):
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 137, in start
ERROR 04-14 16:51:14 engine.py:139]     self.run_engine_loop()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 200, in run_engine_loop
ERROR 04-14 16:51:14 engine.py:139]     request_outputs = self.engine_step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 218, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     raise e
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/multiprocessing/engine.py", line 209, in engine_step
ERROR 04-14 16:51:14 engine.py:139]     return self.engine.step()
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/engine/llm_engine.py", line 1380, in step
ERROR 04-14 16:51:14 engine.py:139]     outputs = self.model_executor.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/executor_base.py", line 138, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.collective_rpc("execute_model",
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
ERROR 04-14 16:51:14 engine.py:139]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/utils.py", line 2323, in run_method
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_worker.py", line 294, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = LocalOrDistributedWorkerBase.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/worker_base.py", line 418, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     output = self.model_runner.execute_model(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
ERROR 04-14 16:51:14 engine.py:139]     return func(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 2697, in execute_model
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model.forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 745, in forward
ERROR 04-14 16:51:14 engine.py:139]     return wrapped_hpugraph_forward(
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/habana_frameworks/torch/hpu/graphs.py", line 610, in wrapped_hpugraph_forward
ERROR 04-14 16:51:14 engine.py:139]     outputs = orig_fwd(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/worker/hpu_model_runner.py", line 423, in forward
ERROR 04-14 16:51:14 engine.py:139]     hidden_states = self.model(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1742, in _wrapped_call_impl
ERROR 04-14 16:51:14 engine.py:139]     return self._call_impl(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1848, in _call_impl
ERROR 04-14 16:51:14 engine.py:139]     return inner()
ERROR 04-14 16:51:14 engine.py:139]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1796, in inner
ERROR 04-14 16:51:14 engine.py:139]     result = forward_call(*args, **kwargs)
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1104, in forward
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = self.get_input_embeddings_v0(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/qwen2_5_vl.py", line 1037, in get_input_embeddings_v0
ERROR 04-14 16:51:14 engine.py:139]     inputs_embeds = merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 448, in merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     return _hpu_merge_multimodal_embeddings(
ERROR 04-14 16:51:14 engine.py:139]   File "/root/vllm-fork/vllm/model_executor/models/utils.py", line 674, in _hpu_merge_multimodal_embeddings
ERROR 04-14 16:51:14 engine.py:139]     multimodal_embeddings = multimodal_embeddings.reshape(-1, hidden_size)
```

### with this PR

```
100%|██████████| 1/1 [00:01<00:00, 1.14s/it]
============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  1.14
Total input tokens:                      52
Total generated tokens:                  128
Request throughput (req/s):              0.88
Output token throughput (tok/s):         112.62
Total Token throughput (tok/s):          158.37
---------------Time to First Token----------------
Mean TTFT (ms):                          169.75
Median TTFT (ms):                        169.75
P99 TTFT (ms):                           169.75
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.61
Median TPOT (ms):                        7.61
P99 TPOT (ms):                           7.61
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.55
Median ITL (ms):                         7.59
P99 ITL (ms):                            8.10
==================================================
```

Co-authored-by: Michał Kuligowski <mkuligowski@habana.ai>
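To illustrate the error above: the multimodal embeddings can arrive as a tuple of per-image tensors rather than a single tensor, and a Python tuple has no `.reshape()`. A minimal standalone reproduction of the failure mode, with a hypothetical defensive fix (this is a sketch, not the vLLM code itself):

```python
import torch

hidden_size = 8
# Multimodal embeddings arriving as a tuple of per-image tensors,
# mirroring what _hpu_merge_multimodal_embeddings received here.
embeddings = (torch.randn(3, hidden_size), torch.randn(5, hidden_size))

try:
    embeddings.reshape(-1, hidden_size)  # tuples have no .reshape()
except AttributeError as e:
    print(e)  # 'tuple' object has no attribute 'reshape'

# Hypothetical defensive handling: concatenate tuple/list inputs
# into one tensor before reshaping.
if isinstance(embeddings, (tuple, list)):
    flat = torch.cat(embeddings, dim=0).reshape(-1, hidden_size)
else:
    flat = embeddings.reshape(-1, hidden_size)
print(flat.shape)  # torch.Size([8, 8])
```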
Removed graph recompilation caused by image inputs and by varying batch sizes;
Example:
python examples/offline_inference/vision_language.py -m glm4v
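For context on why recompiles matter: HPU graphs are compiled per input shape, so every new batch size or image-token count triggers another compilation. A generic shape-bucketing sketch of the mitigation idea (illustrative only, not the actual change in this PR):

```python
import torch

# Illustrative shape bucketing (not the actual PR change): padding
# dynamic dimensions up to a fixed set of bucket sizes bounds the
# number of distinct shapes the HPU graph compiler ever sees.
BUCKETS = [1, 2, 4, 8, 16, 32, 64, 128]


def next_bucket(n: int) -> int:
    """Smallest bucket >= n; falls back to n beyond the largest bucket."""
    return next((b for b in BUCKETS if b >= n), n)


def pad_batch(x: torch.Tensor) -> torch.Tensor:
    """Zero-pad dim 0 so the batch size always lands on a bucket."""
    target = next_bucket(x.size(0))
    if target == x.size(0):
        return x
    pad = x.new_zeros(target - x.size(0), *x.shape[1:])
    return torch.cat([x, pad], dim=0)


print(pad_batch(torch.randn(5, 16)).shape)  # torch.Size([8, 16])
```

Padded rows are masked out or dropped after the forward pass, trading a little wasted compute for stable graph shapes.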