[Draft][Gaudi][Model] Qwen2.5-VL optimization #1109
Conversation
This reverts commit c0e696b.
split warmup into text-only and image-only; force input_positions in the text-only case to be (3, seq_len)
full_attention_mask doesn't need to be created for each full-attention layer; create it once and reuse it. This saves memory and time (see the sketch after this list).
profile_run takes a maximum tensor size of 65K. To support it, we need to significantly reduce memory usage with the changes below:
- Set disable_tensor_cache=True for the vision model as well
- Add additional mark_step calls to split the graphs
- Move the einsum operation to CPU for bigger tensors (due to a GC error)
- Run FusedSDPA for longer sequences as well
- Fix use_graph to detect the multimodal bucket correctly
- Pass the right pixel size for execution
- Change multimodal buckets to align with resize
- Remove multimodal warmup for decode
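For illustration, a minimal sketch of the full_attention_mask reuse described above. All names and shapes here are assumptions for the sketch, not code taken from this PR:

import torch

# Assumed helper: build the full-attention mask once per forward pass and
# reuse it for every full-attention layer instead of rebuilding it per layer.
def build_full_attention_mask(seq_len: int, device: torch.device) -> torch.Tensor:
    # Zeros mean "no masking": every position may attend to every position.
    return torch.zeros(1, 1, seq_len, seq_len, device=device)

def run_full_attention_layers(layers, hidden_states: torch.Tensor) -> torch.Tensor:
    seq_len = hidden_states.size(-2)
    # Created once and shared by all full-attention layers.
    full_attention_mask = build_full_attention_mask(seq_len, hidden_states.device)
    for layer in layers:
        hidden_states = layer(hidden_states, attn_mask=full_attention_mask)
    return hidden_states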
vllm/worker/hpu_model_runner.py
Outdated
    #self.multimodal_buckets = [1600, 3136, 4096, 6400, 7744, 9216, 12544, 16384, 26500, 40000, 65536]
    self.multimodal_buckets = [1600, 3136, 4096, 6400, 7744, 9216, 12544]
else:
    self.multimodal_buckets = [int(i) for i in envvar.split(',')]
add a "sorted" on this. The way get_multimodal_bucket works assumes sorted
done
check if max bucket size of incoming images is graphed
replace: max_pixels -> num_patches
range_to_max_for_each_img = torch.arange(maxsize, device=indices.device).unsqueeze(0).repeat(indices.shape[0]-1, 1)
yy = range_to_max_for_each_img < indices[1:].unsqueeze(1)
zz = range_to_max_for_each_img >= indices[:-1].unsqueeze(1)
xx = torch.logical_and(yy, zz).float()
change var names
done
This PR, up to the following commit, has been tested with pytest, online tests, and offline tests.
| f"[MM_BUCKETING] Padding current number pixel {pixel_values.shape[0]} to {desired_number_of_pixels}" | ||
| ) | ||
| # needs to make sure padding_len is even | ||
| assert padding_len % 64 == 0, '[testing version] padding needs to be multiple of 64' |
what does [testing version] mean?
done
if video_input is not None:
    if is_hpu:
        print("Video inputs have not been enabled/verified yet, ignoring video inputs")
use logger.warning
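A hedged sketch of the suggested change; the wrapper function is illustrative, but init_logger is vLLM's usual logging entry point:

from vllm.logger import init_logger

logger = init_logger(__name__)

def maybe_warn_video_input(video_input, is_hpu: bool) -> None:
    # Route the message through the module logger instead of print().
    if video_input is not None and is_hpu:
        logger.warning(
            "Video inputs have not been enabled/verified yet, ignoring video inputs")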
done
_PAD_SLOT_ID = 0
_PAD_BLOCK_ID = 0

_UNSET_NUM_PATCHES = 9999999
I suppose we use this because None means something else? If not, could we use None?
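For illustration, a hedged sketch of the None-based alternative the comment asks about; the signature and the multimodal key shape are guesses, not the PR's code:

from typing import Optional

def use_graph(self, batch_size: int, seq_len: int, is_prompt: bool,
              num_patches: Optional[int] = None) -> bool:
    # None simply means "no multimodal input", so no magic sentinel is needed.
    if num_patches is None:
        return (batch_size, seq_len, is_prompt) in self.graphed_buckets
    bucket = self.get_multimodal_bucket(num_patches)
    return (batch_size, seq_len, is_prompt, bucket) in self.graphed_buckets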
vllm/worker/hpu_model_runner.py
Outdated
def __init__(self):
    envvar = os.environ.get('VLLM_MULTIMODAL_BUCKETS', "")
    if envvar == "":
        #TODO: with profile_run, the bucket of 65536 is added, so the pixel values
This statement is no longer true, I think; we profile with the largest bucket in this class.
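A tiny hedged sketch of what profiling with the largest bucket could look like; the method name is assumed:

def get_max_multimodal_bucket(self) -> int:
    # The bucket list is kept sorted ascending, so the last entry is the
    # largest bucket, which profile_run can use for worst-case memory.
    return self.multimodal_buckets[-1]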
@sayantan-nervana
I have gone through the PR as of here, made some suggestions and comments.
I hope these help.
from vllm.multimodal import MULTIMODAL_REGISTRY
from vllm.multimodal.utils import cached_get_tokenizer
# from vllm.model_executor.models.qwen2_5_vl import Qwen2_5_VLImageProcessorForceAlignment
we should remove this line?
expected_pixels_shape_one = 1176
expected_toks_per_img = expected_pixels_shape_zero // 4
mm_processor_kwargs = {}
#mm_processor_kwargs = {"force_alignment": True}
to be removed?
q_len = q.size(-2)
assert q_len % q_block_size == 0
q_tiles = (q_len // q_block_size) if (q_len % q_block_size == 0) else math.ceil(q_len / q_block_size)
#q_padding = q_tiles * q_block_size - q_len
shall we remove these commented lines?
row_mask = mask[:, :, s:e, :]
attn_output[:, :, s:e, :] = FusedSDPA.apply(row_q, k, v, row_mask, 0.0, False, None)
#TODO: markstep every 10th layer, didn't experiment which one is optimal number.
#10,50,100 shows simliar result, without this, we see the program hangs for multiple prompts(with larger images)
Suggested change:
- #10,50,100 shows simliar result, without this, we see the program hangs for multiple prompts(with larger images)
+ #INFO: %10, 50, 100 show similar results. Without the mark_step here, the model hangs for multiple prompts and/or larger images
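A minimal sketch of the periodic mark_step pattern this comment describes; the interval of 10 comes from the comment, while the surrounding loop and names are assumptions:

import habana_frameworks.torch.core as htcore

MARK_STEP_INTERVAL = 10  # 10, 50 and 100 reportedly behave similarly

def run_vision_blocks(blocks, hidden_states, attention_mask):
    for layer_idx, block in enumerate(blocks):
        hidden_states = block(hidden_states, attn_mask=attention_mask)
        # Periodically cut the lazy-mode graph so long vision sequences do not
        # accumulate into one huge graph and hang with multiple/large prompts.
        if (layer_idx + 1) % MARK_STEP_INTERVAL == 0:
            htcore.mark_step()
    return hidden_states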
return (batch_size, seq_len, is_prompt) in self.graphed_buckets
if not num_patches:
    return (batch_size, seq_len, is_prompt) in self.graphed_buckets
#TODO: We might need to check both language bucket and multimodal bucket
still needed?
vllm/worker/hpu_model_runner.py
Outdated
lora_request):
    assert self.model_is_mrope, "Warmup compatible with Qwen2vl models"
    if num_patches == _UNSET_NUM_PATCHES:
        # # only half of the total number of tokens should be from image
can this section be cleaned up?
image_h = int(math.sqrt(num_patches))
image_grid_thw = torch.tensor([1, image_h, image_h])
pixel_values = torch.randn(image_grid_thw.prod(), 1176) # TODO: figure out the variable name
Suggested change:
- pixel_values = torch.randn(image_grid_thw.prod(), 1176) # TODO: figure out the variable name
+ pixel_values = torch.randn(image_grid_thw.prod(), 1176)
#TODO: einsum with tensor dimension too big doesn't work. Register max size error.
#We can always move to CPU for all einsum without shape checking if perf impact is minimal.
if range_indices.shape[-1] > 40000:
    print("einsum running on CPU : ", range_indices.shape)
| print("einsum running on CPU : ", range_indices.shape) | |
| logger.info("einsum running on CPU : ", range_indices.shape) |
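A hedged sketch of the CPU fallback the excerpt describes; the 40000 threshold comes from the excerpt, while the helper name and einsum operands are assumptions:

import torch

EINSUM_CPU_THRESHOLD = 40000  # above this, the device einsum reportedly errors out

def safe_einsum(equation: str, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Fall back to CPU for very large operands, then move the result back to
    # the original device; otherwise run the einsum where the tensors live.
    if a.shape[-1] > EINSUM_CPU_THRESHOLD:
        return torch.einsum(equation, a.cpu(), b.cpu()).to(a.device)
    return torch.einsum(equation, a, b)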
vllm/worker/hpu_model_runner.py
Outdated
Bug:
Let's say we want to warm up for bucket = 42688.
Code:
image_h = int(math.sqrt(num_patches))
image_grid_thw = torch.tensor([1, image_h, image_h])
With num_patches = 42688, the square root is 206.6 and int() truncates it to 206, so the grid thw becomes 206x206 = 42436, and 42436 % 64 != 0.
Proposed change, something like: image_grid_thw = torch.tensor([1, image_h, num_patches/image_h])
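For illustration, a hedged sketch of one way to pick a grid whose product is exactly num_patches; the divisor search is an assumption, not this PR's fix:

import math
import torch

def make_image_grid_thw(num_patches: int) -> torch.Tensor:
    # Walk down from the integer square root to the nearest exact divisor so
    # that image_h * image_w == num_patches instead of a truncated square.
    image_h = int(math.sqrt(num_patches))
    while num_patches % image_h != 0:
        image_h -= 1
    image_w = num_patches // image_h
    return torch.tensor([1, image_h, image_w])

# For the bucket from the comment: make_image_grid_thw(42688) returns
# tensor([1, 184, 232]), and 184 * 232 == 42688.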
Co-authored-by: Iman Gohari <s.m.iman.gohari@intel.com>
63e356f to 7c9cf4c
WIP for Phase 2 Qwen2.5-VL optimization.
Details will be provided once ready.
--
Co-authored-by: Gustavo Malkomes gustavo.malkomes@intel.com
Co-authored-by: Jimin Ha jimin.ha@intel.com
Co-authored-by: Sayantan Sarkar sayantan.sarkar@intel.com
Co-authored-by: Iman Gohari s.m.iman.gohari@intel.com