
Commit 6de3d43

[MM] Optimize memory profiling for scattered multimodal embeddings (#25810)

Authored by ywang96, committed by simon-mo
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: simon-mo <simon.mo@hey.com>

1 parent b14773b commit 6de3d43

File tree

1 file changed: +17 −0 lines changed

vllm/v1/worker/gpu_model_runner.py

Lines changed: 17 additions & 0 deletions
```diff
@@ -3351,6 +3351,23 @@ def profile_run(self) -> None:
                 expected_num_items=max_mm_items_per_batch,
             )

+            # NOTE: This happens when the encoder cache needs to store
+            # the embeddings that encoder outputs are scattered onto.
+            # In this case we create dummy embeddings of size
+            # (encoder_budget, hidden_size) and scatter the encoder
+            # output into them.
+            encoder_output_shape = dummy_encoder_outputs[0].shape
+            if encoder_output_shape[0] < encoder_budget:
+                expanded_outputs = []
+                for output in dummy_encoder_outputs:
+                    expanded = output.new_zeros(
+                        (encoder_budget, encoder_output_shape[-1]))
+                    num_tokens = output.shape[0]
+                    expanded[:num_tokens].copy_(output)
+                    expanded_outputs.append(expanded)
+
+                dummy_encoder_outputs = expanded_outputs
+
             # Cache the dummy encoder outputs.
             self.encoder_cache["tmp"] = dict(
                 enumerate(dummy_encoder_outputs))
```
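The added code zero-pads each dummy encoder output from its real token count up to the encoder budget, so memory profiling reserves space for the worst-case scattered embeddings. A minimal sketch of the same pad-and-scatter pattern using NumPy (function and variable names here are illustrative, not from vLLM, which operates on torch tensors via `new_zeros` and `copy_`):

```python
import numpy as np

def expand_to_budget(outputs, encoder_budget):
    """Zero-pad each (num_tokens, hidden_size) array to (encoder_budget, hidden_size)."""
    expanded_outputs = []
    for output in outputs:
        # Allocate a zero buffer of the full budget size, same dtype as the output.
        expanded = np.zeros((encoder_budget, output.shape[-1]), dtype=output.dtype)
        # Scatter the real encoder tokens into the front of the buffer.
        expanded[: output.shape[0]] = output
        expanded_outputs.append(expanded)
    return expanded_outputs

# Two dummy outputs with fewer tokens than the budget of 10.
dummy = [np.ones((5, 8), dtype=np.float32), np.ones((3, 8), dtype=np.float32)]
padded = expand_to_budget(dummy, encoder_budget=10)
print([p.shape for p in padded])  # → [(10, 8), (10, 8)]
```

Padding up front means the profiling run caches tensors of the maximum size the encoder cache could ever hold, rather than the (smaller) actual dummy outputs, which avoids under-reserving memory.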
