[V1][Molmo] Fix get_multimodal_embeddings() in molmo.py
Expected: get_multimodal_embeddings() should return list[Tensor]
for `GPUModelRunner` to iterate.
Actual: prior to this PR, molmo's _get_mm_embeds() returns a list,
so get_multimodal_embeddings() returns a list of lists.
This is reproducible when all of the following hold:
* there is more than one request
* the trailing part of each request differs slightly, triggering a partial cache hit
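The shape mismatch can be sketched as follows (a minimal stand-alone illustration; the function names mirror the PR description but the bodies are hypothetical, not vLLM's actual code). Appending each per-request result produces the nested `list[list]` the runner cannot iterate; extending flattens it into the expected flat list:

```python
from typing import Any


def _get_mm_embeds(request: dict) -> list[Any]:
    # Hypothetical stand-in: returns the list of multimodal
    # embeddings for a single request.
    return request["embeds"]


def get_multimodal_embeddings_before(requests: list[dict]) -> list[Any]:
    # Before the fix: append() nests each per-request list,
    # yielding list[list[...]] instead of a flat list.
    out: list[Any] = []
    for req in requests:
        out.append(_get_mm_embeds(req))
    return out


def get_multimodal_embeddings_after(requests: list[dict]) -> list[Any]:
    # After the fix: extend() flattens, so the caller can
    # iterate over individual embeddings directly.
    out: list[Any] = []
    for req in requests:
        out.extend(_get_mm_embeds(req))
    return out
```

With two requests, the pre-fix version returns `[[e1, e2], [e3]]` while the fixed version returns `[e1, e2, e3]`, which is the flat shape `GPUModelRunner` expects.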
This PR also updates vision_language.py to help reproduce.
Tested with:
```
VLLM_USE_V1=1 \
python examples/offline_inference/vision_language.py \
--model molmo \
--num-prompts=2 \
--use-different-prompt-per-request
```
Signed-off-by: Linkun Chen <github@lkchen.net>