🚀 The feature, motivation and pitch
Most multimodal models support image embeddings as input. See the previous PR: #6613
IMO there's no reason not to support this for Qwen2-VL.
I started adding this feature to Qwen2-VL, but I've run into some difficulties.
For example, I can't rely on the image embeddings alone to generate the new prompt_token_ids without the original image. See here
And here, if we return only the image embeds, an error occurs: AssertionError: mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'.
Do we need to pass through more parameters for Qwen2-VL? Please give me some tips.
Here is my draft code: #8856
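To illustrate why the embeddings alone don't seem to be enough, here is a rough sketch of how the placeholder count could be derived if image_grid_thw were passed along with the embeds. This is not the vLLM API: the function names, the pad token id, and the default merge size of 2 are assumptions for illustration only.

```python
from typing import List, Sequence, Tuple

# (temporal, height, width) of one image, counted in vision patches
GridTHW = Tuple[int, int, int]


def num_image_tokens(grid_thw: GridTHW, merge_size: int = 2) -> int:
    """Number of prompt placeholders one image needs.

    Assumes merge_size x merge_size neighbouring patches are merged
    into a single token (merge_size=2 is an assumption here).
    """
    t, h, w = grid_thw
    if h % merge_size != 0 or w % merge_size != 0:
        raise ValueError(f"grid {grid_thw} is not divisible by {merge_size}")
    return t * (h // merge_size) * (w // merge_size)


def expand_image_placeholders(
    prompt_token_ids: Sequence[int],
    image_pad_id: int,
    grids: Sequence[GridTHW],
    merge_size: int = 2,
) -> List[int]:
    """Replace each single image pad token with the right number of copies.

    Hypothetical helper: the grids are consumed in prompt order, one per
    placeholder, so no original image is needed.
    """
    remaining = list(grids)
    expanded: List[int] = []
    for tok in prompt_token_ids:
        if tok == image_pad_id:
            if not remaining:
                raise ValueError("more image placeholders than grids")
            count = num_image_tokens(remaining.pop(0), merge_size)
            expanded.extend([image_pad_id] * count)
        else:
            expanded.append(tok)
    if remaining:
        raise ValueError("more grids than image placeholders")
    return expanded
```

With something like this, the embeds would only need to carry their grid alongside them, and the row count of the embedding tensor could be checked against num_image_tokens for the same grid.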
Alternatives
No response
Additional context
No response