-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix VLM generation issues #32836
Fix VLM generation issues #32836
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
@@ -853,7 +853,7 @@ def forward( | |||
inputs_embeds = self.get_input_embeddings()(input_ids) | |||
|
|||
# Merge text and images in prefill stage | |||
if past_key_values is None: | |||
if input_ids.shape[1] != 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if input_ids.shape[1] != 1: | |
if inputs_embeds.shape[1] != 1: |
should we not check inputs_embeds
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nope, we don't expect a user to pass inputs embeds and pixels at the same time, because for merging we need to slice image token ids positions. Yes, we have a ValueError in case both are passed, but it is in latest main
.
UPDATE: Okey, this was causing test fails so yeah, I fixed it with embeds to make CI happy. This should be enough for the patch, we'll get the new VLMs on v4.45, will try to add mixin tests until then
@@ -910,7 +910,7 @@ def forward( | |||
pass | |||
|
|||
# generation with cache, decoding stage | |||
elif past_key_values is not None and (pixel_values is not None or pixel_values_videos is not None): | |||
elif pixel_values is not None or pixel_values_videos is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this is changed as the first if is inputs_embeds.shape[1] != 1:
so that is prefill / simple forward (and indeed merging should always be done regardless of generation)
But then here, you need to index the past key values, which don't always exist (forward call with use_cache = False). for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In forward call with no cache, we should go with the first 'if' and never reach the elif
. yeah kinda too many dependencies, but shouldn't cause an error. I'll add a parametrized to the new tests (use-cache or no-cache)
* fix in one commit * add parameterized * fix tests * fix test flakiness * maybe that's why flaky * style * flakiness... --------- Co-authored-by: raushan <raushan@huggingface.co>
What does this PR do?
Fixes generation for llava-next-video. Apparently started failing after we moved to cache class but some parts of the code were not modified. Checked all llava models, others are working since check is done on a different condition.
Yes, we can start using cache_position and rely on that, but we should note that cache_position for VLMs will not be correct and will contain positions only for text tokens. Adding support for cache position will come in the next PR, which is in progress. We;ll have to deprecate many things before we can get rid of the current checks to "merge or expand".
This PR ports the changes from #32527 to a proper patch
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.