Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix VLM generation issues #32836

Merged
merged 7 commits into from
Aug 16, 2024
Merged

Fix VLM generation issues #32836

merged 7 commits into from
Aug 16, 2024

Conversation

ArthurZucker
Copy link
Collaborator

@ArthurZucker ArthurZucker commented Aug 15, 2024

What does this PR do?

Fixes generation for llava-next-video. Apparently started failing after we moved to cache class but some parts of the code were not modified. Checked all llava models, others are working since check is done on a different condition.

Yes, we can start using cache_position and rely on that, but we should note that cache_position for VLMs will not be correct and will contain positions only for text tokens. Adding support for cache position will come in the next PR, which is in progress. We;ll have to deprecate many things before we can get rid of the current checks to "merge or expand".

This PR ports the changes from #32527 to a proper patch

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -853,7 +853,7 @@ def forward(
inputs_embeds = self.get_input_embeddings()(input_ids)

# Merge text and images in prefill stage
if past_key_values is None:
if input_ids.shape[1] != 1:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if input_ids.shape[1] != 1:
if inputs_embeds.shape[1] != 1:

should we not check inputs_embeds instead?

Copy link
Member

@zucchini-nlp zucchini-nlp Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nope, we don't expect a user to pass inputs embeds and pixels at the same time, because for merging we need to slice image token ids positions. Yes, we have a ValueError in case both are passed, but it is in latest main.

UPDATE: Okey, this was causing test fails so yeah, I fixed it with embeds to make CI happy. This should be enough for the patch, we'll get the new VLMs on v4.45, will try to add mixin tests until then

@@ -910,7 +910,7 @@ def forward(
pass

# generation with cache, decoding stage
elif past_key_values is not None and (pixel_values is not None or pixel_values_videos is not None):
elif pixel_values is not None or pixel_values_videos is not None:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why this is changed as the first if is inputs_embeds.shape[1] != 1: so that is prefill / simple forward (and indeed merging should always be done regardless of generation)
But then here, you need to index the past key values, which don't always exist (forward call with use_cache = False). for example

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In forward call with no cache, we should go with the first 'if' and never reach the elif. yeah kinda too many dependencies, but shouldn't cause an error. I'll add a parametrized to the new tests (use-cache or no-cache)

@ArthurZucker ArthurZucker marked this pull request as ready for review August 16, 2024 18:58
@ArthurZucker ArthurZucker merged commit ad87ed0 into v4.44-release Aug 16, 2024
22 of 25 checks passed
@ArthurZucker ArthurZucker deleted the patch_vlms branch August 16, 2024 18:58
ArthurZucker added a commit that referenced this pull request Aug 20, 2024
* fix in one commit

* add parameterized

* fix tests

* fix test flakiness

* maybe that's why flaky

* style

* flakiness...

---------

Co-authored-by: raushan <raushan@huggingface.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants