
Add generation tests for multimodal generative models #29853

Closed
wants to merge 7 commits

Conversation

zucchini-nlp (Member) commented Mar 25, 2024

What does this PR do?

It turned out that none of the vision LLMs used GenerationTesterMixin, which left many generation paths untested. This PR adds the mixin to all VLMs and fixes issues where possible. Some tests are still failing; I am working on it.
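For context, a minimal structural sketch of what "adding the mixin" means for a VLM test class (the class name, abbreviated body, and import paths are illustrative and assume the transformers repo layout, not this PR's actual diff):

```python
import unittest

from transformers import LlavaForConditionalGeneration, is_torch_available
from transformers.testing_utils import require_torch

# These mixins live in the transformers test suite (paths assume the repo layout).
from tests.generation.test_utils import GenerationTesterMixin
from tests.test_modeling_common import ModelTesterMixin


@require_torch
class LlavaGenerationTest(ModelTesterMixin, GenerationTesterMixin, unittest.TestCase):
    # GenerationTesterMixin reads these tuples and runs the shared generate()
    # tests (greedy, sampling, beam search, ...) against every listed class.
    # A model_tester providing configs/inputs is also required; omitted here.
    all_model_classes = (LlavaForConditionalGeneration,) if is_torch_available() else ()
    all_generative_model_classes = (LlavaForConditionalGeneration,) if is_torch_available() else ()
```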

I hope the tests will be much cleaner when we refactor generate, because right now I had to add some hacks for tests in models with custom generation.

EDIT:
Apparently, Idefics does not work right now without the cache enabled. Note for the upcoming rework of generate: the Idefics problem is in the special treatment of the image attention mask here.

cc @gante

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1).to(input_ids.device)
)
input_ids = input_ids.index_select(0, expanded_return_idx)
if input_ids is not None:
zucchini-nlp (Member Author):
To me it seems like there is no difference, and doing it this way enables contrastive decoding in Idefics. Should we ask someone who worked on Idefics to take a look?
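For readers following along, a small self-contained sketch (not Idefics' actual code) of what this batch expansion does before beam or contrastive search:

```python
import torch

# Each sample in the batch is repeated expand_size times along dim 0, which is
# the layout search strategies such as beam search and contrastive search expect.
input_ids = torch.tensor([[1, 2, 3], [4, 5, 6]])
expand_size = 2

expanded_return_idx = (
    torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1)
)
print(input_ids.index_select(0, expanded_return_idx))
# tensor([[1, 2, 3],
#         [1, 2, 3],
#         [4, 5, 6],
#         [4, 5, 6]])
```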

gante (Member):
If the slow CI tests are happy, then it should be fine 👍 (do they all pass?)

zucchini-nlp (Member Author):
Yes, the Idefics model tests are passing, including the slow ones.

@@ -437,8 +437,6 @@ def forward(
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
image_features, inputs_embeds, input_ids, attention_mask, labels
)
if labels is None:
zucchini-nlp (Member Author):
I do not think we should manually add labels to calculate the loss if None was passed. Removed to make some of the tests pass in all the Llava models.
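For reference, a hedged sketch of the convention being restored here (helper name and shapes are made up): the loss is computed only when the caller actually passes labels.

```python
from typing import Optional

import torch
from torch.nn import CrossEntropyLoss


def compute_lm_loss(logits: torch.Tensor, labels: Optional[torch.Tensor]):
    # No labels -> pure inference/generation path, so no loss is invented.
    if labels is None:
        return None
    # Standard causal-LM shift: predict token t+1 from tokens <= t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return CrossEntropyLoss()(
        shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
    )
```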

gante (Member):
I think so.

Core maintainer reviewing this: have a look at this question as well :)

sequence_length = self.model_tester.seq_length
inputs_dict_processed = {}
for k, v in inputs_dict.items():
if not isinstance(v, torch.Tensor):
zucchini-nlp (Member Author):
Let's not cut multimodal models' inputs to half the sequence length, because their input ids may have dependencies on the pixel values (e.g. the number of images must equal the number of special image tokens in the inputs).
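A small illustration of the dependency being described (the token id and shapes are made up):

```python
import torch

IMAGE_TOKEN_ID = 32000  # assumed placeholder id for "<image>"
input_ids = torch.tensor([[1, 5, 6, 7, 8, 9, IMAGE_TOKEN_ID, 10]])
pixel_values = torch.rand(1, 3, 336, 336)  # one image -> needs one image token

# The full sequence satisfies the constraint...
assert (input_ids == IMAGE_TOKEN_ID).sum().item() == pixel_values.shape[0]

# ...but keeping only the first half, as the generic test used to do, drops the
# image token and leaves pixel_values with nowhere to be merged into.
truncated = input_ids[:, : input_ids.shape[-1] // 2]
print((truncated == IMAGE_TOKEN_ID).sum().item())  # 0
```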

logits_process_kwargs, _ = self._get_logits_processor_and_warper_kwargs(
input_ids.shape[-1],
model_kwargs["attention_mask"].shape[-1],
zucchini-nlp (Member Author):
The input ids can now be any kind of "main input", such as pixel values, so the attention mask may be a more reliable source for the sequence length.
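A quick illustration of why (shapes are made up): when the "main input" is pixel_values, its last dimension is an image width, not a sequence length.

```python
import torch

pixel_values = torch.rand(2, 3, 224, 224)              # batch, channels, height, width
attention_mask = torch.ones(2, 12, dtype=torch.long)   # batch, prompt length

print(pixel_values.shape[-1])    # 224 -- not a sequence length
print(attention_mask.shape[-1])  # 12  -- the actual prompt length
```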

)

def test_generate_without_input_ids(self):
config, _, _, max_length = self._get_input_ids_and_config()
for model_class in self.all_generative_model_classes:
config, _, max_length = self._get_input_ids_and_config(model_class)
zucchini-nlp (Member Author):
Added as a loop over all model classes, so that we can use model_class to infer the correct inputs_dict format.

if not config.is_encoder_decoder:
config, inputs_dict, _ = self._get_input_ids_and_config(model_class)
# We want to test only encoder-decoder models, also skip enc-dec models which are multimodal
if not config.is_encoder_decoder or not hasattr(config, "encoder_layers"):
zucchini-nlp (Member Author):
Some multimodal encoder-decoder models (pix2struct) do not have this attribute.

@@ -60,6 +60,7 @@
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES,
zucchini-nlp (Member Author):
One question here about the mapping: I see that "VISION_2_SEQ_MAPPING" does not contain all of the vision-language models that can generate. Is it supposed to be that way, or should we update it? Right now it is missing Idefics and Fuyu.
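For anyone checking, this is roughly what the membership check looks like (the printed values would reflect the state described above, i.e. Idefics and Fuyu missing at the time):

```python
from transformers.models.auto.modeling_auto import MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES

# The mapping is keyed by model type; missing keys are what the question is about.
for model_type in ("llava", "idefics", "fuyu"):
    print(model_type, model_type in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES)
```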

@zucchini-nlp zucchini-nlp changed the title Add generation tests for multimodal generative models [WIP] Add generation tests for multimodal generative models Mar 26, 2024
zucchini-nlp (Member Author):

Should be ready for review now. All tests (newly added and old ones) are passing for all models. VipLlava will be fixed after rebasing the mentioned PR.

The biggest problem with multimodal models was the use of different main input names in their custom generate. This behavior should be fixed after the generate rework is done.
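As a hedged illustration of the "different main input names" issue (Whisper is used here only as an easy-to-check example of a non-text main input, not as one of the VLMs in question):

```python
from transformers import GPT2LMHeadModel, WhisperForConditionalGeneration

# Every PreTrainedModel declares a main_input_name class attribute; generic
# generation tests written around "input_ids" break when it is something else.
print(GPT2LMHeadModel.main_input_name)                  # "input_ids"
print(WhisperForConditionalGeneration.main_input_name)  # "input_features"
```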

@zucchini-nlp zucchini-nlp requested a review from gante March 26, 2024 13:08
gante (Member) left a comment:

Stopped reviewing thoroughly in tests/generation/test_utils.py.

I noticed that model_class now needs to be passed, and we have a new prepare_config_and_inputs_for_generation that redefines a few tested things at a class level. To me, this is a symptom that we're not following the same pattern everywhere, so I have a few questions/comments for you to explore :)

  1. prepare_config_and_inputs_for_generation often redefines inputs_dict["input_name"]. Why do we need the input name at all? The dictionary from prepare_config_and_inputs_for_common should work out of the box in forward and generate (unless I'm missing something)
  2. prepare_config_and_inputs_for_generation also pops a few inputs. If the inputs are not used at generation time, generate should be able to ignore them, no?
  3. As you wrote, sequence_length = input_ids.shape[-1] // 2 can get us into trouble; let's get rid of it for all models. Let's also use max_new_tokens as opposed to max_length (see the sketch after this comment).

[We may need a separate PR to fix these, if the required changes are large]

After these questions are streamlined, I believe you'll need a much smaller diff to enable multimodal generative tests :)
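A minimal sketch of point 3 above, using gpt2 purely as a stand-in for any generative model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("A reasonably long prompt for the example", return_tensors="pt")

# max_new_tokens bounds only the freshly generated tokens...
out_new = model.generate(**inputs, max_new_tokens=10)
# ...while max_length counts the prompt as well, so a long (e.g. multimodal)
# prompt can leave little or no room for actual generation.
out_len = model.generate(**inputs, max_length=12)

print(out_new.shape[-1], out_len.shape[-1])
```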


@@ -250,7 +250,6 @@ class Kosmos2Config(PretrainedConfig):
vision_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`Kosmos2VisionConfig`].
latent_query_num (`int`, *optional*, defaults to 64):
The number of latent query tokens that represent the image features used in the text decoder component.
gante (Member):
This should not have been removed?

zucchini-nlp (Member Author):
oops, probably removed it by accident


zucchini-nlp (Member Author):

@gante

  1. Yes, I found that some models use a custom generate, and the main input name in those cases is something else, e.g. pixel_values. So if we leave it as is and pass "input_ids" as the first main argument and everything else as model kwargs, those custom calls complain about incorrect arguments. I believe it would work out of the box if we called the main generate directly, since we check the model's forward args and main input names there; however, right now we cannot get rid of this. Another workaround would be to add a separate _get_inputs_and_config for each VLM's tests, similar to what is done for the audio models: that way we could hardcode the method to return (main_input_tensor, other_kwargs_in_dict, config, max_length) in their specific model testers. But I think that is the same level of complication as we have now :(
  2. Actually, generate raises an error for unused model kwargs in this line (see the sketch after this list). And since some VLMs have required args specific to that model, getting args from prepare_config_and_inputs_for_common by name seemed infeasible to generalize, even though it works for text generative models.
  3. Agreed, we can get rid of it. I just encountered a bunch of new failures on text models purely from changing the sequence length, so I decided to leave it for now. That may require another PR, merged before this one, that just removes the halving of the sequence length in the text models.
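A small sketch of the behavior mentioned in point 2, again with gpt2 as a stand-in: generate() validates its model kwargs up front, so leftover test inputs cannot simply be ignored.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Hello", return_tensors="pt")

try:
    # A kwarg the model's forward pass does not accept triggers the validation.
    model.generate(**inputs, max_new_tokens=5, some_unused_test_input=1)
except ValueError as err:
    print(err)  # lists the model kwargs that are not used by the model
```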

gante (Member) commented Apr 18, 2024:

@zucchini-nlp let's open a separate PR that addresses the 3rd point (get rid of the //2 in sequence_length = input_ids.shape[-1] // 2), and chat about VLM tests during our next meeting. I really want to move in the direction of less complexity, even if that implies structural changes in generate :)

zucchini-nlp (Member Author):

@gante Yes, there is already a PR that you approved ahaha (#30016).
Just waiting for a reply from @ArthurZucker on some comments before merging.

github-actions (bot):

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this May 21, 2024