Add generation tests for multimodal generative models #29853
Conversation
torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1).to(input_ids.device)
)
input_ids = input_ids.index_select(0, expanded_return_idx)
if input_ids is not None:
To me it seems like there is no difference, and making it this way enables contrastive decoding in Idefics. Should we ask someone who made Idefics to take a look?
If the slow CI tests are happy, then it should be fine 👍 (do they all pass?)
Yes, the Idefics model tests are passing, including the slow ones.
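For reference, a minimal sketch of the row-expansion pattern discussed in this thread (not the exact library code; repeat_interleave is equivalent to the arange/index_select combination in the diff, and the None guard is what lets pixel-only prompts reuse the same path):

```python
import torch

input_ids = torch.tensor([[1, 2], [3, 4]])
expand_size = 2

# Old pattern: build an index like [0, 0, 1, 1] and gather rows with it.
idx = torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1)
expanded_a = input_ids.index_select(0, idx)

# Equivalent, and safely skipped when input_ids is None:
expanded_b = input_ids.repeat_interleave(expand_size, dim=0) if input_ids is not None else None

assert torch.equal(expanded_a, expanded_b)  # both are [[1,2],[1,2],[3,4],[3,4]]
```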
@@ -437,8 +437,6 @@ def forward(
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
image_features, inputs_embeds, input_ids, attention_mask, labels
)
if labels is None:
I do not think we should be manually adding labels to calculate the loss if None was passed. Removed to pass some of the tests in all the Llavas.
I think so.
Core maintainer reviewing this: have a look at this question as well :)
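For context, a minimal sketch of the behavior this removal restores, assuming a standard causal-LM shift loss (names and shapes are illustrative, not the exact Llava code):

```python
import torch
from torch import nn

def compute_loss(logits, labels=None):
    # If the caller passed no labels, return no loss rather than
    # manufacturing labels from the inputs.
    if labels is None:
        return None
    vocab_size = logits.shape[-1]
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return nn.CrossEntropyLoss()(
        shift_logits.view(-1, vocab_size), shift_labels.view(-1)
    )

logits = torch.randn(1, 5, 10)          # (batch, seq, vocab)
labels = torch.randint(0, 10, (1, 5))   # (batch, seq)
assert compute_loss(logits) is None     # no labels -> no loss
assert compute_loss(logits, labels) is not None
```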
sequence_length = self.model_tester.seq_length
inputs_dict_processed = {}
for k, v in inputs_dict.items():
if not isinstance(v, torch.Tensor):
Let's not halve the sequence length for multimodal models, because their input ids may have dependencies on the pixel values (e.g., the number of images == the number of special image tokens in the inputs).
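To illustrate the failure mode, a small sketch (image_token_id is hypothetical):

```python
import torch

image_token_id = 32000  # hypothetical special <image> token id
input_ids = torch.tensor([[1, 5, 9, image_token_id, 7, 8]])

# Halving the prompt can drop the <image> token while pixel_values
# still carries one image, so text/image merging no longer lines up.
half = input_ids[:, : input_ids.shape[-1] // 2]
assert image_token_id not in half  # the dependency is now broken
```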
logits_process_kwargs, _ = self._get_logits_processor_and_warper_kwargs(
input_ids.shape[-1],
model_kwargs["attention_mask"].shape[-1],
input_ids can now be any type of "main input", like pixel values. The attention mask may be more reliable.
)
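A small sketch of why the last dimension of the main input can lie about the prompt length (shapes are illustrative):

```python
import torch

# For a text model the "main input" is 2D, but for e.g. Fuyu/Idefics it can
# be 4D pixel values, whose last dim is an image width, not a sequence length.
pixel_values = torch.rand(1, 3, 224, 224)
attention_mask = torch.ones(1, 12)

prompt_length = attention_mask.shape[-1]   # 12, always a sequence length
not_a_length = pixel_values.shape[-1]      # 224, an image width
```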
def test_generate_without_input_ids(self):
config, _, _, max_length = self._get_input_ids_and_config()
for model_class in self.all_generative_model_classes:
config, _, max_length = self._get_input_ids_and_config(model_class)
Added as a loop over all models, so that we can get the model_class to infer the correct inputs_dict format.
if not config.is_encoder_decoder:
config, inputs_dict, _ = self._get_input_ids_and_config(model_class)
# We want to test only encoder-decoder models, also skip enc-dec models which are multimodal
if not config.is_encoder_decoder or not hasattr(config, "encoder_layers"):
Some multimodal encoder-decoder models (pix2struct) do not have this attribute.
@@ -60,6 +60,7 @@
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES,
One question here about the mapping. I see that "VISION_2_SEQ_MAPPING" does not contain all of the vision-language models that can generate. Is it supposed to be that way, or should we update it? Right now it is missing Idefics and Fuyu.
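A quick way to inspect the gap described here (the mapping is the one imported in the diff above; the keys are model types, and results depend on the installed version):

```python
from transformers.models.auto.modeling_auto import MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES

# Per the comment above, these are expected to be missing at the moment.
for model_type in ("idefics", "fuyu"):
    print(model_type, model_type in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES)
```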
Should be ready for review now. All tests (newly added and old ones) are passing for all models. VipLlava will be fixed after rebasing the mentioned PR. The biggest problem with multimodal models was the use of different main input names in their custom generate; this behavior should be fixed after the planned generate refactor.
Stopped reviewing thoroughly in tests/generation/test_utils.py.

I noticed that model_class now needs to be passed, and we have a new prepare_config_and_inputs_for_generation that redefines a few tested things at a class level. To me, this is a symptom that we're not following the same pattern everywhere, so I have a few questions/comments for you to explore :)

1. prepare_config_and_inputs_for_generation often redefines inputs_dict["input_name"]. Why do we need the input name at all? The dictionary from prepare_config_and_inputs_for_common should work out of the box in forward and generate (unless I'm missing something)
2. prepare_config_and_inputs_for_generation also pops a few inputs. If the inputs are not used at generation time, generate should be able to ignore them, no?
3. As you wrote, sequence_length = input_ids.shape[-1] // 2 can get us into trouble; let's get rid of it for all models. Let's also use max_new_tokens as opposed to max_length [We may need a separate PR to fix these, if the required changes are large]

After these questions are streamlined, I believe you'll need a much smaller diff to enable multimodal generative tests :)
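On the 3rd point, a usage sketch of the contrast between the two arguments (gpt2 is just an illustrative checkpoint; both kwargs are standard generate arguments):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("A prompt of some length", return_tensors="pt")

# max_length counts the prompt, so prompts of varying length silently
# change how many new tokens are generated:
out_a = model.generate(**inputs, max_length=20)    # prompt + new <= 20

# max_new_tokens is independent of the prompt length:
out_b = model.generate(**inputs, max_new_tokens=20)  # up to 20 new tokens
```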
@@ -250,7 +250,6 @@ class Kosmos2Config(PretrainedConfig):
vision_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`Kosmos2VisionConfig`].
latent_query_num (`int`, *optional*, defaults to 64):
The number of latent query tokens that represent the image features used in the text decoder component.
This should not have been removed?
oops, probably removed it by accident
@zucchini-nlp let's open a separate PR that addresses the 3rd point (get rid of the input_ids.shape[-1] // 2 halving)
@gante Yes, there is a PR already that you approved ahaha (#30016)
What does this PR do?
It was found that none of the vision LLMs have GenerationTesterMixin, which is why we had many blind spots that were not being tested. This PR adds the mixin to all VLMs and fixes what it can. Some tests are still failing; I am working on it. I hope the tests will be much cleaner when we refactor generate, because right now I had to add some hacks for tests in models with custom generation.

EDIT: Apparently, Idefics does not work right now without enabling the cache. To not forget in the upcoming rework of generate: the problem in Idefics is the special treatment of the image attention mask here.

cc @gante
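A minimal usage sketch of the constraint mentioned in the EDIT (checkpoint and prompt are illustrative; whether use_cache is required may change after the rework):

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b"  # illustrative Idefics checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Idefics prompts interleave text and images (here, an image URL).
prompts = [[
    "User: what is in the picture?",
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "\nAssistant:",
]]
inputs = processor(prompts, return_tensors="pt")

# As noted above, generation currently needs the cache enabled, because
# of the special handling of the image attention mask.
out = model.generate(**inputs, use_cache=True, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True))
```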