Add generation tests for multimodal generative models #29853
Conversation
torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1).to(input_ids.device)
)
input_ids = input_ids.index_select(0, expanded_return_idx)
if input_ids is not None:
To me it seems like there is no difference, and making it this way enables contrastive decoding in Idefics. Should we ask someone who made Idefics to take a look?
If the slow CI tests are happy, then it should be fine 👍 (do they all pass?)
Yes, the Idefics model tests are passing, including the slow ones.
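For reference, a minimal sketch of the row-expansion pattern discussed in this thread (not the exact library code; repeat_interleave is equivalent to the arange/index_select combination in the diff, and the None guard is what lets pixel-only prompts reuse the same path):

```python
import torch

input_ids = torch.tensor([[1, 2], [3, 4]])
expand_size = 2

# Old pattern: build an index like [0, 0, 1, 1] and gather rows with it.
idx = torch.arange(input_ids.shape[0]).view(-1, 1).repeat(1, expand_size).view(-1)
expanded_a = input_ids.index_select(0, idx)

# Equivalent, and safely skipped when input_ids is None:
expanded_b = input_ids.repeat_interleave(expand_size, dim=0) if input_ids is not None else None

assert torch.equal(expanded_a, expanded_b)  # both are [[1,2],[1,2],[3,4],[3,4]]
```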
@@ -437,8 +437,6 @@ def forward(
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
image_features, inputs_embeds, input_ids, attention_mask, labels
)
if labels is None:
I do not think we should be manually adding labels to calculate the loss if None was passed. Removed to pass some of the tests in all the Llavas.
I think so.
Core maintainer reviewing this: have a look at this question as well :)
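For context, a minimal sketch of the behavior this removal restores, assuming a standard causal-LM shift loss (names and shapes are illustrative, not the exact Llava code):

```python
import torch
from torch import nn

def compute_loss(logits, labels=None):
    # If the caller passed no labels, return no loss rather than
    # manufacturing labels from the inputs.
    if labels is None:
        return None
    vocab_size = logits.shape[-1]
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    return nn.CrossEntropyLoss()(
        shift_logits.view(-1, vocab_size), shift_labels.view(-1)
    )

logits = torch.randn(1, 5, 10)          # (batch, seq, vocab)
labels = torch.randint(0, 10, (1, 5))   # (batch, seq)
assert compute_loss(logits) is None     # no labels -> no loss
assert compute_loss(logits, labels) is not None
```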
sequence_length = self.model_tester.seq_length
inputs_dict_processed = {}
for k, v in inputs_dict.items():
if not isinstance(v, torch.Tensor):
Let's not halve the sequence length for multimodal models, because their input ids may have dependencies on the pixel values (e.g., the number of images == the number of special image tokens in the inputs).
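To illustrate the failure mode, a small sketch (image_token_id is hypothetical):

```python
import torch

image_token_id = 32000  # hypothetical special <image> token id
input_ids = torch.tensor([[1, 5, 9, image_token_id, 7, 8]])

# Halving the prompt can drop the <image> token while pixel_values
# still carries one image, so text/image merging no longer lines up.
half = input_ids[:, : input_ids.shape[-1] // 2]
assert image_token_id not in half  # the dependency is now broken
```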
logits_process_kwargs, _ = self._get_logits_processor_and_warper_kwargs(
input_ids.shape[-1],
model_kwargs["attention_mask"].shape[-1],
input_ids can now be any type of "main input", like pixel values. The attention mask may be more reliable.
)
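A small sketch of why the last dimension of the main input can lie about the prompt length (shapes are illustrative):

```python
import torch

# For a text model the "main input" is 2D, but for e.g. Fuyu/Idefics it can
# be 4D pixel values, whose last dim is an image width, not a sequence length.
pixel_values = torch.rand(1, 3, 224, 224)
attention_mask = torch.ones(1, 12)

prompt_length = attention_mask.shape[-1]   # 12, always a sequence length
not_a_length = pixel_values.shape[-1]      # 224, an image width
```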
def test_generate_without_input_ids(self):
config, _, _, max_length = self._get_input_ids_and_config()
for model_class in self.all_generative_model_classes:
config, _, max_length = self._get_input_ids_and_config(model_class)
Added as a loop over all models, so that we can get the model_class to infer the correct inputs_dict format.
if not config.is_encoder_decoder:
config, inputs_dict, _ = self._get_input_ids_and_config(model_class)
# We want to test only encoder-decoder models, also skip enc-dec models which are multimodal
if not config.is_encoder_decoder or not hasattr(config, "encoder_layers"):
Some multimodal encoder-decoder models (pix2struct) do not have this attribute.
@@ -60,6 +60,7 @@
MODEL_FOR_SEQUENCE_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_TOKEN_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VIDEO_CLASSIFICATION_MAPPING_NAMES,
MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES,
One question here about the mapping. I see that "VISION_2_SEQ_MAPPING" does not contain all of the vision-language models that can generate. Is it supposed to be that way, or should we update it? Right now it is missing Idefics and Fuyu.
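A quick way to inspect the gap described here (the mapping is the one imported in the diff above; the keys are model types, and results depend on the installed version):

```python
from transformers.models.auto.modeling_auto import MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES

# Per the comment above, these are expected to be missing at the moment.
for model_type in ("idefics", "fuyu"):
    print(model_type, model_type in MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES)
```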
Should be ready for review now. All tests (newly added and old ones) are passing for all models. VipLlava will be fixed after rebasing the mentioned PR. The biggest problem with multimodal models was the use of different main input names in their custom generate; this behavior should be fixed after the planned generate refactor.
Stopped reviewing thoroughly in tests/generation/test_utils.py.

I noticed that model_class now needs to be passed, and we have a new prepare_config_and_inputs_for_generation that redefines a few tested things at a class level. To me, this is a symptom that we're not following the same pattern everywhere, so I have a few questions/comments for you to explore :)

1. prepare_config_and_inputs_for_generation often redefines inputs_dict["input_name"]. Why do we need the input name at all? The dictionary from prepare_config_and_inputs_for_common should work out of the box in forward and generate (unless I'm missing something)
2. prepare_config_and_inputs_for_generation also pops a few inputs. If the inputs are not used at generation time, generate should be able to ignore them, no?
3. As you wrote, sequence_length = input_ids.shape[-1] // 2 can get us into trouble; let's get rid of it for all models. Let's also use max_new_tokens as opposed to max_length [We may need a separate PR to fix these, if the required changes are large]

After these questions are streamlined, I believe you'll need a much smaller diff to enable multimodal generative tests :)
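On the 3rd point, a usage sketch of the contrast between the two arguments (gpt2 is just an illustrative checkpoint; both kwargs are standard generate arguments):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("A prompt of some length", return_tensors="pt")

# max_length counts the prompt, so prompts of varying length silently
# change how many new tokens are generated:
out_a = model.generate(**inputs, max_length=20)    # prompt + new <= 20

# max_new_tokens is independent of the prompt length:
out_b = model.generate(**inputs, max_new_tokens=20)  # up to 20 new tokens
```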
@@ -250,7 +250,6 @@ class Kosmos2Config(PretrainedConfig):
vision_config (`dict`, *optional*):
Dictionary of configuration options used to initialize [`Kosmos2VisionConfig`].
latent_query_num (`int`, *optional*, defaults to 64):
The number of latent query tokens that represent the image features used in the text decoder component.
This should not have been removed?
oops, probably removed it by accident
@zucchini-nlp let's open a separate PR that addresses the 3rd point (get rid of the input_ids.shape[-1] // 2 halving)
@gante Yes, there is a PR already that you approved ahaha (#30016)
What does this PR do?
It was found that none of the vision LLMs have GenerationTesterMixin, which is why we had many blind spots that were not being tested. This PR adds the mixin to all VLMs and fixes what it can. Some tests are still failing; I am working on it. I hope the tests will be much cleaner when we refactor generate, because right now I had to add some hacks for tests in models with custom generation.

EDIT: Apparently, Idefics does not work right now without enabling the cache. To not forget in the upcoming rework of generate: the problem in Idefics is the special treatment of the image attention mask here.

cc @gante
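A minimal usage sketch of the constraint mentioned in the EDIT (checkpoint and prompt are illustrative; whether use_cache is required may change after the rework):

```python
import torch
from transformers import AutoProcessor, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b"  # illustrative Idefics checkpoint
processor = AutoProcessor.from_pretrained(checkpoint)
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16)

# Idefics prompts interleave text and images (here, an image URL).
prompts = [[
    "User: what is in the picture?",
    "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
    "\nAssistant:",
]]
inputs = processor(prompts, return_tensors="pt")

# As noted above, generation currently needs the cache enabled, because
# of the special handling of the image attention mask.
out = model.generate(**inputs, use_cache=True, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True))
```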