
Add auto model for image-text-to-text #32472

Merged

Conversation

@yonigozlan (Member) commented on Aug 6, 2024

What does this PR do?

Add an AutoModelForImageTextToText auto class in preparation for the image-text-to-text pipeline.
This is a blocking PR for the image-text-to-text pipeline.

The following models need to be added to modeling_auto:

  • GIT
  • BLIP
  • BLIP-2
  • IDEFICS
  • InstructBLIP
  • LLaVa
  • Fuyu
  • Pix2Struct/DePlot/MatCha
  • UDOP
  • Donut
  • KOSMOS-2
  • Idefics2
  • LLaVA-NeXT
  • PaliGemma
  • VipLlava
  • Chameleon
  • LLaVA-OneVision
  • Qwen2-VL
  • Pixtral
  • Idefics3
  • MLlama
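
For context, a minimal sketch of how the new auto class might be used; the checkpoint and prompt format below are illustrative assumptions, not taken from this PR:

import requests
from PIL import Image

from transformers import AutoModelForImageTextToText, AutoProcessor

# Illustrative checkpoint; any checkpoint whose model type is in the new mapping should work.
checkpoint = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])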

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@molbap @amyeroberts

@yonigozlan marked this pull request as ready for review on August 6, 2024 15:36
@yonigozlan requested a review from molbap on August 6, 2024 15:36
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines 725 to 761
("blip-2", "Blip2ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
Contributor

Chameleon can also do image-text-to-text.

Suggested change:
- ("blip-2", "Blip2ForConditionalGeneration"),
- ("fuyu", "FuyuForCausalLM"),
+ ("blip-2", "Blip2ForConditionalGeneration"),
+ ("chameleon", "ChameleonForConditionalGeneration"),
+ ("fuyu", "FuyuForCausalLM"),

Contributor

+1 - though it can also do image-text to image, do we want it in this mapping still?

@amyeroberts (Collaborator) left a comment

Overall looks great - thanks for adding!

Do all of these models share common model outputs? If not, what is the largest subset of outputs shared?

The PR in general looks good to merge to me, but let's wait for the processor-unification PRs to be merged first, since we can't group these models until that work is finalized and merged in.

("unispeech", "Wav2Vec2Processor"),
("unispeech-sat", "Wav2Vec2Processor"),
("video_llava", "VideoLlavaProcessor"),
("vilt", "ViltProcessor"),
("vipllava", "LlavaProcessor"),
("vision-encoder-decoder", "DonutProcessor"),
Collaborator

I don't think we can do this - vision-encoder-decoder is a generic composite model. The user can specify any encoder or decoder they want, so there is no mapping to any single processor.

Contributor

+1, was about to say the same

Member Author

Oh yes, that makes sense. I think we can instead infer the processor type from the model name, as is done here for image_processor:

if load_image_processor:
    # Try to infer image processor from model or config name (if provided as str)
    if image_processor is None:
        if isinstance(model_name, str):
            image_processor = model_name
        elif isinstance(config, str):
            image_processor = config
        # Backward compatibility, as `feature_extractor` used to be the name
        # for `ImageProcessor`.
        elif feature_extractor is not None and isinstance(feature_extractor, BaseImageProcessor):
            image_processor = feature_extractor
        else:
            # Impossible to guess what is the right image_processor here
            raise Exception(
                "Impossible to guess which image processor to use. "
                "Please provide a PreTrainedImageProcessor class or a path/identifier "
                "to a pretrained image processor."
            )
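
Not part of this PR - a hypothetical sketch of the analogous inference for processors in the new pipeline; load_processor and processor are assumed names used only for illustration:

if load_processor:
    # Try to infer the processor from the model or config name (if provided as str)
    if processor is None:
        if isinstance(model_name, str):
            processor = model_name
        elif isinstance(config, str):
            processor = config
        else:
            # Impossible to guess which processor to use here
            raise Exception(
                "Impossible to guess which processor to use. "
                "Please provide a processor instance or a path/identifier "
                "to a pretrained processor."
            )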

Collaborator

Yep - I think that's the way to do it in the pipeline! I wouldn't worry too much about vision-encoder-decoder support. It's good to have, but because the model is by definition a bit of a Frankenstein, it's unlikely to be fully compatible in all cases anyway.

@molbap (Contributor) left a comment

Tried out the models with this mapping - works fine, thanks!

@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from 4862d68 to a811755 on September 12, 2024 at 15:13
@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from a811755 to a0b31e5 on September 24, 2024 at 00:41
@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from a0b31e5 to 27f50d3 on October 2, 2024 at 14:59
@yonigozlan (Member Author)

Now that all the image-text-to-text models have had their processors standardized, I think we can safely merge this PR @ArthurZucker @NielsRogge

@@ -753,6 +753,32 @@
]
)

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
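
For reference, a sketch of roughly what the new mapping looks like; the entries below are inferred from the model list in the PR description and may not exactly match the merged file:

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
    [
        # Inferred entries - see modeling_auto.py on main for the authoritative list
        ("blip", "BlipForConditionalGeneration"),
        ("blip-2", "Blip2ForConditionalGeneration"),
        ("chameleon", "ChameleonForConditionalGeneration"),
        ("fuyu", "FuyuForCausalLM"),
        ("git", "GitForCausalLM"),
        ("idefics", "IdeficsForVisionText2Text"),
        ("idefics2", "Idefics2ForConditionalGeneration"),
        ("idefics3", "Idefics3ForConditionalGeneration"),
        ("instructblip", "InstructBlipForConditionalGeneration"),
        ("kosmos-2", "Kosmos2ForConditionalGeneration"),
        ("llava", "LlavaForConditionalGeneration"),
        ("llava_next", "LlavaNextForConditionalGeneration"),
        ("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
        ("mllama", "MllamaForConditionalGeneration"),
        ("paligemma", "PaliGemmaForConditionalGeneration"),
        ("pix2struct", "Pix2StructForConditionalGeneration"),
        ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
        ("udop", "UdopForConditionalGeneration"),
        ("vipllava", "VipLlavaForConditionalGeneration"),
    ]
)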
Contributor

Should the existing classes be deleted from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES or do we keep them there?

Member Author

I just removed the ones that were previously in IGNORE_NON_AUTO_CONFIGURED.
Won't deleting them from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES be a problem for backward compatibility?

Member

Can we also add llava-next-video and video-llava? Those two can work with image+text inputs as well as video+text.

@ArthurZucker (Collaborator) left a comment

LGTM, but this needs:

  • tests where you use the class
  • documentation, with a small example of how to use it (with LLaVA, for example)!

@yonigozlan (Member Author)

I replaced some tests that were using AutoModelForVision2Seq with AutoModelForImageTextToText and also swapped some instances of [Model]ForConditionalGeneration with AutoModelForImageTextToText in the documentation.

Of course, there will be more tests and documentation updates involving AutoModelForImageTextToText once the image-text-to-text pipeline is added.

Did you have any other specific tests or doc in mind @ArthurZucker? I didn’t find many tests using AutoClasses except in the pipeline tests.
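
As an illustration of the kind of swap described above (a sketch using an assumed LLaVA checkpoint, not the exact diff from this PR):

from transformers import AutoModelForImageTextToText, AutoModelForVision2Seq

checkpoint = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint

# Before: tests and docs loaded the model through the Vision2Seq auto class
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# After: the new, more specific auto class resolves to the same LLaVA model class
model = AutoModelForImageTextToText.from_pretrained(checkpoint)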

@ArthurZucker (Collaborator) left a comment

Okay with adding pipeline tests that use it in a follow-up PR 😉 LGTM

@yonigozlan merged commit e2001c3 into huggingface:main on Oct 8, 2024
24 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Add Auto model for image-text-to-text

* Remove donut from processing auto, add chameleon to image-text-to-text models

* add qwen2_vl and llava_onevision

* add pixtral to auto model for image-text-to-text

* add mllama and idefics3

* remove models in IGNORE_NON_AUTO_CONFIGURED

* add AutoModelForImageTextToText to tests and doc