
Add auto model for image-text-to-text #32472

Merged

Conversation

@yonigozlan (Member) commented on Aug 6, 2024

What does this PR do?

Add an AutoModelForImageTextToText auto class in preparation for the image-text-to-text pipeline.
This is a blocking PR for the image-text-to-text pipeline.

The following models need to be added to modeling_auto:

  • GIT
  • BLIP
  • BLIP-2
  • IDEFICS
  • InstructBLIP
  • LLaVa
  • Fuyu
  • Pix2Struct/DePlot/MatCha
  • UDOP
  • Donut
  • KOSMOS-2
  • Idefics2
  • LLaVA-NeXT
  • PaliGemma
  • VipLlava
  • Chameleon
  • LLaVA-OneVision
  • Qwen2-VL
  • Pixtral
  • Idefics3
  • MLlama
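
For context, a minimal sketch of how the new auto class might be used; the checkpoint and prompt format below are illustrative assumptions, not taken from this PR:

import requests
from PIL import Image

from transformers import AutoModelForImageTextToText, AutoProcessor

# Illustrative checkpoint; any checkpoint whose model type is in the new mapping should work.
checkpoint = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])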

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@molbap @amyeroberts

@yonigozlan marked this pull request as ready for review on August 6, 2024 15:36
@yonigozlan requested a review from molbap on August 6, 2024 15:36
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines 725 to 761
("blip-2", "Blip2ForConditionalGeneration"),
("fuyu", "FuyuForCausalLM"),
Contributor

Chameleon can also do image-text-to-text.

Suggested change:
- ("blip-2", "Blip2ForConditionalGeneration"),
- ("fuyu", "FuyuForCausalLM"),
+ ("blip-2", "Blip2ForConditionalGeneration"),
+ ("chameleon", "ChameleonForConditionalGeneration"),
+ ("fuyu", "FuyuForCausalLM"),

Contributor

+1 - though it can also do image-text to image, do we want it in this mapping still?

@amyeroberts (Collaborator) left a comment

Overall looks great - thanks for adding!

Do all of these models share common model outputs? If not, what is the largest subset of outputs shared?

The PR in general looks good to merge to me, but let's wait for the processor-unification PRs to be merged first, since we can't group these models until that work is finalized and merged in.

("unispeech", "Wav2Vec2Processor"),
("unispeech-sat", "Wav2Vec2Processor"),
("video_llava", "VideoLlavaProcessor"),
("vilt", "ViltProcessor"),
("vipllava", "LlavaProcessor"),
("vision-encoder-decoder", "DonutProcessor"),
Collaborator

I don't think we can do this - vision-encoder-decoder is a generic composite model. The user can specify any encoder or decoder they want, so there is no mapping to any single processor.

Contributor

+1, was about to say the same

Member Author

Oh yes, that makes sense. I think we can instead infer the processor type from the model name, as is done here for image_processor:

if load_image_processor:
    # Try to infer image processor from model or config name (if provided as str)
    if image_processor is None:
        if isinstance(model_name, str):
            image_processor = model_name
        elif isinstance(config, str):
            image_processor = config
        # Backward compatibility, as `feature_extractor` used to be the name
        # for `ImageProcessor`.
        elif feature_extractor is not None and isinstance(feature_extractor, BaseImageProcessor):
            image_processor = feature_extractor
        else:
            # Impossible to guess what is the right image_processor here
            raise Exception(
                "Impossible to guess which image processor to use. "
                "Please provide a PreTrainedImageProcessor class or a path/identifier "
                "to a pretrained image processor."
            )
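
Not part of this PR - a hypothetical sketch of the analogous inference for processors in the new pipeline; load_processor and processor are assumed names used only for illustration:

if load_processor:
    # Try to infer the processor from the model or config name (if provided as str)
    if processor is None:
        if isinstance(model_name, str):
            processor = model_name
        elif isinstance(config, str):
            processor = config
        else:
            # Impossible to guess which processor to use here
            raise Exception(
                "Impossible to guess which processor to use. "
                "Please provide a processor instance or a path/identifier "
                "to a pretrained processor."
            )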

Collaborator

Yep - I think that's the way to do it in the pipeline! I wouldn't worry too much about vision-encoder-decoder support. It's good to have, but because the model is by definition a bit of a Frankenstein, it's unlikely to be fully compatible in all cases anyway.

@molbap (Contributor) left a comment

Tried out the models with this mapping - works fine, thanks!

@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from 4862d68 to a811755 on September 12, 2024 at 15:13
@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from a811755 to a0b31e5 on September 24, 2024 at 00:41
@yonigozlan force-pushed the add-auto-model-for-image-text-to-text branch from a0b31e5 to 27f50d3 on October 2, 2024 at 14:59
@yonigozlan (Member Author)

Now that all the image-text-to-text models have had their processors standardized, I think we can safely merge this PR @ArthurZucker @NielsRogge

@@ -753,6 +753,32 @@
]
)

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
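
For reference, a sketch of roughly what the new mapping looks like; the entries below are inferred from the model list in the PR description and may not exactly match the merged file:

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
    [
        # Inferred entries - see modeling_auto.py on main for the authoritative list
        ("blip", "BlipForConditionalGeneration"),
        ("blip-2", "Blip2ForConditionalGeneration"),
        ("chameleon", "ChameleonForConditionalGeneration"),
        ("fuyu", "FuyuForCausalLM"),
        ("git", "GitForCausalLM"),
        ("idefics", "IdeficsForVisionText2Text"),
        ("idefics2", "Idefics2ForConditionalGeneration"),
        ("idefics3", "Idefics3ForConditionalGeneration"),
        ("instructblip", "InstructBlipForConditionalGeneration"),
        ("kosmos-2", "Kosmos2ForConditionalGeneration"),
        ("llava", "LlavaForConditionalGeneration"),
        ("llava_next", "LlavaNextForConditionalGeneration"),
        ("llava_onevision", "LlavaOnevisionForConditionalGeneration"),
        ("mllama", "MllamaForConditionalGeneration"),
        ("paligemma", "PaliGemmaForConditionalGeneration"),
        ("pix2struct", "Pix2StructForConditionalGeneration"),
        ("qwen2_vl", "Qwen2VLForConditionalGeneration"),
        ("udop", "UdopForConditionalGeneration"),
        ("vipllava", "VipLlavaForConditionalGeneration"),
    ]
)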
Contributor

Should the existing classes be deleted from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES or do we keep them there?

Member Author

I just removed the ones that were previously in IGNORE_NON_AUTO_CONFIGURED.
Won't deleting them from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES be a problem for backward compatibility?

Member

Can we also add llava-next-video and video-llava? Those two can work with image+text inputs as well as video+text.

@ArthurZucker (Collaborator) left a comment

LGTM, but this needs:

  • tests where you use the class
  • documentation, with a small example of how to use it (with LLaVA, for example)!

@yonigozlan (Member Author)

I replaced some tests that were using AutoModelForVision2Seq with AutoModelForImageTextToText and also swapped some instances of [Model]ForConditionalGeneration with AutoModelForImageTextToText in the documentation.

Of course, there will be more tests and documentation updates involving AutoModelForImageTextToText once the image-text-to-text pipeline is added.

Did you have any other specific tests or doc in mind @ArthurZucker? I didn’t find many tests using AutoClasses except in the pipeline tests.
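
As an illustration of the kind of swap described above (a sketch using an assumed LLaVA checkpoint, not the exact diff from this PR):

from transformers import AutoModelForImageTextToText, AutoModelForVision2Seq

checkpoint = "llava-hf/llava-1.5-7b-hf"  # illustrative checkpoint

# Before: tests and docs loaded the model through the Vision2Seq auto class
model = AutoModelForVision2Seq.from_pretrained(checkpoint)

# After: the new, more specific auto class resolves to the same LLaVA model class
model = AutoModelForImageTextToText.from_pretrained(checkpoint)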

@ArthurZucker (Collaborator) left a comment

Okay with adding pipeline tests that use it in a follow-up PR 😉 LGTM

@yonigozlan merged commit e2001c3 into huggingface:main on Oct 8, 2024
24 checks passed
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024
* Add Auto model for image-text-to-text

* Remove donut from processing auto, add chameleon to image-text-to-text models

* add qwen2_vl and llava_onevision

* add pixtral to auto model for image-text-to-text

* add mllama and idefics3

* remove models in IGNORE_NON_AUTO_CONFIGURED

* add AutoModelForImageTextToText to tests and doc