Add auto model for image-text-to-text #32472
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
("blip-2", "Blip2ForConditionalGeneration"), | ||
("fuyu", "FuyuForCausalLM"), |
Chameleon can also do image-text-to-text.
("blip-2", "Blip2ForConditionalGeneration"), | |
("fuyu", "FuyuForCausalLM"), | |
("blip-2", "Blip2ForConditionalGeneration"), | |
("chameleon", "ChameleonForConditionalGeneration"), | |
("fuyu", "FuyuForCausalLM"), |
+1 - though it can also do image-text-to-image; do we want it in this mapping still?
Overall looks great - thanks for adding!
Do all of these models share common model outputs? If not, what is the largest subset of outputs shared?
The PR in general looks good to merge to me, but let's wait for the processor-unification PRs to be merged first, as we can't group these models until that work is finalized and merged in.
("unispeech", "Wav2Vec2Processor"), | ||
("unispeech-sat", "Wav2Vec2Processor"), | ||
("video_llava", "VideoLlavaProcessor"), | ||
("vilt", "ViltProcessor"), | ||
("vipllava", "LlavaProcessor"), | ||
("vision-encoder-decoder", "DonutProcessor"), |
I don't think we can do this - vision-encoder-decoder is a generic composite model. The user can specify any encoder or decoder they want, so there is no mapping to any single processor.
+1, was about to say the same
Oh yes, that makes sense. I think we can instead guess the processor type from the model name, like it is done here for image_processor:
transformers/src/transformers/pipelines/__init__.py
Lines 998 to 1015 in e0d8253
if load_image_processor:
    # Try to infer image processor from model or config name (if provided as str)
    if image_processor is None:
        if isinstance(model_name, str):
            image_processor = model_name
        elif isinstance(config, str):
            image_processor = config
        # Backward compatibility, as `feature_extractor` used to be the name
        # for `ImageProcessor`.
        elif feature_extractor is not None and isinstance(feature_extractor, BaseImageProcessor):
            image_processor = feature_extractor
        else:
            # Impossible to guess what is the right image_processor here
            raise Exception(
                "Impossible to guess which image processor to use. "
                "Please provide a PreTrainedImageProcessor class or a path/identifier "
                "to a pretrained image processor."
            )
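A minimal sketch of how the same fallback could look for the processor; the helper name and signature below are assumptions for illustration, not the actual pipeline code:

def infer_processor_target(processor, model_name, config):
    # Mirrors the image_processor inference above: fall back to the model or
    # config name when no processor is given explicitly. Sketch only; the
    # helper name and signature are assumptions.
    if processor is not None:
        return processor
    if isinstance(model_name, str):
        return model_name
    if isinstance(config, str):
        return config
    raise ValueError(
        "Impossible to guess which processor to use. "
        "Please provide a processor class or a path/identifier "
        "to a pretrained processor."
    )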
Yep - I think that's the way to do it in the pipeline! I wouldn't worry too much about vision-encoder-decoder support. It's good to have, but because the model is by definition a bit of a Frankenstein, it's unlikely to be fully compatible in all cases anyway.
Tried out the models with this mapping - works fine, thanks!
("blip-2", "Blip2ForConditionalGeneration"), | ||
("fuyu", "FuyuForCausalLM"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 - though it can also do image-text to image, do we want it in this mapping still?
("unispeech", "Wav2Vec2Processor"), | ||
("unispeech-sat", "Wav2Vec2Processor"), | ||
("video_llava", "VideoLlavaProcessor"), | ||
("vilt", "ViltProcessor"), | ||
("vipllava", "LlavaProcessor"), | ||
("vision-encoder-decoder", "DonutProcessor"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, was about to say the same
Now that all the image-text-to-text models have had their processors standardized, I think we can safely merge this PR @ArthurZucker @NielsRogge
@@ -753,6 +753,32 @@
    ]
)

MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
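For context, a rough sketch of how such a mapping gets exposed as an auto class, following the pattern the existing AutoModelFor* classes use; this is illustrative rather than the exact lines added in the PR, and only a subset of entries is shown:

from collections import OrderedDict

from transformers.models.auto.auto_factory import _BaseAutoModelClass, _LazyAutoMapping
from transformers.models.auto.configuration_auto import CONFIG_MAPPING_NAMES

# Illustrative subset; the full entry list is what this PR adds.
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES = OrderedDict(
    [
        ("blip-2", "Blip2ForConditionalGeneration"),
        ("chameleon", "ChameleonForConditionalGeneration"),
        ("fuyu", "FuyuForCausalLM"),
    ]
)

# The lazy mapping resolves model-type strings to model classes on demand,
# and the auto class simply points at it.
MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING = _LazyAutoMapping(
    CONFIG_MAPPING_NAMES, MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING_NAMES
)

class AutoModelForImageTextToText(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_IMAGE_TEXT_TO_TEXT_MAPPING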
Should the existing classes be deleted from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES, or do we keep them there?
I just removed the ones that were previously in IGNORE_NON_AUTO_CONFIGURED. Won't deleting them from MODEL_FOR_VISION_2_SEQ_MAPPING_NAMES be a problem for BC?
Can we also add llava-next-video and video-llava? Those two can work with image+text inputs, as well as video+text.
LGTM, but this needs:
- tests where you use the class
- documentation, with a small example of how to use it (with llava, for example; see the sketch below)
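A small usage sketch of the kind being requested; the llava-hf/llava-1.5-7b-hf checkpoint and the prompt format are illustrative, not necessarily what the final docs will use:

import requests
from PIL import Image

from transformers import AutoModelForImageTextToText, AutoProcessor

# Illustrative checkpoint; the final docs may use a different model.
checkpoint = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForImageTextToText.from_pretrained(checkpoint)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat do you see in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])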
I replaced some tests that were using the existing auto classes with the new one. Of course, there will be more tests and documentation updates to come. Did you have any other specific tests or docs in mind @ArthurZucker? I didn't find many tests using AutoClasses except in the pipeline tests.
Okay for adding pipeline tests that use it in a follow-up PR 😉 LGTM
* Add Auto model for image-text-to-text
* Remove donut from processing auto, add chameleon to image-text-to-text models
* add qwen2_vl and llava_onevision
* add pixtral to auto model for image-text-to-text
* add mllama and idefics3
* remove models in IGNORE_NON_AUTO_CONFIGURED
* add AutoModelForImageTextToText to tests and doc
What does this PR do?
Add the AutoModelForImageTextToText object in preparation for the image-text-to-text pipeline.

Blocking PR for the image-text-to-text pipeline.

The following models need to be added to modeling auto:
Who can review?
@molbap @amyeroberts