Add image text to text pipeline #34170

Merged

Conversation

yonigozlan (Member) commented Oct 15, 2024

What does this PR do?

Add image-text-to-text pipeline!

A split of this PR containing only the model-specific pre- and post-processing is available here, in order to reduce the LOC count and the number of files changed before merging this PR.

Note: A "legacy" kwarg that modifies the preprocessing of some image-text-to-text models is needed here if we want to integrate those models into this pipeline. However, the way it is handled might not be ideal, so I'm open to suggestions on how to improve it.

The pipeline supports the following inputs (a short batched-call sketch follows this list):

  • unbatched images and text - images=image, text=text
  • batched images and text - images=[image, image], text=[text, text]
  • several images per prompt (only for models supporting the use of an image token) - images=[[image, image], [image]] or images=[image, image, image], text=["... <image>...<image>...", "...<image>..."]
  • chat templates (for models supporting them)
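
For instance, a batched call might look like the following minimal sketch (the checkpoint and image path are reused from the examples further down; the second prompt is just illustrative):

>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
>>> outputs = pipe(
...     images=[image, image],
...     text=["<image> What this is? Assistant: This is", "<image> How many cats are there? Assistant:"],
...     max_new_tokens=20,
... )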

TODOs:

  • Add pipeline tests in model-specific test files
  • Update tasks documentation?

Known current limitations/bugs:

  • Using prompts without image tokens with models that expect them will throw an error. Should we automatically add image tokens to prompts and display a warning? For now, only a warning is displayed if the model's processor has an image token.
  • Using several images per prompt with models that do not support the use of an image token will raise an uncaught error.
  • Donut doesn't work, as there is a problem identifying the correct model type for it.
  • Idefics3 will raise an uncaught error if the correct image tokens are not provided; fixed in Use non nested images and batched text Idefics2/3 #34222.
  • Pixtral with batched input raises "Pipeline with tokenizer without pad_token cannot do batching. You can try to set it with pipe.tokenizer.pad_token_id = model.config.eos_token_id." (a possible workaround is sketched after this list).
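
For the Pixtral case, one possible workaround (a sketch only, following the spirit of the hint in the error message; the checkpoint id is illustrative and this is not verified here) is to give the tokenizer a pad token before batching:

>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="mistral-community/pixtral-12b")
>>> if pipe.tokenizer.pad_token_id is None:
...     pipe.tokenizer.pad_token_id = pipe.tokenizer.eos_token_id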

Examples of usage:

>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> image = "./tests/fixtures/tests_samples/COCO/000000039769.png"
>>> text = "<image> What this is? Assistant: This is"
>>> pipe(image, text=text, max_new_tokens=20)
[
    [
        {
            "input_text": "<image> What this is? Assistant: This is",
            "generated_text": "<image> What this is? Assistant: This is a photo of two cats lying on a pink blanket. The cats are sleeping and appear to be comfortable",
        }
    ]
]
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
>>>     {
>>>         "role": "user",
>>>         "content": [
>>>             {
>>>                 "type": "image",
>>>                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
>>>             },
>>>             {"type": "text", "text": "Describe this image."},
>>>         ],
>>>     }
>>> ]
>>> outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
>>> print(outputs[0]["generated_text"])
"In the image, a woman is sitting on the sandy beach, her legs crossed in a relaxed manner"
>>> from transformers import pipeline
>>> pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")
>>> messages = [
>>>     {
>>>         "role": "user",
>>>         "content": [
>>>             {
>>>                 "type": "image",
>>>                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
>>>             },
>>>             {"type": "text", "text": "Describe this image."},
>>>         ],
>>>     },
>>>     {
>>>         "role": "assistant",
>>>         "content": [
>>>             {"type": "text", "text": "There is a dog and"},
>>>         ],
>>>     },
>>> ]
>>> outputs = pipe(text=messages, max_new_tokens=20)
>>> print(outputs[0]["generated_text"])
[
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "There is a dog and a person in the image. The dog is sitting on the sand, and the person is sitting on",
            }
        ],
    },
]

Who can review?

@Rocketknight1 @molbap @qubvel @NielsRogge

@yonigozlan yonigozlan marked this pull request as ready for review October 15, 2024 09:12
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

yonigozlan force-pushed the add-image-text-to-text-pipeline branch from 90f00d4 to 4ac2d1f on October 15, 2024 14:11
knkski commented Oct 15, 2024

Will it be possible to use this PR for just text generation with an image-capable model? I'm trying to use this PR (at commit 4ac2d1f) with meta-llama/Llama-3.2-90B-Vision-Instruct so that I can compare its language capabilities against Llama 3.1 70B, and I don't need to use the image support.

I tried calling it like this:

pipe = pipeline(
    "image-text-to-text",
    model="meta-llama/Llama-3.2-90B-Vision-Instruct", 
    device_map="auto",
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is 1+1?"},
        ],
    }
]
outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
print(outputs[0]["generated_text"])

That resulted in this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:393, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
    392 try:
--> 393     model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
    394         dtype=self.torch_dtype
    395     )
    396 except TypeError:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:285, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
    284 _ = text_kwargs.pop("padding_side", None)  # hack until padding-side is an accepted kwarg by tokenizers
--> 285 encoding = self.tokenizer(text, **text_kwargs)
    286 data.update(encoding)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3020, in PreTrainedTokenizerBase.__call__(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   3019         self._switch_to_input_mode()
-> 3020     encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
   3021 if text_target is not None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3108, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
   3107     batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 3108     return self.batch_encode_plus(
   3109         batch_text_or_text_pairs=batch_text_or_text_pairs,
   3110         add_special_tokens=add_special_tokens,
   3111         padding=padding,
   3112         truncation=truncation,
   3113         max_length=max_length,
   3114         stride=stride,
   3115         is_split_into_words=is_split_into_words,
   3116         pad_to_multiple_of=pad_to_multiple_of,
   3117         padding_side=padding_side,
   3118         return_tensors=return_tensors,
   3119         return_token_type_ids=return_token_type_ids,
   3120         return_attention_mask=return_attention_mask,
   3121         return_overflowing_tokens=return_overflowing_tokens,
   3122         return_special_tokens_mask=return_special_tokens_mask,
   3123         return_offsets_mapping=return_offsets_mapping,
   3124         return_length=return_length,
   3125         verbose=verbose,
   3126         split_special_tokens=split_special_tokens,
   3127         **kwargs,
   3128     )
   3129 else:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:3310, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, padding_side, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, split_special_tokens, **kwargs)
   3301 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
   3302     padding=padding,
   3303     truncation=truncation,
   (...)
   3307     **kwargs,
   3308 )
-> 3310 return self._batch_encode_plus(
   3311     batch_text_or_text_pairs=batch_text_or_text_pairs,
   3312     add_special_tokens=add_special_tokens,
   3313     padding_strategy=padding_strategy,
   3314     truncation_strategy=truncation_strategy,
   3315     max_length=max_length,
   3316     stride=stride,
   3317     is_split_into_words=is_split_into_words,
   3318     pad_to_multiple_of=pad_to_multiple_of,
   3319     padding_side=padding_side,
   3320     return_tensors=return_tensors,
   3321     return_token_type_ids=return_token_type_ids,
   3322     return_attention_mask=return_attention_mask,
   3323     return_overflowing_tokens=return_overflowing_tokens,
   3324     return_special_tokens_mask=return_special_tokens_mask,
   3325     return_offsets_mapping=return_offsets_mapping,
   3326     return_length=return_length,
   3327     verbose=verbose,
   3328     split_special_tokens=split_special_tokens,
   3329     **kwargs,
   3330 )

TypeError: PreTrainedTokenizerFast._batch_encode_plus() got an unexpected keyword argument 'legacy'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[5], line 9
      1 messages = [
      2     {
      3         "role": "user",
   (...)
      7     }
      8 ]
----> 9 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     10 print(outputs[0]["generated_text"])

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
    285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
    286     text[0], (list, tuple, dict)
    287 ):
    288     # We have one or more prompts in list-of-dicts format, so this is chat mode
    290     if isinstance(text[0], dict):
--> 291         return super().__call__(Chat(text, images), **kwargs)
    292     else:
    293         if images is None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1294     return next(
   1295         iter(
   1296             self.get_iterator(
   (...)
   1299         )
   1300     )
   1301 else:
-> 1302     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1308, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
-> 1308     model_inputs = self.preprocess(inputs, **preprocess_params)
   1309     model_outputs = self.forward(model_inputs, **forward_params)
   1310     outputs = self.postprocess(model_outputs, **postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:398, in ImageTextToTextPipeline.preprocess(self, inputs, truncation, padding, max_length, timeout, continue_final_message)
    396 except TypeError:
    397     kwargs.pop("legacy", None)
--> 398     model_inputs = self.processor(images=images, text=text, return_tensors=self.framework, **kwargs).to(
    399         dtype=self.torch_dtype
    400     )
    402 model_inputs["text"] = inputs_text
    404 return model_inputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/processing_mllama.py:290, in MllamaProcessor.__call__(self, images, text, audio, videos, **kwargs)
    288 n_images_in_images = [0]
    289 if images is not None:
--> 290     images = make_list_of_images(images)
    291     n_images_in_images = [len(sample) for sample in images]
    293 if text is not None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/image_processing_mllama.py:543, in make_list_of_images(images)
    541     output_images = images
    542 else:
--> 543     raise ValueError(
    544         "Invalid input type. Must be a single image, a list of images, or a list of batches of images."
    545     )
    546 return output_images

ValueError: Invalid input type. Must be a single image, a list of images, or a list of batches of images.

I also tried running it just as above but with an image input, and that resulted in an OutOfMemoryError. This is confusing, because the model is only 166G on disk and I'm running this in a 4x80G (i.e. 320G) H100 Lambda Labs environment.

---------------------------------------------------------------------------
OutOfMemoryError                          Traceback (most recent call last)
Cell In[6], line 23
      1 # messages = [
      2 #     {
      3 #         "role": "user",
   (...)
      9 # outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     10 # print(outputs[0]["generated_text"])
     11 messages = [
     12     {
     13         "role": "user",
   (...)
     21     }
     22 ]
---> 23 outputs = pipe(text=messages, max_new_tokens=20, return_full_text=False)
     24 print(outputs[0]["generated_text"])

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:291, in ImageTextToTextPipeline.__call__(self, images, text, **kwargs)
    285 if isinstance(text, (list, tuple, KeyDataset) if is_torch_available() else (list, tuple)) and isinstance(
    286     text[0], (list, tuple, dict)
    287 ):
    288     # We have one or more prompts in list-of-dicts format, so this is chat mode
    290     if isinstance(text[0], dict):
--> 291         return super().__call__(Chat(text, images), **kwargs)
    292     else:
    293         if images is None:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1302, in Pipeline.__call__(self, inputs, num_workers, batch_size, *args, **kwargs)
   1294     return next(
   1295         iter(
   1296             self.get_iterator(
   (...)
   1299         )
   1300     )
   1301 else:
-> 1302     return self.run_single(inputs, preprocess_params, forward_params, postprocess_params)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1309, in Pipeline.run_single(self, inputs, preprocess_params, forward_params, postprocess_params)
   1307 def run_single(self, inputs, preprocess_params, forward_params, postprocess_params):
   1308     model_inputs = self.preprocess(inputs, **preprocess_params)
-> 1309     model_outputs = self.forward(model_inputs, **forward_params)
   1310     outputs = self.postprocess(model_outputs, **postprocess_params)
   1311     return outputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/base.py:1209, in Pipeline.forward(self, model_inputs, **forward_params)
   1207     with inference_context():
   1208         model_inputs = self._ensure_tensor_on_device(model_inputs, device=self.device)
-> 1209         model_outputs = self._forward(model_inputs, **forward_params)
   1210         model_outputs = self._ensure_tensor_on_device(model_outputs, device=torch.device("cpu"))
   1211 else:

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/pipelines/image_text_to_text.py:412, in ImageTextToTextPipeline._forward(self, model_inputs, generate_kwargs)
    408 prompt_text = model_inputs.pop("text")
    409 input_ids = (
    410     model_inputs["input_ids"] if "input_ids" in model_inputs else model_inputs["decoder_input_ids"]
    411 )  # for decoder-only models
--> 412 generated_sequence = self.model.generate(**model_inputs, **generate_kwargs)
    414 return {"generated_sequence": generated_sequence, "prompt_text": prompt_text, "input_ids": input_ids}

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py:116, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    113 @functools.wraps(func)
    114 def decorate_context(*args, **kwargs):
    115     with ctx_factory():
--> 116         return func(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:2208, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2200     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2201         input_ids=input_ids,
   2202         expand_size=generation_config.num_return_sequences,
   2203         is_encoder_decoder=self.config.is_encoder_decoder,
   2204         **model_kwargs,
   2205     )
   2207     # 12. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2208     result = self._sample(
   2209         input_ids,
   2210         logits_processor=prepared_logits_processor,
   2211         stopping_criteria=prepared_stopping_criteria,
   2212         generation_config=generation_config,
   2213         synced_gpus=synced_gpus,
   2214         streamer=streamer,
   2215         **model_kwargs,
   2216     )
   2218 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2219     # 11. prepare beam search scorer
   2220     beam_scorer = BeamSearchScorer(
   2221         batch_size=batch_size,
   2222         num_beams=generation_config.num_beams,
   (...)
   2227         max_length=generation_config.max_length,
   2228     )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/generation/utils.py:3176, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, **model_kwargs)
   3173 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   3175 # forward pass to get next token
-> 3176 outputs = self(**model_inputs, return_dict=True)
   3178 # synced_gpus: don't waste resources running the code we don't need; kwargs must be updated before skipping
   3179 model_kwargs = self._update_model_kwargs_for_generation(
   3180     outputs,
   3181     model_kwargs,
   3182     is_encoder_decoder=self.config.is_encoder_decoder,
   3183 )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:170, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    168         output = module._old_forward(*args, **kwargs)
    169 else:
--> 170     output = module._old_forward(*args, **kwargs)
    171 return module._hf_hook.post_forward(module, output)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:2138, in MllamaForConditionalGeneration.forward(self, input_ids, pixel_values, aspect_ratio_mask, aspect_ratio_ids, attention_mask, cross_attention_mask, cross_attention_states, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
   2135     cross_attention_mask = cross_attention_mask[:, :, cache_position]
   2136     full_text_row_masked_out_mask = full_text_row_masked_out_mask[:, :, cache_position]
-> 2138 outputs = self.language_model(
   2139     input_ids=input_ids,
   2140     attention_mask=attention_mask,
   2141     position_ids=position_ids,
   2142     cross_attention_states=cross_attention_states,
   2143     cross_attention_mask=cross_attention_mask,
   2144     full_text_row_masked_out_mask=full_text_row_masked_out_mask,
   2145     past_key_values=past_key_values,
   2146     use_cache=use_cache,
   2147     inputs_embeds=inputs_embeds,
   2148     labels=labels,
   2149     output_hidden_states=output_hidden_states,
   2150     output_attentions=output_attentions,
   2151     return_dict=return_dict,
   2152     cache_position=cache_position,
   2153     num_logits_to_keep=num_logits_to_keep,
   2154 )
   2156 return outputs

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/transformers/models/mllama/modeling_mllama.py:1948, in MllamaForCausalLM.forward(self, input_ids, attention_mask, position_ids, cross_attention_states, cross_attention_mask, full_text_row_masked_out_mask, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, return_dict, cache_position, num_logits_to_keep)
   1931 outputs = self.model(
   1932     input_ids=input_ids,
   1933     cross_attention_states=cross_attention_states,
   (...)
   1944     cache_position=cache_position,
   1945 )
   1947 hidden_states = outputs[0]
-> 1948 logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()
   1950 loss = None
   1951 if labels is not None:
   1952     # Upcast to float if we need to compute the loss to avoid potential precision issues

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1553, in Module._wrapped_call_impl(self, *args, **kwargs)
   1551     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1552 else:
-> 1553     return self._call_impl(*args, **kwargs)

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/torch/nn/modules/module.py:1562, in Module._call_impl(self, *args, **kwargs)
   1557 # If we don't have any hooks, we want to skip the rest of the logic in
   1558 # this function, and just call forward.
   1559 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1560         or _global_backward_pre_hooks or _global_backward_hooks
   1561         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1562     return forward_call(*args, **kwargs)
   1564 try:
   1565     result = None

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:165, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164 def new_forward(module, *args, **kwargs):
--> 165     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    166     if module._hf_hook.no_grad:
    167         with torch.no_grad():

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/hooks.py:355, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    347         if (
    348             value is not None
    349             and self.tied_params_map is not None
    350             and value.data_ptr() in self.tied_params_map
    351             and self.execution_device not in self.tied_params_map[value.data_ptr()]
    352         ):
    353             self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
--> 355         set_module_tensor_to_device(
    356             module,
    357             name,
    358             self.execution_device,
    359             value=value,
    360             fp16_statistics=fp16_statistics,
    361             tied_params_map=self.tied_params_map,
    362         )
    364 return send_to_device(args, self.execution_device), send_to_device(
    365     kwargs, self.execution_device, skip_keys=self.skip_keys
    366 )

File ~/.cache/pypoetry/virtualenvs/fis-eda-hqM-d85b-py3.10/lib/python3.10/site-packages/accelerate/utils/modeling.py:329, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    327             module._parameters[tensor_name] = param_cls(new_value, requires_grad=old_value.requires_grad)
    328 elif isinstance(value, torch.Tensor):
--> 329     new_value = value.to(device)
    330 else:
    331     new_value = torch.tensor(value, device=device)

OutOfMemoryError: CUDA out of memory. Tried to allocate 3.91 GiB. GPU 0 has a total capacity of 79.10 GiB of which 2.12 GiB is free. Including non-PyTorch memory, this process has 76.97 GiB memory in use. Of the allocated memory 75.56 GiB is allocated by PyTorch, and 761.23 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

yonigozlan (Member Author)

Thanks for the feedback @knkski! Although it's not really an objective of this pipeline, I think we can try to add support and at least raise a warning, wdyt @Rocketknight1?
For the memory problem, that is strange indeed. I will look into it, and if others have an idea of why this is happening, feel free to chime in. Do you manage to use this model on your setup without using the pipeline?

Rocketknight1 (Member)

@yonigozlan I think that's okay! It might result in a bit of crossover with text-generation pipelines, but I think it's fine, and we can deprecate it later and officially move that functionality to text-generation if it's a problem.

yonigozlan (Member Author)

@Rocketknight1 @knkski , text-only inference should be supported now :)

knkski commented Oct 18, 2024

@yonigozlan Thanks! Works great for me 🚀

I think the extra memory usage is unrelated to this PR, so ignore that 👍

yonigozlan force-pushed the add-image-text-to-text-pipeline branch 2 times, most recently from 7038c52 to 46d6891 on October 22, 2024 13:51
Rocketknight1 (Member) left a comment

Overall, this looks good! The tests seem good and the pipeline code looks clean! A lot of the code is familiar from the text-generation pipeline, with modifications for images.

The only question I have is whether it'll be confusing to have e.g. image-text-to-text as well as image-to-text and text-generation pipelines. In particular, it feels like this pipeline is almost a "superset" of text-generation, since it can handle both text completions and chat completions with templates, which means it's basically just text-generation plus image support.

That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?

Review comments (outdated, resolved) on:
  • src/transformers/models/blip/processing_blip.py
  • src/transformers/tokenization_utils_base.py
  • src/transformers/pipelines/image_text_to_text.py (4 threads)
Wauplin (Contributor) commented Oct 23, 2024

That might mean we should take these changes and fold them into text-generation instead. However, that might add additional inputs that would make it harder to synchronize the pipeline with the inference spec - cc @Wauplin / @LysandreJik, how annoying do you think that would be?

Cross-posting the (private) Slack thread about that conversation.
IMO it's better to have both text-generation and image-text-to-text, to be consistent with https://huggingface.co/tasks.

yonigozlan force-pushed the add-image-text-to-text-pipeline branch from 31432b4 to d739c0a on October 24, 2024 20:56
yonigozlan (Member Author)

There are still some issues with the pipeline tests:

  • It seems that pipeline model tests are based on "tiny models" available on hf-internal-testing, but those tiny models don't seem to be added anymore for recent VLMs, so those models are not being tested. I'm not sure whether this is (or used to be) an automatic or a manual process, and whether we should start adding those tiny models again.
  • The Kosmos2 tiny model causes some problems: its configuration has hyperparameters that are not compatible with each other. latent_query_num=3, which is a model parameter, should be the same as num_image_tokens=64, which is a processor call argument, so it can't be set via a JSON config file (I think?); the mismatch is sketched just after this list. An easy fix would be to manually change latent_query_num to 64 in the tiny model's config on hf-internal-testing, but that could make the model not so tiny anymore. Or we could skip the test altogether.
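
To make the mismatch concrete, here is a rough sketch (the tiny checkpoint id is a placeholder, and the call follows the description above rather than a verified Kosmos2 processor signature, so treat it as an assumption):

from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

checkpoint = "hf-internal-testing/tiny-random-Kosmos2"  # placeholder tiny-model id
model = AutoModelForVision2Seq.from_pretrained(checkpoint)  # tiny config sets latent_query_num=3
processor = AutoProcessor.from_pretrained(checkpoint)

# num_image_tokens is a processor *call* argument, so it can't be pinned in the tiny model's
# JSON config; it has to be passed explicitly so that it matches the model's latent_query_num.
image = Image.new("RGB", (32, 32))
inputs = processor(
    images=image,
    text="An image of",
    num_image_tokens=model.config.latent_query_num,
    return_tensors="pt",
)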

Rocketknight1 (Member)

@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!

yonigozlan (Member Author)

@yonigozlan tiny models aren't automatically generated, those are all manually created. Rather than modifying an existing one (which might break existing tests), I'd suggest just making a new tiny model that fits what you want to test and uploading that to hf-internal-testing. You can ask to be added to the organization if you don't have permissions!

I see, thanks for the explanation! As for adding a new tiny model: pipelines use the tiny_model_summary.json file to identify tiny models, but it looks like only one tiny model per model architecture can be present in that file, so I'm not sure how to solve the issue with the Kosmos2 tiny model without modifying the current one.

Rocketknight1 (Member)

@yonigozlan probably the easiest thing to do, in that case, is just to manually upload a new model, don't add it to tiny_model_summary, and manually set that model in the image-text-to-text tests. You shouldn't need to worry about whatever's in tiny_model_summary.json either way!

Also, I was wrong - some of the tiny models are automatically created, but in this case I think a manual one just for your pipeline will work a lot better.

ArthurZucker (Collaborator) left a comment

Thanks for working on this! I think it's very important, so we should try to make it a bit simpler. 🤗

Review comments (outdated, resolved) on:
  • src/transformers/models/donut/processing_donut.py
  • src/transformers/models/fuyu/processing_fuyu.py
  • src/transformers/pipelines/__init__.py
  • src/transformers/pipelines/image_text_to_text.py (6 threads)
yonigozlan force-pushed the add-image-text-to-text-pipeline branch from c05ceb2 to 61cc576 on October 31, 2024 19:25
yonigozlan (Member Author) commented Oct 31, 2024

Thanks for all of your input! I'll merge this now, as the remaining issues/improvements raised seem a bit out of scope for this PR.
Just to recap some of the points that were raised:

  • VLM processors are not fully consistent in terms of what inputs they accept, and some of them don't catch errors that should be caught. Improvements can be made there that would benefit this pipeline as well. I'll open an issue to track this as a known limitation, and I'll start working on it asap :).
  • Donut doesn't work in this pipeline, as processors are not inferred in pipelines when they are not in the auto mapping.
  • Chat templates could be applied directly in conversational models' processors instead of users having to do so manually before the processor call? Chat inputs could be detected since they are lists of dicts.
  • Several pipelines have their own way of detecting the input prompt in the generated text and removing or keeping it. This could be unified in a util (a rough sketch follows this list), or in generate with an added "return_input" flag.
  • Most recent models (and VLMs in particular) don't have a "tiny" version uploaded to hf-internal-testing, which means they are not tested by the CI in the different pipelines that support them.
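
As a rough illustration, the shared util mentioned above could be as simple as the following hypothetical helper (not something added in this PR):

def strip_prompt(generated_text: str, prompt_text: str) -> str:
    """Return only the newly generated part when the decoded output echoes the input prompt."""
    if generated_text.startswith(prompt_text):
        return generated_text[len(prompt_text):].lstrip()
    return generated_text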

@yonigozlan yonigozlan merged commit 203e270 into huggingface:main Oct 31, 2024
26 checks passed
frances720 pushed a commit to Promptless/transformers-test that referenced this pull request Nov 6, 2024
* Standardize image-text-to-text-models-output

add post_process_image_text_to_text to chameleon and cleanup

Fix legacy kwarg behavior and deprecation warning

add post_process_image_text_to_text to qwen2_vl and llava_onevision

Add post_process_image_text_to_text to idefics3, mllama, pixtral processor

* nit var name post_process_image_text_to_text udop

* nit fix deprecation warnings

* Add image-text-to-text pipeline

* add support for image url in chat template for pipeline

* Reformat to be fully compatible with chat templates

* Add tests chat template

* Fix imports and tests

* Add pipeline tag

* change logic handling of single prompt ans multiple images

* add pipeline mapping to models

* fix batched inference

* fix tests

* Add manual batching for preprocessing

* Fix outputs with nested images

* Add support for all common processing kwargs

* Add default padding when multiple text inputs (batch size>1)

* nit change version deprecation warning

* Add support for text only inference

* add chat_template warnings

* Add pipeline tests and add copied from post process function

* Fix batched pipeline tests

* nit

* Fix pipeline tests blip2

* remove unnecessary max_new_tokens

* revert processing kosmos2 and remove unnecessary max_new_tokens

* fix pipeline tests idefics

* Force try loading processor if pipeline supports it

* revert load_processor change

* hardcode loading only processor

* remove unnecessary try except

* skip imagetexttotext tests for kosmos2 as tiny model causes problems

* Make code clearer

* Address review comments

* remove preprocessing logic from pipeline

* fix fuyu

* add BC resize fuyu

* Move post_process_image_text_to_text to ProcessorMixin

* add guard in post_process

* fix zero shot object detection pipeline

* add support for generator input in pipeline

* nit

* change default image-text-to-text model to llava onevision

* fix owlv2 size dict

* Change legacy deprecation warning to only show when True
2015aroras pushed a commit to 2015aroras/transformers that referenced this pull request Nov 15, 2024 (same commit message as above)
BernardZach pushed a commit to BernardZach/transformers that referenced this pull request Dec 5, 2024 (same commit message as above)
BernardZach pushed a commit to innovationcore/transformers that referenced this pull request Dec 6, 2024 (same commit message as above)