Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uniform kwargs for processors #31911

Open
39 of 40 tasks
zucchini-nlp opened this issue Jul 11, 2024 · 44 comments
Open
39 of 40 tasks

Uniform kwargs for processors #31911

zucchini-nlp opened this issue Jul 11, 2024 · 44 comments
Labels
contributions-welcome Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!

Comments

@zucchini-nlp
Copy link
Member

zucchini-nlp commented Jul 11, 2024

Feature request

We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted.

The initial design is merged and an example model is modified to follow the new uniform processor kwargs in #31198. Also #31197 has two more examples with standardized API.

This design has to be shipped to all the processors in Transformers, and appreciate contributions.
Below is an incomplete list of models that need standardization, feel free to add a model if it's missing:

Note: For now we'll start with image or image+text, #31368 is an ongoing PR that has also audio processor standardization

Motivation

.

Your contribution

.

@zucchini-nlp zucchini-nlp added Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want! contributions-welcome labels Jul 11, 2024
@zucchini-nlp
Copy link
Member Author

cc @molbap @NielsRogge , I added only models I see commonly to the list and all VLMs to unblock the pipeline

@davidgxue
Copy link
Contributor

I can take CLIP and LLaVa

@zucchini-nlp
Copy link
Member Author

@davidgxue okey, feel free to open a PR when it's ready.

@OmarManzoor
Copy link
Contributor

I would like to work on BLIP-2. Just to clarify we only need to change BLIP-2 right and not BLIP? Because there is a comment which mentions

# Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__

@zucchini-nlp
Copy link
Member Author

@OmarManzoor my bad, forgot to add Blip to the list. You can work on Blip and all changes from BLIP will be ported to BLIP2 automatically :)

I'll add Blip to the list and assign to you then

@molbap
Copy link
Contributor

molbap commented Jul 15, 2024

@OmarManzoor @zucchini-nlp missed it, I already started work on a few models here. Please check the original PRs here #31198 and here #31368 , BLIP, BLIP-2, Donut and a couple more are already handled

@OmarManzoor
Copy link
Contributor

Please check the original PRs here #31198 and here #31368 , BLIP, BLIP-2, Donut and a couple more are already handled

Thank you for clarifying.

@zucchini-nlp
Copy link
Member Author

@leloykun this is one of the trackers we have for start. There'a another PR for standardizing VLM from generation perspective. And unfortunately other tasks will be blocked by these.

If you want to work on this task or maybe in making a wrapper for VLMTokenizer, let me know!

@leloykun
Copy link
Contributor

thanks @zucchini-nlp!

I can take LayoutLM (1, 2, 3) & Chameleon

@bhuvanmdev
Copy link
Contributor

bhuvanmdev commented Jul 28, 2024

I can take owlv2 and vit. But for owlv2 there are multiple helper functions and classes that are being copied from owlvit, so does that mean i need to work on owlvit?

# Copied from transformers.models.owlvit.processing_owlvit.OwlViTProcessor.__call__ with OWLViT->OWLv2

@MnCSSJ4x
Copy link

MnCSSJ4x commented Jul 28, 2024

Can I take dino and Paligemma if no ones working on it?

@zucchini-nlp
Copy link
Member Author

@bhuvanmdev yes, if owlvit processing code is identical to owl, it will be simply copied from

@MnCSSJ4x sure

@MnCSSJ4x
Copy link

MnCSSJ4x commented Jul 30, 2024

@zucchini-nlp I started working on Paligemma and tried to follow the PRs mentioned here. In paligemma there is not test file for the processor. Do I need to add those tests (if that's the case, please point me to how can I do that) and also can that be used directly to check if the changes are non breaking? I can raise a temp PR so that we can discuss it there.

@zucchini-nlp
Copy link
Member Author

@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested

@MnCSSJ4x
Copy link

MnCSSJ4x commented Aug 1, 2024

@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file and ProcessorTesterMixin to it, so that new changes are all tested

Thanks I have created a PR #32377 and tagged you there. Please let me know how can I get started regarding testing the same.

@leloykun leloykun mentioned this issue Aug 6, 2024
5 tasks
@yonigozlan
Copy link
Member

I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT) as I have already been working on their processors for #32471 .

@zucchini-nlp
Copy link
Member Author

@yonigozlan Thanks, that would be great!

@yonigozlan
Copy link
Member

No problem @davidgxue! I will get started on LLaVa then

@leloykun
Copy link
Contributor

leloykun commented Aug 16, 2024

Summary of my progress:

Model Status Has Tests? Special Args PR
Chameleon Ready Yes - #32181
--- --- --- --- ---
AltCLIP Ready Yes - #32845
Flava Ready Yes - #32845
Git Ready Yes - #32845
InstructBlipVideo Ready Yes - #32845
LLaVa-NeXT-Video Ready Yes - #32845
MGP Ready Yes - #32845
Siglip Ready Yes - #32845
TVP Ready Yes - #32845
VideoLLaVa Ready Yes - #32845
VILT Ready Yes - #32845
X-CLIP Ready Yes - #32845
--- --- --- --- ---
LayoutLMv2 Ready Yes text_pair, boxes, word_labels #32180
LayoutLMv3 Ready Yes text_pair, boxes, word_labels #32180
LayoutXLM Ready Yes text_pair, boxes, word_labels #32180
--- --- --- --- ---
ClipSeg Ready Yes visual_prompt #32841
Nougat Ready Yes text_pair, text_target, text_pair_target(apparently these are in the tokenizer base class) #32841
OwlV2 Ready Yes query_images #32841
OwlVIT Ready Yes query_images #32841
--- --- --- --- ---
Clap Not Ready No ?? #32906
CLVP Not Ready No ?? #32906
MusicGen Melody Not Ready No ?? #32906
PopPiano Not Ready No ?? #32906
Qwen2 Audio Not Ready No ?? #32906
Seamless M4T Not Ready No ?? #32906
SpeechT5 Ready Yes - #32906
Wav2Vec2 Bert Ready Yes - #32906

#32181 & #32845 are now ready for review. I could also decouple InstructBlipVideo, LLaVa-NeXT-Video, Siglip, and VideoLLaVa to a separate PR just to get them out--just lemme know if there's a need to rush.

The rest have special args that we still have to figure out how to handle. cc @yonigozlan

Update: all of the PRs are now ready for review

@leloykun
Copy link
Contributor

leloykun commented Aug 16, 2024

Processors with weird output keys:

Model weird key to expected key mapping
LayoutLMv2 image -> pixel_values
LayoutXLM image -> pixel_values
MGP labels -> input_ids
Nougat labels -> input_ids
TVP pixel_values -> pixel_values_videos
VideoLlava pixel_values_images -> pixel_values
X-Clip pixel_values -> pixel_values_videos

@molbap @zucchini-nlp what do you think of standardizing/uniformizing these too?

@leloykun
Copy link
Contributor

leloykun commented Aug 16, 2024

Models missing from this list:

  • AltClip
  • Flava
  • GIT
  • MGP
  • OneFormer (special args: task_inputs & segmentation_maps; has some weird logic with task_inputs)
  • SAM (special args: segmentation_maps, input_points, input_labels, & input_boxes)
  • TrOCR (has some weird _in_target_context_manager logic; is this deprecated?)
  • TVP
  • VisionTextDualEncoder
  • ViLT
  • X-CLIP

The rest either doesn't have an image processor or only has a feature_extractor_class

@leloykun
Copy link
Contributor

@molbap @yonigozlan I wanna raise this here so our implementations would be more consistent:

I've implemented a saner way to handle special processor call args for backwards compatibility that doesn't involve re-using unrelated args (e.g. audio & video) and doesn't need extra arguments like backwards_compatibility_placeholder_arg.

Tl;dr: I followed @molbap's advise to capture them using *args instead and added prepare_and_validate_optional_call_args to auto convert them to kwargs which we can then pass to _merge_kwargs along with the other arguments.

See my implementation here #32841 and here #32180


For everyone else: the "special arguments" here are arguments that carry data from user input, but aren't named text, images, audio, nor videos nor are configs to the tokenizer, image processor, etc. For example, the bounding boxes for LayoutLM* models, the visual prompt for ClipSeg, etc.

The problem with these args is that some users pass them as positional arguments to the processors. So, if we wanna restrict the processor call arguments to only those four and kwargs, then we're gonna have problems handling these special arguments.


Now, we only need to do:

  1. Add the ModelProcessorKwargs class (same as before)
  2. Add optional_call_args = [...] as an attribute to the processor class
  3. Add *args, to where in the call signature the special arguments were before (e.g. for LayoutLM* models, it's right after text and images)
  4. Add **self.prepare_and_validate_optional_call_args(*args), as an argument to self._merge_kwargs. E.g.:
output_kwargs = self._merge_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
    **self.prepare_and_validate_optional_call_args(*args),
)

Alternatively, I could move prepare_and_validate_optional_call_args to _merge_kwargs and perhaps rename it to _merge_args_and_kwargs. This way, the interface would be something like this instead:

output_kwargs = self._merge_args_and_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    *args,
    **kwargs,
)

Lemme know what you think

@yonigozlan
Copy link
Member

That looks great to me @leloykun thanks a lot for working on that! I personally like it as you've done it in #32841 and #32180 with self.prepare_and_validate_optional_call_args(*args), as it makes it easy to remove once this behavior is fully deprecated. For that reason I don't think we should replace _merge_kwargs by a _merge_args_and_kwargs.
If others agree, I will add this to Udop and other models I have been working on if they need them.

@zucchini-nlp
Copy link
Member Author

For me it looks an good workaround, and should handle BC correctly. I left comments in the corresponding PRs, and I agree about renaming _merge_kwargs.

I'll leave some general questions/comments here, as there are so many PRs currently and I might miss some of them. So, as part of standardization we have to

  • use order as image, text in processors. I see confusion around why we're swapping these. Since I wasn't very closely working on standardization, my guess is that it's a common pattern from most processors and might help with pipelines. @yonigozlan if you have any ideas as you've been working closely with @molbap
  • IMO we have to return BatchFeature and not BatchEncoding in processor's call, as was done in some PRs already. I don't see any reason why BatchEncoding so this shouldn't cause issues
  • use the above proposed way to deprecate extra args and add a major version after which positional args won;t be supported, that can be v4.47

@leloykun
Copy link
Contributor

@yonigozlan @molbap I don't see any good reason why we should swap the args. It just adds complications with backwards compatibility. And besides, we can just pass them as keyword arguments in the pipelines.

@yonigozlan
Copy link
Member

To me, swapping the args is part of the effort to standardize the processors. Right now, we have some VLMs where the processor takes images first and others where it takes text first. Even if we discourage using positional arguments for processor calls, I imagine most users will still use them for images and text. Having to guess which comes first depending on the model doesn't seem ideal.

That said, since the argument swapping isn't a blocker for the image-text-to-text pipeline, I suggest we remove it for now to get this merged as soon as possible. The backward compatibility is indeed tricky to manage, and the current checks feel a bit too hacky for a merge. We can open a separate PR later to focus on argument swapping if we decide it's necessary.

How does that sound, @zucchini-nlp, @leloykun?

@leloykun
Copy link
Contributor

leloykun commented Aug 19, 2024

@yonigozlan yup, that sounds good.

btw, @yonigozlan @zucchini-nlp @molbap what do you think of forcing the use of keyword arguments in future versions? I.e. having this signature?

def __call__(
    self,
    *,
    text: ...,
    images: ...,
    audio: ...,
    videos: ...,
    **kwargs: Unpack[...],
) -> BatchFeature:

@amyeroberts
Copy link
Collaborator

@leloykun This is something we could consider for e.g. v5 of transformers but I wouldn't enforce it at the moment as we're going through minor versions: this would break a lot of code for a lot of people.

@zucchini-nlp
Copy link
Member Author

Afaik the order is swapped only in some VLMs and we want to follow image, text order in the end. Indeed there are many BC handlings happening, but swapping positions doesn't seem very hard to handle. Also, it won't flood cmd with warnings because processors with swapped order usually have no other cases to handle.

So, I am for changing order of args as part of standardization, and address comments if there are any to make the checks more reliable. If you are more comfortable with having a separate PR, I'm okay with separating out VLMs from current PR and opening a new one.

What I suggest for faster iteration is to first review and merge one model, for ex, LLaVa has a PR for itself now. After everyone is happy with it, we can merge and copy changes in other models. Same for handling "special positional args", I guess @leloykun had a PR with one model only.

@yonigozlan
Copy link
Member

@zucchini-nlp Sounds good! I will probably do a separate PR for each model that needs the argument switch, just because it adds a lot of noisy changes in tests, docstrings etc. I will also work on finding a better way to support BC and implement it first in the LLaVa PR. Will ping you when that's done :)

@leloykun
Copy link
Contributor

Also started uniformization of kwargs of processors of audio-text models here: #32906

It's still a draft on top of #32845 but SpeechT5 & Wav2Vec2 Bert should be done now

@leloykun
Copy link
Contributor

hmmm... now that I think about it... why is it audio instead of audios?

@molbap
Copy link
Contributor

molbap commented Sep 18, 2024

@leloykun I chose audio instead of audios because the second one, despite being a valid English word, is barely used. videos is valid because it is a shortcut for videotapes. So videos is the countable form, but there is no much sense in having a countable form for audio, so the uncountable plural form audio is kept.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contributions-welcome Good Second Issue Issues that are more difficult to do than "Good First" issues - give it a try if you want!
Projects
None yet
Development

No branches or pull requests

10 participants