Uniform kwargs for processors #31911
cc @molbap @NielsRogge, I added only the models I see commonly to the list, plus all VLMs, to unblock the pipeline
I can take CLIP and LLaVa
@davidgxue okay, feel free to open a PR when it's ready.
I would like to work on BLIP-2. Just to clarify, we only need to change BLIP-2, right, and not BLIP? Because there is a comment that mentions # Copied from transformers.models.blip.processing_blip.BlipProcessor.__call__
@OmarManzoor my bad, I forgot to add BLIP to the list. You can work on BLIP, and all changes from BLIP will be ported to BLIP-2 automatically :) I'll add BLIP to the list and assign it to you then
@OmarManzoor @zucchini-nlp missed it, I already started work on a few models. Please check the original PRs here #31198 and here #31368; BLIP, BLIP-2, Donut, and a couple more are already handled
@leloykun this is one of the trackers we have to start with. There's another PR for standardizing VLMs from the generation perspective. And unfortunately other tasks will be blocked by these. If you want to work on this task, or maybe on making a wrapper for VLMTokenizer, let me know!
thanks @zucchini-nlp! I can take LayoutLM (1, 2, 3) & Chameleon
I can take OWLv2 and ViT. But for OWLv2 there are multiple helper functions and classes that are copied from OWL-ViT, so does that mean I need to work on OWL-ViT?
Can I take DINO and PaliGemma if no one's working on them?
@bhuvanmdev yes, if the OWLv2 processing code is identical to OWL-ViT's, it will simply be copied from it. @MnCSSJ4x sure
@zucchini-nlp I started working on PaliGemma and tried to follow the PRs mentioned here. In PaliGemma there is no test file for the processor. Do I need to add those tests (if that's the case, please point me to how I can do that), and can those tests be used directly to check that the changes are non-breaking? I can raise a temp PR so that we can discuss it there.
@MnCSSJ4x yes, feel free to open a PR so that we can discuss it there. And yes, in that case we need to add the test file.
Thanks, I have created PR #32377 and tagged you there. Please let me know how I can get started on testing it.
I can work on the remaining image-text-to-text models (Fuyu, Idefics/2, InstructBlip, Kosmos-2, LLaVa-NeXT), as I have already been working on their processors for #32471.
@yonigozlan Thanks, that would be great!
No problem @davidgxue! I will get started on LLaVa then
Summary of my progress:
Update: all of the PRs are now ready for review
Processors with weird output keys:
@molbap @zucchini-nlp what do you think of standardizing/uniformizing these too?
Models missing from this list:
The rest either don't have an image processor or only have a ...
@molbap @yonigozlan I wanna raise this here so our implementations will be more consistent: I've implemented a saner way to handle special processor call args for backwards compatibility that doesn't involve re-using unrelated args (e.g. ...).

Tl;dr: I followed @molbap's advice to capture them with a dedicated helper, prepare_and_validate_optional_call_args. See my implementation here #32841 and here #32180.

For everyone else: the "special arguments" here are arguments that carry data from user input but aren't named text, images, audio, or videos. The problem with these args is that some users pass them as positional arguments to the processors. So, if we wanna restrict the processor call arguments to only those four plus kwargs, we'd break backwards compatibility for those users.

Now, we only need to do:

```python
output_kwargs = self._merge_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    **kwargs,
    **self.prepare_and_validate_optional_call_args(*args),
)
```

Alternatively, I could move the handling of the extra positional args into the merge step itself (renamed to _merge_args_and_kwargs) and do:

```python
output_kwargs = self._merge_args_and_kwargs(
    CLIPSegProcessorKwargs,
    tokenizer_init_kwargs=self.tokenizer.init_kwargs,
    *args,
    **kwargs,
)
```

Lemme know what you think
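For context, here is a minimal sketch of what such a helper could look like; the optional_call_args attribute, the warning text, and the visual_prompt example are assumptions for illustration, not necessarily what #32841 / #32180 implement:

```python
import warnings


class ProcessorSketch:
    # Hypothetical attribute: the legacy positional-arg names, in the order users used to pass them.
    optional_call_args = ["visual_prompt"]

    def prepare_and_validate_optional_call_args(self, *args):
        """Map leftover positional args back to their legacy names so they can be merged as kwargs."""
        if args:
            warnings.warn(
                "Passing extra positional arguments to the processor is deprecated; "
                "please pass them as keyword arguments instead.",
                FutureWarning,
            )
        if len(args) > len(self.optional_call_args):
            raise ValueError(
                f"Expected at most {len(self.optional_call_args)} optional positional argument(s), "
                f"got {len(args)}."
            )
        return dict(zip(self.optional_call_args, args))
```

With that in place, a legacy call that passes visual_prompt positionally keeps working: the extra positional arg is captured by *args and re-emitted as {"visual_prompt": ...} before the kwargs are merged and validated.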
That looks great to me @leloykun, thanks a lot for working on that! I personally like it as you've done it in #32841 and #32180, with prepare_and_validate_optional_call_args.
For me it looks like a good workaround, and it should handle BC correctly. I left comments in the corresponding PRs, and I agree about the renaming. I'll leave some general questions/comments here, as there are so many PRs currently and I might miss some of them. So, as part of standardization we have to:
@yonigozlan @molbap I don't see any good reason why we should swap the args. It just adds complications with backwards compatibility. And besides, we can just pass them as keyword arguments in the pipelines.
To me, swapping the args is part of the effort to standardize the processors. Right now, we have some VLMs where the processor takes images first and others where it takes text first. Even if we discourage using positional arguments for processor calls, I imagine most users will still use them for images and text. Having to guess which comes first depending on the model doesn't seem ideal. That said, since the argument swapping isn't a blocker for the image-text-to-text pipeline, I suggest we remove it for now to get this merged as soon as possible. The backward compatibility is indeed tricky to manage, and the current checks feel a bit too hacky for a merge. We can open a separate PR later to focus on argument swapping if we decide it's necessary. How does that sound, @zucchini-nlp, @leloykun?
@yonigozlan yup, that sounds good. btw, @yonigozlan @zucchini-nlp @molbap what do you think of forcing the use of keyword arguments in future versions? I.e. having this signature?

```python
def __call__(
    self,
    *,
    text: ...,
    images: ...,
    audio: ...,
    videos: ...,
    **kwargs: Unpack[...],
) -> BatchFeature:
```
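To make the trade-off concrete, here is a toy sketch (ToyProcessor is hypothetical, not a transformers class): with a keyword-only signature, legacy positional calls fail loudly instead of silently binding to the wrong parameter.

```python
class ToyProcessor:
    # Keyword-only signature, as proposed above.
    def __call__(self, *, text=None, images=None, audio=None, videos=None, **kwargs):
        return {"text": text, "images": images, "audio": audio, "videos": videos}


processor = ToyProcessor()
processor(text="a photo of a cat")  # OK
# processor("a photo of a cat")     # TypeError: takes 1 positional argument but 2 were given
```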
@leloykun This is something we could consider for e.g. v5 of transformers, but I wouldn't enforce it at the moment as we're going through minor versions: it would break a lot of code for a lot of people.
Afaik the order is swapped only in some VLMs, and we want to follow one consistent order. So, I am for changing the order of args as part of standardization, and addressing comments, if there are any, to make the checks more reliable. If you are more comfortable with having a separate PR, I'm okay with separating out the VLMs from the current PR and opening a new one. What I suggest for faster iteration is to first review and merge one model; for example, LLaVa has a PR of its own now. After everyone is happy with it, we can merge and copy the changes to the other models. Same for handling the "special positional args"; I guess @leloykun had a PR with one model only.
@zucchini-nlp Sounds good! I will probably do a separate PR for each model that needs the argument switch, just because it adds a lot of noisy changes in tests, docstrings etc. I will also work on finding a better way to support BC and implement it first in the LLaVa PR. Will ping you when that's done :)
hmmm... now that I think about it... why is it ...
@leloykun I chose ...
Feature request
We want to standardize the logic flow through Processor classes. Since processors can have different kwargs depending on the model and modality, we are adding a TypedDict for each modality to keep track of which kwargs are accepted. The initial design is merged, and an example model is modified to follow the new uniform processor kwargs in #31198. Also, #31197 has two more examples with the standardized API.
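As a rough illustration of the idea (the class and field names below are examples only, not the exact definitions merged in #31198):

```python
from typing import Optional, TypedDict, Union


class TextKwargs(TypedDict, total=False):
    # Kwargs forwarded to the tokenizer.
    padding: Union[bool, str]
    truncation: Union[bool, str]
    max_length: Optional[int]


class ImagesKwargs(TypedDict, total=False):
    # Kwargs forwarded to the image processor.
    do_resize: bool
    do_rescale: bool
    do_normalize: bool


class ProcessingKwargs(TypedDict, total=False):
    # One nested dict per modality, so each processor can declare exactly which kwargs it accepts.
    text_kwargs: TextKwargs
    images_kwargs: ImagesKwargs
```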
This design has to be shipped to all the processors in Transformers, and we would appreciate contributions.
Below is an incomplete list of models that need standardization; feel free to add a model if it's missing:
Note: for now we'll start with image or image+text; #31368 is an ongoing PR that also covers audio processor standardization
Motivation
.
Your contribution
.