[v5] 🚨Refactor subprocessors handling in processors #41633
Conversation
| "AutoFeatureExtractor": "FeatureExtractionMixin", | ||
| "AutoImageProcessor": "ImageProcessingMixin", | ||
| "AutoVideoProcessor": "BaseVideoProcessor", | ||
| "audio_tokenizer": "DacModel", |
We should be able to use AutoModelForAudioTokenization, no?
transformers/src/transformers/models/auto/modeling_auto.py
Lines 2248 to 2249 in e20df45
| class AutoModelForAudioTokenization(_BaseAutoModelClass): | |
| _model_mapping = MODEL_FOR_AUDIO_TOKENIZATION_MAPPING |
We want to standardize this in the future for other models as well
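A minimal sketch of what that suggestion could look like in the mapping shown above; the dict name is an assumption, and only the `audio_tokenizer` entry would change:

```python
# Hypothetical mapping sketch: point the generic "audio_tokenizer" attribute at the
# auto class instead of a model-specific class, so future audio-tokenizer models
# resolve through the same entry. (Dict name is illustrative, not from the diff.)
SUBPROCESSOR_BASE_CLASSES = {
    "AutoFeatureExtractor": "FeatureExtractionMixin",
    "AutoImageProcessor": "ImageProcessingMixin",
    "AutoVideoProcessor": "BaseVideoProcessor",
    "audio_tokenizer": "AutoModelForAudioTokenization",  # was "DacModel"
}
```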
Nice, the `attributes` class attribute was indeed a bit redundant. LGTM, though you might need to rebase on main and check if tests pass. I just merged a non-legacy saving PR.
| ), | ||
| ), | ||
| ("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), | ||
| ("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)), |
Not related to this PR, but using `PreTrainedTokenizer` as an auto-class looks funny 😄
Yes, not sure if that should be the case. @itazap is that expected / is it a potential issue?
Was wondering the same thing 👀
| ("video_llava", "VideoLlavaVideoProcessor"), | ||
| ("videomae", "VideoMAEVideoProcessor"), | ||
| ("vjepa2", "VJEPA2VideoProcessor"), | ||
| ("video_llama_3", "VideoLlama3VideoProcessor"), # PLACEHOLDER - needs proper video processor class |
Yes, my bad, remnants from the script...
| """ | ||
|
|
||
| attributes = ["image_processor", "tokenizer"] | ||
| valid_kwargs = ["chat_template", "num_image_tokens"] |
Oh, let's delete `valid_kwargs` wherever it was left; it's not used anywhere IIRC. Can totally be done in a separate PR.
Nice, might as well add it here
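A minimal sketch of that cleanup, using an illustrative processor class; the class names below are assumptions, not taken from the diff:

```python
from transformers import ProcessorMixin


# Before: the reportedly unused class attribute is still declared.
class SomeProcessorBefore(ProcessorMixin):
    valid_kwargs = ["chat_template", "num_image_tokens"]


# After the cleanup: the unused `valid_kwargs` declaration is simply dropped.
class SomeProcessorAfter(ProcessorMixin):
    pass
```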
| attributes = ["image_processor", "tokenizer", "qformer_tokenizer"] | ||
| image_processor_class = ("BlipImageProcessor", "BlipImageProcessorFast") | ||
| tokenizer_class = "AutoTokenizer" | ||
| qformer_tokenizer_class = "AutoTokenizer" |
Do we need `qformer_tokenizer_class`? In InstructBlipVideo I can see it is deleted.
Indeed we can delete it!
|
|
||
| attributes = ["image_processor", "char_tokenizer"] | ||
| image_processor_class = ("ViTImageProcessor", "ViTImageProcessorFast") | ||
| char_tokenizer_class = "MgpstrTokenizer" |
Same question: is it because they have a prefix before `tokenizer`?
yep my scripts failed on these 😅. Thanks for pointing it out!
| image_processor = EfficientNetImageProcessor.from_pretrained(self.tmpdirname) | ||
| image_processor.save_pretrained(self.tmpdirname) | ||
| tokenizer = BertTokenizer.from_pretrained(self.tmpdirname) | ||
| tokenizer.save_pretrained(self.tmpdirname) |
I think this might cause some issues in tests after I merged non-legacy processor saving. We'll end up with `preprocessor_config.json` and `processor_config.json` in the same dir.
I remember some tests try to manipulate configs by loading and saving the image processor either standalone or as part of the processor; they might start failing after rebase.
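A rough sketch of the scenario being described, assuming the non-legacy saving writes a separate `processor_config.json`; the checkpoint id is a placeholder, not a real repo:

```python
# Sketch: saving a standalone image processor and a full processor into the same
# directory can leave both config files side by side, which some tests may not expect.
import os
import tempfile

from transformers import AutoImageProcessor, AutoProcessor

checkpoint = "some-org/some-checkpoint"  # placeholder repo id

with tempfile.TemporaryDirectory() as tmpdirname:
    image_processor = AutoImageProcessor.from_pretrained(checkpoint)
    image_processor.save_pretrained(tmpdirname)  # writes preprocessor_config.json
    processor = AutoProcessor.from_pretrained(checkpoint)
    processor.save_pretrained(tmpdirname)  # may also write processor_config.json
    print(sorted(os.listdir(tmpdirname)))  # both configs can now coexist here
```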
Ah yes, in general I think we can standardize/simplify the processor tests a lot more in ProcessorTesterMixin. Right now it's a bit of a nightmare every time we want to make a change (I was going crazy on the default-to-fast image processors PR because of this). I plan to open a PR to change the tests soon!
Thanks, updating the tests would help a lot, and we could probably delete many overwritten tests as well.
Do you want me to trigger a full CI for this PR? Better to have it, @zucchini-nlp can confirm it's helpful 😅. Let me know once it's ready (and you want me to trigger).
I think you can now @ydshieh, thank you!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Didn't see this at first. Looking at the test and remote code, I don't think that will be very breaking. Haven't seen any model in the hub without an…
CircleCI still has some failing jobs, so I will wait until it's ✅ here (ping me when it's ✅ 🙏).
Ah sorry about that @ydshieh, it should be good now!
OK, I will trigger, but will only share the reports on Monday.
Sounds good, thanks!
[v5] 🚨Refactor subprocessors handling in processors #41633 dummy
That will be a nice cleanup 😁 left a few comments!
| def save_pretrained( | ||
| self, save_directory, push_to_hub: bool = False, exclude_attributes: Optional[list[str]] = None, **kwargs | ||
| ): | ||
| """ | ||
| Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it | ||
| can be reloaded using the [`~ProcessorMixin.from_pretrained`] method. | ||
| <Tip> | ||
| This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and | ||
| [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`]. Please refer to the docstrings of the | ||
| methods above for more information. | ||
| </Tip> | ||
| Args: | ||
| save_directory (`str` or `os.PathLike`): | ||
| Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will | ||
| be created if it does not exist). | ||
| push_to_hub (`bool`, *optional*, defaults to `False`): | ||
| Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the | ||
| repository you want to push to with `repo_id` (will default to the name of `save_directory` in your | ||
| namespace). | ||
| exclude_attributes (`list[str]`, *optional*): | ||
| A list of attributes to exclude from saving. |
I see why we need to exclude the attributes, but I don't think it's enough of a motivation to add an argument to a very fundamental function. It is not a minimal user interface.
You could add a helper like:
```python
def get_attributes_for_save(self) -> list[str]:
    # default: same as runtime attributes
    return list(self.get_attributes())
```
That can be overridden in each place where you want to remove some attributes, which makes the cutting of attributes cleaner and doesn't grow the public API. (For instance; there could be other solutions.)
The `get_attributes` util you wrote could be put to that use as well.
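A sketch of how that could look in a concrete processor, assuming the `get_attributes` / `get_attributes_for_save` helpers proposed in this thread (neither is an existing library API):

```python
from transformers import ProcessorMixin


class MyProcessor(ProcessorMixin):
    def get_attributes_for_save(self) -> list[str]:
        # Save everything except a sub-processor we don't want serialized;
        # save_pretrained() itself stays untouched. get_attributes() is the
        # runtime helper mentioned in this thread (hypothetical here).
        return [attr for attr in self.get_attributes() if attr != "audio_tokenizer"]
```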
[v5] 🚨Refactor subprocessors handling in processors #41633 dummy
…m/yonigozlan/transformers into remove-attributes-from-processors
[For maintainers] Suggested jobs to run (before merge): run-slow: align, altclip, aria, auto, aya_vision, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp
What does this PR do?
Refactor the handling of subprocessors in processors.
- Removes the `attributes` attribute in processors, along with all `"subprocessor"_class` attributes.

This PR is a requirement for #41388, as otherwise we'd have to manually check that all `image_processor_class` attributes are set to "AutoImageProcessor".

Cc @ArthurZucker @Cyrilvallez @zucchini-nlp @molbap (and also @ydshieh as this might break some parts of the CI 👀, although I checked that all processor tests are still passing, except kosmos2.5, but that's because of a PIL.UnidentifiedImageError ;)).

Update: I'm seeing some tests breaking in `test_processor_auto.py`, related to registering custom processors and subprocessors in transformers. How used is this, and can we break it slightly for v5? 👀

Update 2: It looks like it's not really a problem. The only edge case that will break is if a custom processor was defined by inheriting from `ProcessorMixin` without overriding `__init__`. `ProcessorMixin` used to have "feature_extractor" and "tokenizer" attributes by default; now it doesn't (which makes more sense imo). Fixed the tests by modifying the custom processor on the hub to add an init.
🚨 Breaking change: the `attributes` and `"subprocessor"_class` class attributes are removed from processors.
For example:
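A minimal before/after sketch of what this implies for a processor definition; the class and attribute names are illustrative, and the exact v5 shape is an assumption inferred from the diff hunks above:

```python
from transformers import ProcessorMixin


# Before: sub-processors are declared through class attributes.
class MyProcessor(ProcessorMixin):
    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "AutoImageProcessor"
    tokenizer_class = "AutoTokenizer"

    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        super().__init__(image_processor=image_processor, tokenizer=tokenizer, **kwargs)


# After: the `attributes` / `*_class` declarations are gone; the sub-processors
# accepted by __init__ are the source of truth.
class MyProcessorV5(ProcessorMixin):
    def __init__(self, image_processor=None, tokenizer=None, **kwargs):
        super().__init__(image_processor=image_processor, tokenizer=tokenizer, **kwargs)
```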