
Conversation

@yonigozlan (Member) commented Oct 15, 2025

What does this PR do?

Refactor the handling of subprocessors in processors.

  • The main change is that subprocessors are deduced from the `__init__` signature instead of having to manually add "subprocessor_class" attributes (a rough sketch of the idea is below).
  • This means we can remove the `attributes` class attribute in processors, along with all the per-subprocessor "*_class" attributes.
  • We also now have a single source of truth for which image processor is loaded by default (the Auto subprocessor classes).
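
As a rough sketch of the idea (not the actual transformers implementation; the suffix list below is an assumption for illustration), deducing the subprocessor names from the `__init__` signature could look like this:

    import inspect

    # Sketch only: infer subprocessor attribute names from a processor's __init__
    # signature instead of a hand-maintained `attributes` class attribute.
    SUBPROCESSOR_SUFFIXES = ("tokenizer", "image_processor", "feature_extractor", "video_processor", "audio_tokenizer")

    def infer_subprocessor_names(processor_cls) -> list[str]:
        params = inspect.signature(processor_cls.__init__).parameters
        return [
            name
            for name, param in params.items()
            if name != "self"
            and param.kind is not inspect.Parameter.VAR_KEYWORD
            and name.endswith(SUBPROCESSOR_SUFFIXES)
        ]

    # e.g. infer_subprocessor_names(OwlViTProcessor) would give ["image_processor", "tokenizer"]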

This PR is a requirement for #41388, as otherwise we'd have to manually check that all image_processor_class attributes are set to "AutoImageProcessor".

Cc @ArthurZucker @Cyrilvallez @zucchini-nlp @molbap (and also @ydshieh as this might break some parts of the CI 👀, although I checked that all processor tests still pass, except kosmos2.5, and that one fails because of a PIL.UnidentifiedImageError ;)).

Update: I'm seeing some tests breaking in test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How widely used is this, and can we break it slightly for v5? 👀
Update 2: It looks like it's not really a problem. The only edge case that will break is a custom processor defined by inheriting from ProcessorMixin without overriding __init__ (illustrated below). ProcessorMixin used to have "feature_extractor" and "tokenizer" attributes by default; now it doesn't (which makes more sense imo).
Fixed the tests by modifying the custom processor on the Hub to add an __init__.
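
A hedged sketch of that edge case (the class and attribute names here are hypothetical, for illustration only):

    from transformers import ProcessorMixin

    # A custom/remote processor that relied on ProcessorMixin's old defaults.
    class MyCustomProcessor(ProcessorMixin):
        feature_extractor_class = "AutoFeatureExtractor"
        tokenizer_class = "AutoTokenizer"
        # No __init__ here: this used to work because ProcessorMixin defined
        # attributes = ["feature_extractor", "tokenizer"] by default.
        # With this PR, subprocessors are deduced from __init__, so the fix is to add one:
        # def __init__(self, feature_extractor, tokenizer, **kwargs):
        #     super().__init__(feature_extractor, tokenizer, **kwargs)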

🚨Breaking change:

  • If a model was saved with one processor and a different processor class is used to load the checkpoint, the loaded subprocessors used to be the ones hardcoded in the loading class definition; now they will be the ones that were originally saved.
    For example:
processor = OwlViTProcessor.from_pretrained("some_owlv2_checkpoint")
print(type(processor.image_processor))
# Used to be OwlViTImageProcessor, will now be Owlv2ImageProcessor, which makes more sense in my opinion

"AutoFeatureExtractor": "FeatureExtractionMixin",
"AutoImageProcessor": "ImageProcessingMixin",
"AutoVideoProcessor": "BaseVideoProcessor",
"audio_tokenizer": "DacModel",
Contributor:

We should be able to use AutoModelForAudioTokenization, no?

class AutoModelForAudioTokenization(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_AUDIO_TOKENIZATION_MAPPING

We want to standardize this in the future for other models as well
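
A hedged sketch of what that could look like in the mapping quoted above (this reflects the suggestion, not what the PR currently does):

    # Point the audio_tokenizer entry at the Auto class rather than a concrete model,
    # so the mapping stays model-agnostic.
    "audio_tokenizer": "AutoModelForAudioTokenization",  # was "DacModel"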

@zucchini-nlp (Member) left a comment:

Nice, the `attributes` class attribute was indeed a bit redundant. LGTM, though you might need to rebase on main and check if tests pass. I just merged a non-legacy saving PR.

),
),
("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
Member:

not related to this PR, but using PreTrainedTokenizer as auto-class looks funny 😄

Member Author:

Yes, not sure if that should be the case. @itazap is that expected / is it a potential issue?

Contributor:

Was wondering the same thing 👀

("video_llava", "VideoLlavaVideoProcessor"),
("videomae", "VideoMAEVideoProcessor"),
("vjepa2", "VJEPA2VideoProcessor"),
("video_llama_3", "VideoLlama3VideoProcessor"), # PLACEHOLDER - needs proper video processor class

Member Author:

Yes, my bad, remnants from the script...

"""

attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template", "num_image_tokens"]
Member:

Oh, let's delete valid_kwargs wherever it was left; it's not used anywhere iirc. Can totally do it in a separate PR.

Member Author:

Nice, might as well add it here

attributes = ["image_processor", "tokenizer", "qformer_tokenizer"]
image_processor_class = ("BlipImageProcessor", "BlipImageProcessorFast")
tokenizer_class = "AutoTokenizer"
qformer_tokenizer_class = "AutoTokenizer"
Member:

Do we need qformer_tokenizer_class? In InstructBlipVideo I can see it is deleted.

Member Author:

Indeed we can delete it!
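
For illustration, once the *_class attributes are dropped, the processor head could reduce to something like this (signature approximated from the quoted diff, not necessarily the exact one in the PR):

    from transformers import ProcessorMixin

    class InstructBlipProcessor(ProcessorMixin):
        def __init__(self, image_processor, tokenizer, qformer_tokenizer, **kwargs):
            # Subprocessors are now deduced from this signature; no `attributes`
            # or per-subprocessor `*_class` class attributes are needed.
            super().__init__(image_processor, tokenizer, qformer_tokenizer, **kwargs)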


attributes = ["image_processor", "char_tokenizer"]
image_processor_class = ("ViTImageProcessor", "ViTImageProcessorFast")
char_tokenizer_class = "MgpstrTokenizer"
Member:

Same question: is it because they have a prefix before "tokenizer"?

Member Author:

Yep, my scripts failed on these 😅. Thanks for pointing it out!

Comment on lines +74 to +77
image_processor = EfficientNetImageProcessor.from_pretrained(self.tmpdirname)
image_processor.save_pretrained(self.tmpdirname)
tokenizer = BertTokenizer.from_pretrained(self.tmpdirname)
tokenizer.save_pretrained(self.tmpdirname)
Member:

I think this might cause some issues in tests after I merged non-legacy processor saving. We'll end up with preprocessor_config.json and processor_config.json in the same dir.

I remember some tests try to manipulate configs by loading and saving only the image processor, either standalone or as part of a processor; they might start failing after the rebase.

Member Author:

Ah yes, in general I think we can standardize/simplify the processor tests a lot more in ProcessorTesterMixin. Right now it's a bit of a nightmare every time we want to make a change (I was going crazy on the default-to-fast-image-processors PR because of this). I plan to open a PR to change the tests soon!

Member:

Thanks, updating the tests would help a lot, and prob we could delete many overwritten tests as well.

@ydshieh (Collaborator) commented Oct 16, 2025

Do you want me to trigger a full CI for this PR?

Better to have it, @zucchini-nlp can confirm it's helpful 😅.

Let me know once it's ready (and you want me to trigger).

@ArthurZucker ArthurZucker removed their request for review October 16, 2025 14:10
@yonigozlan (Member Author):

I think you can now, @ydshieh, thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member):

Update: I'm seeing some tests breaking in test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How widely used is this, and can we break it slightly for v5? 👀

Didn't see this at first. Looking at the test and remote code, I don't think that will be very breaking. Haven't seen any model on the Hub without an __init__ until today, so prob we can simply adjust the remote code in the test repo.

@ydshieh (Collaborator) commented Oct 17, 2025

I think you can now, @ydshieh, thank you!

CircleCI still has some failing jobs, so I will wait until it's ✅ here (ping me when it's ✅ 🙏).

@yonigozlan changed the title from "[v5] Refactor subprocessors handling in processors" to "[v5] 🚨Refactor subprocessors handling in processors" on Oct 17, 2025
@yonigozlan (Member Author):

Ah, sorry about that @ydshieh, it should be good now!

@ydshieh (Collaborator) commented Oct 17, 2025

OK, I will trigger, but only share the reports on Monday.

@yonigozlan (Member Author):

OK, I will trigger, but only share the reports on Monday.

Sounds good, thanks!

ydshieh added a commit that referenced this pull request Oct 20, 2025
[v5] 🚨Refactor subprocessors handling in processors #41633

dummy
@molbap (Contributor) left a comment:

That will be a nice cleanup 😁 left a few comments!


Comment on lines 781 to 805
    def save_pretrained(
        self, save_directory, push_to_hub: bool = False, exclude_attributes: Optional[list[str]] = None, **kwargs
    ):
        """
        Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it
        can be reloaded using the [`~ProcessorMixin.from_pretrained`] method.

        <Tip>

        This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
        [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`]. Please refer to the docstrings of the
        methods above for more information.

        </Tip>

        Args:
            save_directory (`str` or `os.PathLike`):
                Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
                be created if it does not exist).
            push_to_hub (`bool`, *optional*, defaults to `False`):
                Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
                repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
                namespace).
            exclude_attributes (`list[str]`, *optional*):
                A list of attributes to exclude from saving.
Contributor:

I see why we need to exclude the attributes, but I don't think it's enough of a motivation to add an argument to a very fundamental function. It is not a minimal user interface.
You could add a helper like:

    def get_attributes_for_save(self) -> list[str]:
        # default: same as runtime attributes
        return list(self.get_attributes())

That can be overridden in each place where you want to remove some attributes; it makes the trimming of attributes cleaner and doesn't grow the public API. (This is one option, for instance; there could be other solutions.)
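
A possible override following that suggestion (the processor and attribute names here are hypothetical; get_attributes is the helper mentioned in this thread):

    from transformers import ProcessorMixin

    class SomeProcessor(ProcessorMixin):
        def get_attributes_for_save(self) -> list[str]:
            # Persist everything except the attribute we want to leave out of saving.
            return [attr for attr in self.get_attributes() if attr != "audio_tokenizer"]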

Contributor:

The get_attributes util you wrote could be put to that use as well

ydshieh added a commit that referenced this pull request Oct 21, 2025
[v5] 🚨Refactor subprocessors handling in processors #41633

dummy
@github-actions:

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, altclip, aria, auto, aya_vision, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp

