
Conversation

@yonigozlan (Member) commented Oct 15, 2025

What does this PR do?

Refactor the handling of subprocessors in processors.

  • The main change is that subprocessors are deduced from the `__init__` signature instead of having to manually add "subprocessor_class" attributes (a rough sketch of the idea is below).
  • This means we can remove the `attributes` class attribute in processors, along with all the per-subprocessor "*_class" attributes.
  • We also now have a single source of truth for which image processor is loaded by default (the Auto subprocessor classes).
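
As a rough sketch of the idea (not the actual transformers implementation; the suffix list below is an assumption for illustration), deducing the subprocessor names from the `__init__` signature could look like this:

    import inspect

    # Sketch only: infer subprocessor attribute names from a processor's __init__
    # signature instead of a hand-maintained `attributes` class attribute.
    SUBPROCESSOR_SUFFIXES = ("tokenizer", "image_processor", "feature_extractor", "video_processor", "audio_tokenizer")

    def infer_subprocessor_names(processor_cls) -> list[str]:
        params = inspect.signature(processor_cls.__init__).parameters
        return [
            name
            for name, param in params.items()
            if name != "self"
            and param.kind is not inspect.Parameter.VAR_KEYWORD
            and name.endswith(SUBPROCESSOR_SUFFIXES)
        ]

    # e.g. infer_subprocessor_names(OwlViTProcessor) would give ["image_processor", "tokenizer"]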

This PR is a requirement for #41388, as otherwise we'd have to manually check that all image_processor_class attributes are set to "AutoImageProcessor".

Cc @ArthurZucker @Cyrilvallez @zucchini-nlp @molbap (and also @ydshieh as this might break some parts of the CI 👀, although I checked that all processor tests still pass, except kosmos2.5, and that one fails because of a PIL.UnidentifiedImageError ;)).

Update: I'm seeing some tests breaking in test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How widely used is this, and can we break it slightly for v5? 👀
Update 2: It looks like it's not really a problem. The only edge case that will break is a custom processor defined by inheriting from ProcessorMixin without overriding __init__ (illustrated below). ProcessorMixin used to have "feature_extractor" and "tokenizer" attributes by default; now it doesn't (which makes more sense imo).
Fixed the tests by modifying the custom processor on the Hub to add an __init__.
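
A hedged sketch of that edge case (the class and attribute names here are hypothetical, for illustration only):

    from transformers import ProcessorMixin

    # A custom/remote processor that relied on ProcessorMixin's old defaults.
    class MyCustomProcessor(ProcessorMixin):
        feature_extractor_class = "AutoFeatureExtractor"
        tokenizer_class = "AutoTokenizer"
        # No __init__ here: this used to work because ProcessorMixin defined
        # attributes = ["feature_extractor", "tokenizer"] by default.
        # With this PR, subprocessors are deduced from __init__, so the fix is to add one:
        # def __init__(self, feature_extractor, tokenizer, **kwargs):
        #     super().__init__(feature_extractor, tokenizer, **kwargs)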

🚨Breaking change:

  • If a model was saved with one processor and a different processor class is used to load the checkpoint, the loaded subprocessors used to be the ones hardcoded in the loading class definition; now they will be the ones that were originally saved.
    For example:
processor = OwlViTProcessor.from_pretrained("some_owlv2_checkpoint")
print(type(processor.image_processor))
# Used to be OwlViTImageProcessor, will now be Owlv2ImageProcessor, which makes more sense in my opinion

"AutoFeatureExtractor": "FeatureExtractionMixin",
"AutoImageProcessor": "ImageProcessingMixin",
"AutoVideoProcessor": "BaseVideoProcessor",
"audio_tokenizer": "DacModel",
Contributor:

We should be able to use AutoModelForAudioTokenization, no?

class AutoModelForAudioTokenization(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_AUDIO_TOKENIZATION_MAPPING

We want to standardize this in the future for other models as well
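
A hedged sketch of what that could look like in the mapping quoted above (this reflects the suggestion, not what the PR currently does):

    # Point the audio_tokenizer entry at the Auto class rather than a concrete model,
    # so the mapping stays model-agnostic.
    "audio_tokenizer": "AutoModelForAudioTokenization",  # was "DacModel"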

@zucchini-nlp (Member) left a comment:

Nice, the `attributes` class attribute was indeed a bit redundant. LGTM, though you might need to rebase on main and check if tests pass. I just merged a non-legacy saving PR.

),
),
("smollm3", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("smolvlm", ("PreTrainedTokenizer", "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
Member:

not related to this PR, but using PreTrainedTokenizer as auto-class looks funny 😄

Member Author:

Yes, not sure if that should be the case. @itazap is that expected / is it a potential issue?

Contributor:

Was wondering the same thing 👀

("video_llava", "VideoLlavaVideoProcessor"),
("videomae", "VideoMAEVideoProcessor"),
("vjepa2", "VJEPA2VideoProcessor"),
("video_llama_3", "VideoLlama3VideoProcessor"), # PLACEHOLDER - needs proper video processor class

Member Author:

Yes, my bad, remnants from the script...

"""

attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template", "num_image_tokens"]
Member:

Oh, let's delete valid_kwargs wherever it was left; it's not used anywhere iirc. Can totally do it in a separate PR.

Member Author:

Nice, might as well add it here

attributes = ["image_processor", "tokenizer", "qformer_tokenizer"]
image_processor_class = ("BlipImageProcessor", "BlipImageProcessorFast")
tokenizer_class = "AutoTokenizer"
qformer_tokenizer_class = "AutoTokenizer"
Member:

Do we need qformer_tokenizer_class? In InstructBlipVideo I can see it is deleted.

Member Author:

Indeed we can delete it!
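
For illustration, once the *_class attributes are dropped, the processor head could reduce to something like this (signature approximated from the quoted diff, not necessarily the exact one in the PR):

    from transformers import ProcessorMixin

    class InstructBlipProcessor(ProcessorMixin):
        def __init__(self, image_processor, tokenizer, qformer_tokenizer, **kwargs):
            # Subprocessors are now deduced from this signature; no `attributes`
            # or per-subprocessor `*_class` class attributes are needed.
            super().__init__(image_processor, tokenizer, qformer_tokenizer, **kwargs)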


attributes = ["image_processor", "char_tokenizer"]
image_processor_class = ("ViTImageProcessor", "ViTImageProcessorFast")
char_tokenizer_class = "MgpstrTokenizer"
Member:

Same question: is it because they have a prefix before "tokenizer"?

Member Author:

Yep, my scripts failed on these 😅. Thanks for pointing it out!

Comment on lines +74 to +77
image_processor = EfficientNetImageProcessor.from_pretrained(self.tmpdirname)
image_processor.save_pretrained(self.tmpdirname)
tokenizer = BertTokenizer.from_pretrained(self.tmpdirname)
tokenizer.save_pretrained(self.tmpdirname)
Member:

I think this might cause some issues in tests after I merged non-legacy processor saving. We'll end up with preprocessor_config.json and processor_config.json in the same dir.

I remember some tests try to manipulate configs by loading and saving only the image processor, either standalone or as part of a processor; they might start failing after the rebase.

Member Author:

Ah yes, in general I think we can standardize/simplify the processor tests a lot more in ProcessorTesterMixin. Right now it's a bit of a nightmare every time we want to make a change (I was going crazy on the default-to-fast-image-processors PR because of this). I plan to open a PR to change the tests soon!

Member:

Thanks, updating the tests would help a lot, and prob we could delete many overwritten tests as well.

@ydshieh (Collaborator) commented Oct 16, 2025

Do you want me to trigger a full CI for this PR?

Better to have it, @zucchini-nlp can confirm it's helpful 😅.

Let me know once it's ready (and you want me to trigger).

@ArthurZucker ArthurZucker removed their request for review October 16, 2025 14:10
@yonigozlan (Member Author):

I think you can now, @ydshieh, thank you!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@zucchini-nlp (Member):

Update: I'm seeing some tests breaking in test_processor_auto.py, related to registering custom processors and subprocessors in transformers. How widely used is this, and can we break it slightly for v5? 👀

Didn't see this at first. Looking at the test and remote code, I don't think that will be very breaking. Haven't seen any model on the Hub without an __init__ until today, so prob we can simply adjust the remote code in the test repo.

@ydshieh (Collaborator) commented Oct 17, 2025

I think you can now, @ydshieh, thank you!

CircleCI still has some failing jobs, so I will wait until it's ✅ here (ping me when it's ✅ 🙏).

@yonigozlan changed the title from "[v5] Refactor subprocessors handling in processors" to "[v5] 🚨Refactor subprocessors handling in processors" on Oct 17, 2025
@yonigozlan (Member Author):

Ah, sorry about that @ydshieh, it should be good now!

@ydshieh (Collaborator) commented Oct 17, 2025

OK, I will trigger, but only share the reports on Monday.

@yonigozlan (Member Author):

OK, I will trigger, but only share the reports on Monday.

Sounds good, thanks!

ydshieh added a commit that referenced this pull request Oct 20, 2025
[v5] 🚨Refactor subprocessors handling in processors #41633

dummy
@molbap (Contributor) left a comment:

That will be a nice cleanup 😁 left a few comments!


Comment on lines 781 to 805
    def save_pretrained(
        self, save_directory, push_to_hub: bool = False, exclude_attributes: Optional[list[str]] = None, **kwargs
    ):
        """
        Saves the attributes of this processor (feature extractor, tokenizer...) in the specified directory so that it
        can be reloaded using the [`~ProcessorMixin.from_pretrained`] method.

        <Tip>

        This class method is simply calling [`~feature_extraction_utils.FeatureExtractionMixin.save_pretrained`] and
        [`~tokenization_utils_base.PreTrainedTokenizerBase.save_pretrained`]. Please refer to the docstrings of the
        methods above for more information.

        </Tip>

        Args:
            save_directory (`str` or `os.PathLike`):
                Directory where the feature extractor JSON file and the tokenizer files will be saved (directory will
                be created if it does not exist).
            push_to_hub (`bool`, *optional*, defaults to `False`):
                Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
                repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
                namespace).
            exclude_attributes (`list[str]`, *optional*):
                A list of attributes to exclude from saving.
Contributor:

I see why we need to exclude the attributes, but I don't think it's enough of a motivation to add an argument to a very fundamental function. It is not a minimal user interface.
You could add a helper like:

    def get_attributes_for_save(self) -> list[str]:
        # default: same as runtime attributes
        return list(self.get_attributes())

That can be overridden in each place where you want to remove some attributes; it makes the trimming of attributes cleaner and doesn't grow the public API. (This is one option, for instance; there could be other solutions.)
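
A possible override following that suggestion (the processor and attribute names here are hypothetical; get_attributes is the helper mentioned in this thread):

    from transformers import ProcessorMixin

    class SomeProcessor(ProcessorMixin):
        def get_attributes_for_save(self) -> list[str]:
            # Persist everything except the attribute we want to leave out of saving.
            return [attr for attr in self.get_attributes() if attr != "audio_tokenizer"]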

Contributor:

The get_attributes util you wrote could be put to that use as well

ydshieh added a commit that referenced this pull request Oct 21, 2025
[v5] 🚨Refactor subprocessors handling in processors #41633

dummy
@github-actions:

[For maintainers] Suggested jobs to run (before merge)

run-slow: align, altclip, aria, auto, aya_vision, bark, blip, blip_2, bridgetower, bros, chameleon, chinese_clip, clap, clip, clipseg, clvp

