[trainer] allow processor instead of tokenizer #30864
Conversation
Thanks for doing this! I think it makes sense, but let's guard some user behavior 🤗
src/transformers/trainer.py (outdated)
```python
self.tokenizer = processor if processor is not None else tokenizer
if processor is not None and hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
```
We should add a check here to ensure that if the user has passed in both `tokenizer` and `processor`, we raise an error saying you must only pass in one.
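For illustration, a guard along these lines might look like the following sketch (the helper name `_check_processing_args` is hypothetical, not actual Trainer code):

```python
def _check_processing_args(tokenizer=None, processor=None):
    # Hypothetical helper illustrating the suggested guard.
    if tokenizer is not None and processor is not None:
        raise ValueError(
            "Both `tokenizer` and `processor` were passed to the Trainer; "
            "please pass only one of them."
        )
```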
IMO it's ok to let the user pass both and have the `processor` take precedence (there's no harm in this for the user).
However we never save their original tokenizer. This can lead to confusion down the road because their tokenizer is essentially never used. I'd rather guard this explicitly.
> IMO it's ok to let the user pass both and have the processor take precedence (there's no harm in this for the user)
I disagree here, it makes the behaviour ambiguous. In effect, this PR means we're deprecating the use of the `tokenizer` argument, so we should make it clear which argument is preferred and push the user towards that, or at least throw a warning.
Thanks for working on this!
At the moment, this solution is a bit too audio specific and introduces silent behaviour. Instead, we should explicitly push users to use `processor`, and `processor` should accept all of our processing classes: processors, image processors, feature extractors, tokenizers.
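As a rough illustration of "accepting all processing classes", a type alias could look like the sketch below (the alias name `ProcessingClass` is hypothetical; the base classes shown exist in transformers, but the exact set to accept is part of the design discussion):

```python
from typing import Union

from transformers import (
    BaseImageProcessor,
    PreTrainedTokenizerBase,
    ProcessorMixin,
    SequenceFeatureExtractor,
)

# Hypothetical alias for "any processing class" the `processor` argument could accept.
ProcessingClass = Union[
    PreTrainedTokenizerBase,
    SequenceFeatureExtractor,
    BaseImageProcessor,
    ProcessorMixin,
]
```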
src/transformers/trainer.py (outdated)
```python
if processor is not None and hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
```
This is super audio-specific and can create surprising behaviour. If I passed in `processor=processor`, I would expect the processor to be used, not the feature extractor. Instead, if the previous scripts, e.g. examples/pytorch/speech-recognition/run_speech_recognition_seq2seq.py, want just the feature extractor to be passed in, then that should be specified when calling the trainer, i.e. `processor=feature_extractor`.
Note that all this is doing is setting the padding method in the default data collator:
transformers/src/transformers/trainer.py
Lines 513 to 517 in 15c74a2
```python
default_collator = (
    DataCollatorWithPadding(tokenizer)
    if tokenizer is not None and isinstance(tokenizer, (PreTrainedTokenizerBase, SequenceFeatureExtractor))
    else default_data_collator
)
```
There's no `pad` method defined for processors, so the processor cannot be used here. Only sequence feature extractors and tokenizers have a `pad` method defined, so they are the only two viable options.
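For intuition, here is a toy stand-in for the default collator showing why a `pad` method is required; this is a sketch, not the real DataCollatorWithPadding:

```python
from dataclasses import dataclass


@dataclass
class SimpleCollatorWithPadding:
    # Anything exposing a `pad` method works here: a tokenizer or a sequence
    # feature extractor. Processors define no `pad` method, so they cannot be
    # dropped in directly.
    padder: object

    def __call__(self, features):
        # Delegate batching and padding to the wrapped object's `pad` method.
        return self.padder.pad(features, padding=True, return_tensors="pt")
```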
This is why we look for the corresponding attributes in the processor:
transformers/src/transformers/trainer.py
Lines 525 to 528 in f8dc983
```python
if hasattr(processor, "feature_extractor"):
    tokenizer = processor.feature_extractor
elif hasattr(processor, "tokenizer"):
    tokenizer = processor.tokenizer
```
It doesn't just define padding behaviour. If `tokenizer` is set, then this object is also uploaded on `push_to_hub` calls. If we do `tokenizer = processor.feature_extractor`, then a user specifies a processor but only the feature extractor is uploaded.
Note here that we're setting `tokenizer = processor.feature_extractor`, not `self.tokenizer = processor.feature_extractor` => the feature extractor is strictly used for padding purposes in the data collator, and not set as an attribute of the trainer.
In fact, since we set `self.tokenizer = processor`, the processor is correctly the attribute which is both saved and pushed to the Hub.
Agree though that this behaviour is somewhat "silent" to the user and can be improved on (will iterate on this once we have a design established).
src/transformers/trainer.py (outdated)
```diff
@@ -510,6 +516,10 @@ def __init__(
         ):
             self.place_model_on_device = False

+        self.tokenizer = processor if processor is not None else tokenizer
```
Instead of setting `self.tokenizer` to the processor, we should:
- Update all the references of `self.tokenizer` to `self.processor` in the trainer. This removes ambiguity for anyone reading the code.
- If `tokenizer` is passed in as an argument, raise a warning saying it's deprecated in favour of `processor`.
- Add a property `tokenizer` which returns `self.processor`, alongside a warning saying `self.tokenizer` is deprecated (see the sketch after this list).
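A minimal sketch of what that suggestion could look like, assuming a heavily simplified constructor (the class name `TrainerSketch` is just for illustration, not the real Trainer):

```python
import warnings


class TrainerSketch:
    """Illustrative sketch of the suggested deprecation path; not the real Trainer."""

    def __init__(self, processor=None, tokenizer=None):
        if tokenizer is not None:
            warnings.warn(
                "The `tokenizer` argument is deprecated; please pass `processor` instead.",
                FutureWarning,
            )
            # Fall back to the deprecated argument if no processor was given.
            processor = processor if processor is not None else tokenizer
        self.processor = processor

    @property
    def tokenizer(self):
        # Backwards-compatible accessor that forwards to `self.processor`.
        warnings.warn(
            "`Trainer.tokenizer` is deprecated; use `Trainer.processor` instead.",
            FutureWarning,
        )
        return self.processor
```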
Note that `tokenizer` is not deprecated. If we're fine-tuning LLMs there's no notion of a `processor`, only a `tokenizer`. The `processor` is only relevant when we're training a multimodal model, such as an ASR model.
This is why we maintain the `tokenizer` attribute in the Trainer. What I propose we do is have two attributes (sketched below):
- `self.tokenizer` -> used for LLMs where there is only a `tokenizer`. Will be `None` for multimodal models.
- `self.processor` -> used for multimodal models where there is a `processor`. Will be `None` for LLMs.
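In constructor terms, the proposal amounts to roughly the following simplified sketch (not actual Trainer code):

```python
class TrainerWithTwoAttributes:
    """Simplified sketch of the two-attribute proposal; not the real Trainer."""

    def __init__(self, tokenizer=None, processor=None):
        # Text-only LLMs: pass `tokenizer`, leave `processor` as None.
        # Multimodal models: pass `processor`, leave `tokenizer` as None.
        self.tokenizer = tokenizer
        self.processor = processor
```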
I would much rather have `self.processor` :) Or even be more clear: `self.multimodal_processor`
Thanks for the review @amyeroberts! Note that we are not replacing the `tokenizer` argument, only adding `processor` alongside it. Likewise, we only extract the feature extractor from the processor for padding purposes in the default data collator.
@sanchit-gandhi I realise there might be some context missing. There was already a PR which enabled this for image processors, #29896, which was ultimately undone in #30129. The ultimate realisation was that adding a separate argument like `image_processor` for each processing class doesn't scale well. As I mentioned above, this doesn't just affect data collation, but also the objects uploaded to the hub on `push_to_hub` calls. There's an ongoing pull request, #30102, which @NielsRogge has been working on; I agree with the approach he suggests there.
I'd missed that PR - thanks for the context @amyeroberts, that's super helpful! Replied directly on the PR: #30102 (comment)
What does this PR do?
Fixes #23222 by allowing the user to pass the argument `processor` to the Trainer and Seq2SeqTrainer (instead of `tokenizer`). This is much more intuitive when training multimodal models: before, we were encouraging users to pass the feature extractor through the `tokenizer` argument, which is super confusing. After this PR, users can pass the processor directly via `processor=processor`, which is much more sensible.
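For illustration, the change in user code might look like the following sketch (assuming `model`, `training_args`, `train_dataset`, and `processor` are already defined; the exact call sites in the example scripts may differ):

```python
from transformers import Trainer

# Before: the feature extractor had to go through the `tokenizer` argument.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=processor.feature_extractor,
)

# After this PR: the processor can be passed directly.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processor=processor,
)
```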
The Trainer does three things with the `tokenizer`:
- transformers/src/transformers/trainer.py, lines 513 to 517 in f4014e7
- transformers/src/transformers/trainer.py, line 853 in f4014e7
- transformers/src/transformers/trainer.py, lines 3398 to 3399 in f4014e7
We can do all of these things directly with the processor as well. Therefore, all we have to do is set the following in the init method of the Trainer:
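Combining the snippets quoted above, the addition inside `Trainer.__init__` is essentially the following (a sketch of the change proposed in this PR, simplified from the diff):

```python
# Keep a handle on whichever processing object was passed; this is the
# attribute that gets saved with the model and pushed to the Hub.
self.tokenizer = processor if processor is not None else tokenizer

# Processors have no `pad` method, so for the default data collator we fall
# back to the processor's feature extractor or tokenizer for padding.
if processor is not None:
    if hasattr(processor, "feature_extractor"):
        tokenizer = processor.feature_extractor
    elif hasattr(processor, "tokenizer"):
        tokenizer = processor.tokenizer
```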