[trainer] allow processor instead of tokenizer #30864
```diff
@@ -69,6 +69,7 @@
     MODEL_MAPPING_NAMES,
 )
 from .optimization import Adafactor, get_scheduler
+from .processing_utils import ProcessorMixin
 from .pytorch_utils import (
     ALL_LAYERNORM_LAYERS,
     is_torch_greater_or_equal_than_1_13,
```
```diff
@@ -318,6 +319,10 @@ class Trainer:
             The tokenizer used to preprocess the data. If provided, will be used to automatically pad the inputs to the
             maximum length when batching inputs, and it will be saved along the model to make it easier to rerun an
             interrupted training or reuse the fine-tuned model.
+        processor ([`ProcessorMixin`], *optional*):
+            The processor used to pre- and post-process the data for multimodal models. If provided, will be used to
+            automatically pad the inputs to the maximum length when batching inputs, and it will be saved along the
+            model to make it easier to rerun an interrupted training or reuse the fine-tuned model.
         model_init (`Callable[[], PreTrainedModel]`, *optional*):
             A function that instantiates the model to be used. If provided, each call to [`~Trainer.train`] will start
             from a new instance of the model as given by this function.
```
@@ -375,6 +380,7 @@ def __init__( | |||||||||||||||||||
train_dataset: Optional[Union[Dataset, IterableDataset, "datasets.Dataset"]] = None, | ||||||||||||||||||||
eval_dataset: Optional[Union[Dataset, Dict[str, Dataset], "datasets.Dataset"]] = None, | ||||||||||||||||||||
tokenizer: Optional[PreTrainedTokenizerBase] = None, | ||||||||||||||||||||
processor: Optional[ProcessorMixin] = None, | ||||||||||||||||||||
model_init: Optional[Callable[[], PreTrainedModel]] = None, | ||||||||||||||||||||
compute_metrics: Optional[Callable[[EvalPrediction], Dict]] = None, | ||||||||||||||||||||
callbacks: Optional[List[TrainerCallback]] = None, | ||||||||||||||||||||
|
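A minimal usage sketch of the signature above. The checkpoint name is an illustrative choice for a multimodal (ASR) model, and the surrounding setup (dataset, collator, custom training arguments) is elided; none of it is taken from the PR itself.

```python
# Sketch of how the new `processor` argument would be passed, assuming this PR's branch.
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, Trainer, TrainingArguments

processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
model = AutoModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="whisper-finetuned"),
    processor=processor,  # new argument: pass the processor instead of a tokenizer
)
```

For a text-only LLM there is no processor, and the existing `tokenizer=...` argument would be used exactly as before.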
```diff
@@ -510,6 +516,10 @@ def __init__(
         ):
             self.place_model_on_device = False
 
+        self.tokenizer = processor if processor is not None else tokenizer
+        if processor is not None and hasattr(processor, "feature_extractor"):
+            tokenizer = processor.feature_extractor
+
         default_collator = (
             DataCollatorWithPadding(tokenizer)
             if tokenizer is not None and isinstance(tokenizer, (PreTrainedTokenizerBase, SequenceFeatureExtractor))
@@ -518,7 +528,6 @@ def __init__(
         self.data_collator = data_collator if data_collator is not None else default_collator
         self.train_dataset = train_dataset
         self.eval_dataset = eval_dataset
-        self.tokenizer = tokenizer
 
         # Bnb Quantized models doesn't support `.to` operation.
         if (
```

Review discussion on the `processor` / `feature_extractor` handling above:

We should add a check here to ensure if the user has passed in both `tokenizer` and `processor` [...]

IMO it's ok to let the user pass both and have the [...]

However we never save their original tokenizer. This can lead to confusion down the road because their tokenizer is essentially never used. I'd rather guard this explicitly.

I disagree here, it makes the behaviour ambiguous. In effect, this PR means we're deprecating the use of the [...]

This is super audio specific and can create surprising behaviour. If I passed in `processor=processor`, I would expect the processor to be used, not the feature extractor. Instead, if the previous scripts e.g. [...]

Note that all this is doing is setting the padding method in the default data collator (transformers/src/transformers/trainer.py, lines 513 to 517 in 15c74a2). There's no [...] This is why we look for the corresponding attributes in the processor (transformers/src/transformers/trainer.py, lines 525 to 528 in f8dc983).

It doesn't just define padding behaviour. If [...]

Note here that we're setting `tokenizer = processor.feature_extractor`, not `self.tokenizer = processor.feature_extractor` => the feature extractor is strictly used for padding purposes in the data collator, and not set as an attribute of the trainer. In fact, since we set `self.tokenizer = processor`, the processor is correctly the attribute which is both saved and pushed to the hub.

Agree though that this behaviour is somewhat "silent" to the user and can be improved on (will iterate on this once we have a design established).
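As a concrete illustration of the "guard this explicitly" suggestion in the thread above, here is a minimal sketch of such a check; the helper name and error message are hypothetical and not code from this PR:

```python
def check_processing_args(tokenizer=None, processor=None):
    """Hypothetical guard: reject the ambiguous case where both are passed."""
    if tokenizer is not None and processor is not None:
        raise ValueError(
            "Passing both `tokenizer` and `processor` to `Trainer` is ambiguous; "
            "pass only `processor` for multimodal models, or only `tokenizer` for text-only models."
        )
    return processor if processor is not None else tokenizer
```

With a guard like this, the silent precedence of `processor` over `tokenizer` described in the thread would surface as an explicit error instead.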
Instead of setting `self.tokenizer` to processor, we should:

- rename `self.tokenizer` to `self.processor` in the trainer. This removes ambiguity for anyone reading the code
- when a `processor` is passed, expose a `tokenizer` property which returns `self.processor`, alongside a warning saying `self.tokenizer` is deprecated
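A minimal sketch of what this property-based deprecation could look like; the class skeleton and warning text are illustrative, not code from the PR:

```python
import warnings


class Trainer:
    def __init__(self, processor=None, **kwargs):
        # Store the processing class under an unambiguous attribute name.
        self.processor = processor

    @property
    def tokenizer(self):
        # Keep the old attribute working, but steer users towards `processor`.
        warnings.warn(
            "`Trainer.tokenizer` is deprecated; use `Trainer.processor` instead.",
            FutureWarning,
        )
        return self.processor
```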
Note that `tokenizer` is not deprecated. If we're fine-tuning LLMs there's no notion of a `processor`, only a `tokenizer`. The `processor` is only relevant when we're training a multimodal model, such as an ASR model.

This is why we maintain the `tokenizer` attribute in the Trainer. What I propose we do is have two attributes:

- `self.tokenizer` -> used for LLMs where there is only a `tokenizer`. Will be `None` for multimodal models
- `self.processor` -> used for multimodal models where there is a `processor`. Will be `None` for LLMs
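A small sketch of the two-attribute proposal above, written as a hypothetical helper (not code from the PR) to make the mutual exclusivity concrete:

```python
def resolve_processing_attributes(tokenizer=None, processor=None):
    """Return the (tokenizer, processor) pair so that exactly one is set."""
    if processor is not None:
        return None, processor  # multimodal model: only `self.processor` is set
    return tokenizer, None      # text-only model: only `self.tokenizer` is set
```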
I would much rather have `self.processor` :) Or even be more clear: `self.multimodal_processor`