Not entirely sure about this change, as there is a trade-off between API complexity and ease of use.

This PR adds `tokenizer` as an optional argument to `Trainer` (if this is approved, I will do the same for `TFTrainer`; I have a few recent changes to port there but was mainly waiting for @jplu to be back from vacation to make the two APIs on par).

The benefits are:
- `Trainer` can use a default `data_collator` that will automatically pad examples if the tokenizer is provided, so the user doesn't have to learn about data_collators for simple examples.
- The tokenizer can be saved by `Trainer` for the intermediary checkpoints, so a checkpoint folder can be used directly with our scripts when resuming an interrupted training.

As for the bad part, it's just that it adds a new argument to `Trainer`.
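
To make the first benefit concrete, here is a minimal, dependency-free sketch of the "pad automatically when a tokenizer is provided" idea. It is not the code from this PR: `ToyTokenizer`, `plain_collator`, `padding_collator`, and `choose_collator` are hypothetical names invented for illustration, and the only thing assumed about a tokenizer is that it exposes a `pad_token_id`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Example = Dict[str, List[int]]
Batch = Dict[str, List[List[int]]]
Collator = Callable[[List[Example]], Batch]


@dataclass
class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer; only the pad id matters here."""
    pad_token_id: int = 0


def plain_collator(examples: List[Example]) -> Batch:
    """Batch examples as-is; only works when all sequences share one length."""
    if len({len(ex["input_ids"]) for ex in examples}) > 1:
        raise ValueError("examples have different lengths; padding is required")
    return {"input_ids": [list(ex["input_ids"]) for ex in examples]}


def padding_collator(tokenizer: ToyTokenizer) -> Collator:
    """Build a collator that right-pads every example to the longest one in the batch."""
    def collate(examples: List[Example]) -> Batch:
        longest = max(len(ex["input_ids"]) for ex in examples)
        input_ids, attention_mask = [], []
        for ex in examples:
            ids = list(ex["input_ids"])
            n_pad = longest - len(ids)
            input_ids.append(ids + [tokenizer.pad_token_id] * n_pad)
            # 1 marks a real token, 0 marks padding.
            attention_mask.append([1] * len(ids) + [0] * n_pad)
        return {"input_ids": input_ids, "attention_mask": attention_mask}
    return collate


def choose_collator(
    data_collator: Optional[Collator] = None,
    tokenizer: Optional[ToyTokenizer] = None,
) -> Collator:
    """An explicit collator wins; otherwise pad if a tokenizer is available."""
    if data_collator is not None:
        return data_collator
    if tokenizer is not None:
        return padding_collator(tokenizer)
    return plain_collator


if __name__ == "__main__":
    collate = choose_collator(tokenizer=ToyTokenizer(pad_token_id=0))
    print(collate([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}]))
```

In this sketch the user only passes a tokenizer and never touches a collator, which is the ease-of-use side of the trade-off; the cost is the extra optional argument on the entry point.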