Add tokenizer to Trainer #6689

sgugger · 2020-08-24T14:48:40Z

Not entirely sure about this change, as there is a trade-off between API complexity and ease of use.

This PR adds tokenizer as an optional argument to Trainer (if this is approved, I will do the same for TFTrainer; I have a few recent changes to port there but was mainly waiting for @jplu to be back from vacation to make the two APIs on par).

The benefits are that:

- we can have a smart default data_collator that automatically pads examples if the tokenizer is provided, so the user doesn't have to learn about data_collators for simple examples.
- we can save the tokenizer along with the model directly inside Trainer for the intermediate checkpoints, so a checkpoint folder can be used directly with our scripts when resuming an interrupted training.
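
To make the first point concrete, here is a minimal, library-free sketch of how a default collator could be chosen. It is not the code in this PR: `pick_default_collator`, `make_padding_collator`, `simple_collator` and `ToyTokenizer` are hypothetical names used only for illustration, and the only thing assumed from the tokenizer is a `pad_token_id` attribute.

```python
from typing import Callable, Dict, List

Example = Dict[str, List[int]]
Batch = Dict[str, List[List[int]]]


def simple_collator(examples: List[Example]) -> Batch:
    """Group examples as-is; only sensible when all sequences share one length."""
    return {key: [ex[key] for ex in examples] for key in examples[0]}


def make_padding_collator(pad_token_id: int) -> Callable[[List[Example]], Batch]:
    """Build a collator that pads every sequence to the longest one in the batch."""

    def collate(examples: List[Example]) -> Batch:
        max_len = max(len(ex["input_ids"]) for ex in examples)
        batch: Batch = {"input_ids": [], "attention_mask": []}
        for ex in examples:
            ids = ex["input_ids"]
            n_pad = max_len - len(ids)
            # Real tokens get mask 1, padding positions get mask 0.
            batch["input_ids"].append(ids + [pad_token_id] * n_pad)
            batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        return batch

    return collate


def pick_default_collator(data_collator=None, tokenizer=None):
    """An explicit collator wins; otherwise pad if a tokenizer is available."""
    if data_collator is not None:
        return data_collator
    if tokenizer is not None:
        return make_padding_collator(tokenizer.pad_token_id)
    return simple_collator


class ToyTokenizer:
    """Hypothetical stand-in exposing only the attribute this sketch needs."""

    pad_token_id = 0


collate = pick_default_collator(tokenizer=ToyTokenizer())
batch = collate([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(batch["input_ids"])
print(batch["attention_mask"])
```

With this shape, a user who passes only a tokenizer gets padded batches without writing a collator, while a user who passes their own data_collator keeps full control.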

As for the bad part, it's just that it adds a new argument to Trainer.

codecov · 2020-08-24T14:52:24Z

Codecov Report

Merging #6689 into master will decrease coverage by 1.73%.
The diff coverage is 55.55%.

@@            Coverage Diff             @@
##           master    #6689      +/-   ##
==========================================
- Coverage   78.98%   77.24%   -1.74%     
==========================================
  Files         156      156              
  Lines       28398    28405       +7     
==========================================
- Hits        22429    21941     -488     
- Misses       5969     6464     +495

Impacted Files	Coverage Δ
src/transformers/trainer.py	`53.66% <55.55%> (-0.13%)`	⬇️
src/transformers/modeling_tf_albert.py	`21.47% <0.00%> (-69.44%)`	⬇️
src/transformers/tokenization_xlm.py	`16.26% <0.00%> (-66.67%)`	⬇️
src/transformers/pipelines.py	`25.63% <0.00%> (-54.32%)`	⬇️
src/transformers/optimization.py	`58.88% <0.00%> (-36.67%)`	⬇️
src/transformers/modeling_tf_gpt2.py	`65.68% <0.00%> (-29.33%)`	⬇️
src/transformers/optimization_tf.py	`33.33% <0.00%> (-24.33%)`	⬇️
src/transformers/modeling_tf_auto.py	`48.79% <0.00%> (-18.08%)`	⬇️
src/transformers/data/processors/squad.py	`13.76% <0.00%> (-14.38%)`	⬇️
src/transformers/modeling_auto.py	`64.36% <0.00%> (-14.37%)`	⬇️
... and 13 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

LysandreJik

I like this a lot and feel like it should have been like this from the start! Since we consider models effectively as model-tokenizer pairs, I think it makes a lot of sense for the trainer to handle both.

This is especially true for saving/reloading, which has always been an issue: users did not understand how to reload from checkpoints because the tokenizers were not saved in the same folder.

jplu · 2020-08-25T11:36:20Z

Nice! I like it. Ok for me to do the same on the TF one 👍

Add tokenizer to Trainer

efc8279

sgugger requested review from julien-c, thomwolf and LysandreJik August 24, 2020 14:48

LysandreJik approved these changes Aug 24, 2020

Merge branch 'master' into trainer_tokenizer

54feec2

sgugger merged commit 124c3d6 into master Aug 25, 2020

sgugger deleted the trainer_tokenizer branch August 25, 2020 11:47

Zigur pushed a commit to Zigur/transformers that referenced this pull request Oct 26, 2020

Add tokenizer to Trainer (huggingface#6689)

d983276
