[tokenizers] Fixing #8001 - Adding tests on tokenizers serialization #8006

thomwolf · 2020-10-23T15:49:26Z

What does this PR do?

Tokenizer classes now have to pass all the keyword arguments of their __init__ up to the tokenizer base class (through super().__init__), where they are stored in init_kwargs so that they are serialized when saving and reloading with save_pretrained/from_pretrained.
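
As a rough illustration of the pattern described above, here is a minimal sketch using toy classes. `ToyBaseTokenizer` and `ToyTokenizer` are hypothetical stand-ins, not the library's actual classes; the only thing taken from this PR is the idea that the subclass forwards its keyword arguments through `super().__init__` so the base class can record them in `init_kwargs`.

```python
import copy


class ToyBaseTokenizer:
    """Hypothetical stand-in for the tokenizer base class."""

    def __init__(self, **kwargs):
        # Keep a copy of every keyword argument so it can be serialized later.
        self.init_kwargs = copy.deepcopy(kwargs)


class ToyTokenizer(ToyBaseTokenizer):
    """Hypothetical subclass with two keyword arguments."""

    def __init__(self, vocab_file=None, do_lower_case=True, **kwargs):
        # Forward every keyword argument up so none is lost on save/reload.
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)
        self.vocab_file = vocab_file
        self.do_lower_case = do_lower_case


tok = ToyTokenizer(do_lower_case=False)
saved = tok.init_kwargs  # contains both vocab_file and do_lower_case
```

If `do_lower_case` were left out of the `super().__init__` call, it would be missing from `init_kwargs` and a reloaded tokenizer would silently fall back to the default value.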

Adds a tokenizer serialization test that checks that all the keyword arguments of the __init__ are found in the saved init_kwargs, to avoid forgetting to send some arguments up in future (and current) tokenizers.
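
One way such a check could be written is sketched below. This is not the test added in this PR: the toy classes and the `missing_init_kwargs` helper are hypothetical names introduced for illustration, and the sketch assumes only that a tokenizer exposes an `init_kwargs` dict. It compares the named parameters of `__init__` against the keys recorded in `init_kwargs`.

```python
import inspect


class ToyBase:
    """Hypothetical base class that records the kwargs it receives."""

    def __init__(self, **kwargs):
        self.init_kwargs = dict(kwargs)


class GoodTokenizer(ToyBase):
    def __init__(self, vocab_file=None, do_lower_case=True, **kwargs):
        super().__init__(vocab_file=vocab_file, do_lower_case=do_lower_case, **kwargs)


class ForgetfulTokenizer(ToyBase):
    def __init__(self, vocab_file=None, do_lower_case=True, **kwargs):
        # Bug on purpose: do_lower_case is never sent up to the base class.
        super().__init__(vocab_file=vocab_file, **kwargs)


def missing_init_kwargs(tokenizer):
    """Return the named __init__ parameters that are absent from init_kwargs."""
    signature = inspect.signature(type(tokenizer).__init__)
    skipped_kinds = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    expected = [
        name
        for name, param in signature.parameters.items()
        if name != "self" and param.kind not in skipped_kinds
    ]
    return [name for name in expected if name not in tokenizer.init_kwargs]
```

With these toy classes, `missing_init_kwargs(GoodTokenizer())` returns an empty list, while `missing_init_kwargs(ForgetfulTokenizer())` reports `do_lower_case`, which is the kind of omission the test is meant to catch.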

Make T5 tokenizer serialization more robust.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

thomwolf · 2020-10-23T22:16:09Z

src/transformers/tokenization_deberta.py

    special_tokens (:obj:`list`, optional):
      List of special tokens to be added to the end of the vocabulary.


  """

-    def __init__(self, vocab_file=None, do_lower_case=True, special_tokens=None):


Was not used in the class so I think it's better to remove it from the init args.

sgugger

Looks very clean!

thomwolf added 2 commits October 23, 2020 17:47

fixing #8001

870d047

make T5 tokenizer serialization more robust - style

a795350

thomwolf marked this pull request as ready for review October 23, 2020 22:12

thomwolf changed the title ~~[WIP|tokenizers] Fixing #8001 - Adding tests on tokenizers serialization~~ [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization Oct 23, 2020

thomwolf requested review from sgugger, LysandreJik and patrickvonplaten October 23, 2020 22:15

sgugger approved these changes Oct 23, 2020

thomwolf merged commit 79eb391 into master Oct 26, 2020

thomwolf deleted the fix-do-lower-case branch October 26, 2020 09:27

fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020

Revert "[tokenizers] Fixing huggingface#8001 - Adding tests on tokenizers serialization (huggingface#8006)"

This reverts commit 9f79b9a.

6abafa2
