[tokenizers] Fixing #8001 - Adding tests on tokenizers serialization #8006

Merged
merged 2 commits into from
Oct 26, 2020

thomwolf
Member

@thomwolf thomwolf commented Oct 23, 2020

What does this PR do?

Fixes #8001

Tokenizer classes now have to pass all the keyword arguments of their __init__ up to the tokenizer base class (via super().__init__), where they are stored in init_kwargs for serialized saving/reloading with save_pretrained/from_pretrained.

Adds a tokenizer serialization test that checks that all the keyword arguments of __init__ are found in the saved init_kwargs, so that future (and current) tokenizers do not forget to pass some arguments up.

Make T5 tokenizer serialization more robust.
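
For illustration, here is a minimal sketch of the pattern described above, written against toy classes rather than the real library. `ToyBaseTokenizer`, `ToyTokenizer`, `missing_init_kwargs` and `roundtrip` are hypothetical names, and the JSON file layout is an assumption made for the example; the actual base class and test in this PR may differ.

```python
import inspect
import json
import os
import tempfile


class ToyBaseTokenizer:
    """Toy stand-in for the tokenizer base class (not the real library API)."""

    def __init__(self, **kwargs):
        # Everything forwarded by subclasses is remembered for serialization.
        self.init_kwargs = dict(kwargs)

    def save_pretrained(self, directory):
        # Assumed layout for this sketch: a single JSON file with the init kwargs.
        path = os.path.join(directory, "tokenizer_config.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(self.init_kwargs, f)

    @classmethod
    def from_pretrained(cls, directory):
        path = os.path.join(directory, "tokenizer_config.json")
        with open(path, encoding="utf-8") as f:
            return cls(**json.load(f))


class ToyTokenizer(ToyBaseTokenizer):
    def __init__(self, do_lower_case=True, unk_token="<unk>", **kwargs):
        # Forward every keyword argument up so it ends up in init_kwargs.
        super().__init__(do_lower_case=do_lower_case, unk_token=unk_token, **kwargs)
        self.do_lower_case = do_lower_case
        self.unk_token = unk_token


def missing_init_kwargs(tokenizer):
    """Return the names of __init__ keyword arguments absent from init_kwargs."""
    params = inspect.signature(type(tokenizer).__init__).parameters
    expected = [
        name
        for name, p in params.items()
        if name != "self" and p.kind not in (p.VAR_POSITIONAL, p.VAR_KEYWORD)
    ]
    return [name for name in expected if name not in tokenizer.init_kwargs]


def roundtrip(tokenizer):
    """Save the tokenizer to a temporary directory and load it back."""
    with tempfile.TemporaryDirectory() as tmp:
        tokenizer.save_pretrained(tmp)
        return type(tokenizer).from_pretrained(tmp)
```

A check in the spirit of the new test would assert that `missing_init_kwargs(tok)` is empty and that a non-default value such as `do_lower_case=False` survives `roundtrip(tok)`. A subclass that forgets to forward an argument would show up in the returned list.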

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@thomwolf thomwolf marked this pull request as ready for review October 23, 2020 22:12
@thomwolf thomwolf changed the title [WIP|tokenizers] Fixing #8001 - Adding tests on tokenizers serialization [tokenizers] Fixing #8001 - Adding tests on tokenizers serialization Oct 23, 2020
special_tokens (:obj:`list`, optional):
List of special tokens to be added to the end of the vocabulary.


"""

def __init__(self, vocab_file=None, do_lower_case=True, special_tokens=None):
Member Author

Was not used in the class so I think it's better to remove it from the init args.

Collaborator

@sgugger sgugger left a comment

Looks very clean!

@thomwolf thomwolf merged commit 79eb391 into master Oct 26, 2020
@thomwolf thomwolf deleted the fix-do-lower-case branch October 26, 2020 09:27
fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
…ialization (huggingface#8006)

* fixing huggingface#8001

* make T5 tokenizer serialization more robust - style
fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020
Successfully merging this pull request may close these issues.

do_lower_case not saved/loaded correctly for Tokenizers