do_lower_case not saved/loaded correctly for Tokenizers #8001

Closed
tholor opened this issue Oct 23, 2020 · 2 comments · Fixed by #8006

Comments

@tholor
Contributor

tholor commented Oct 23, 2020

Environment info

  • transformers version: 3.4.0
  • Platform: Linux-5.4.0-52-generic-x86_64-with-debian-buster-sid
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.5.1 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help

@mfuntowicz

Information

The do_lower_case property of BertTokenizer is not correctly restored after saving / loading.

To reproduce

from transformers import BertTokenizer

# bert-base-cased is a cased checkpoint, so this prints False
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
print(tokenizer.do_lower_case)

# Save to a local directory and load it back; the value should be unchanged
tokenizer.save_pretrained("debug_tokenizer")
tokenizer_loaded = BertTokenizer.from_pretrained("debug_tokenizer")
print(tokenizer_loaded.do_lower_case)

returns

False
True

Expected behavior

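The loaded tokenizer should have the same attributes as the one that was saved; here, do_lower_case should still be False after saving / loading.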
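The expected behavior can be phrased as a round-trip check. The sketch below is only an illustration: `roundtrip_mismatches`, `ToyTokenizer`, `LossyToyTokenizer`, and the config file name are hypothetical names made up for this example, not part of transformers, and the toy classes do not reflect how the library actually serializes tokenizers. The helper only assumes an object with `save_pretrained(directory)` and a `from_pretrained(directory)` classmethod, which is the interface used in the reproduction above.

```python
import json
import os
import tempfile


def roundtrip_mismatches(obj, attrs):
    """Save obj, reload it through its own class, and report changed attributes.

    Returns a dict mapping attribute name -> (value before, value after)
    for every attribute in attrs that differs after the round trip.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        obj.save_pretrained(tmp_dir)
        reloaded = type(obj).from_pretrained(tmp_dir)
    return {
        name: (getattr(obj, name), getattr(reloaded, name))
        for name in attrs
        if getattr(obj, name) != getattr(reloaded, name)
    }


class ToyTokenizer:
    """Hypothetical stand-in, NOT the transformers implementation."""

    CONFIG_NAME = "toy_tokenizer_config.json"  # made-up file name

    def __init__(self, do_lower_case=True):
        self.do_lower_case = do_lower_case

    def save_pretrained(self, directory):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, self.CONFIG_NAME), "w") as f:
            json.dump({"do_lower_case": self.do_lower_case}, f)

    @classmethod
    def from_pretrained(cls, directory):
        with open(os.path.join(directory, cls.CONFIG_NAME)) as f:
            return cls(**json.load(f))


class LossyToyTokenizer(ToyTokenizer):
    """Toy that forgets to persist its settings, mimicking the reported symptom."""

    def save_pretrained(self, directory):
        os.makedirs(directory, exist_ok=True)
        with open(os.path.join(directory, self.CONFIG_NAME), "w") as f:
            json.dump({}, f)  # nothing is saved, so the default (True) comes back


# A faithful round trip reports no differences.
print(roundtrip_mismatches(ToyTokenizer(do_lower_case=False), ["do_lower_case"]))
# A lossy round trip reports the same False -> True flip described in this issue.
print(roundtrip_mismatches(LossyToyTokenizer(do_lower_case=False), ["do_lower_case"]))
```

With a real tokenizer, the same helper could presumably be called as `roundtrip_mismatches(tokenizer, ["do_lower_case"])`; an empty dict would indicate that the attribute survived saving and loading.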

@thomwolf
Member

Oh! I'll take a look, thanks for the report @tholor

thomwolf added a commit that referenced this issue Oct 23, 2020
thomwolf added a commit that referenced this issue Oct 26, 2020
…8006)

* fixing #8001

* make T5 tokenizer serialization more robust - style
@tholor
Contributor Author

tholor commented Oct 26, 2020

Thanks for the fast fix @thomwolf ! Very much appreciated!
