
Fix tokenizers caching #502

Merged: 7 commits merged into master on Aug 19, 2020
Conversation

@lhoestq (Member) commented on Aug 13, 2020:

I've found some cases where the caching didn't work properly for tokenizers:

  1. if a tokenizer has a regex pattern, then the caching would be inconsistent across sessions
  2. if a tokenizer has a cache attribute that changes after some calls, then the caching would not work after cache updates
  3. if a tokenizer is used inside a function, the caching of this function would result in the same cache file for different tokenizers
  4. if the unique_no_split_tokens attribute is not the same across sessions (after loading a tokenizer), then the caching could be inconsistent

To fix that, this is what I did:

  1. register a specific save_regex function for pickle that makes regex dumps deterministic (a rough sketch of the idea is included after this comment)
  2. ignore cache attribute of some tokenizers before dumping
  3. enable recursive dump by default for all dumps
  4. make unique_no_split_tokens deterministic via transformers#6461 ("Sort unique_no_split_tokens to make it deterministic")

I also added tests to make sure that tokenizer hashing works as expected.
In the future we should find a way to test that hashing also works across sessions (maybe using two CI jobs? or by hardcoding a tokenizer's hash?).
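To illustrate point 1, here is a minimal sketch of a deterministic regex reducer built on the standard copyreg module; the PR itself registers a save_regex handler for the library's own pickler, so the registration call below is only an assumption for illustration:

```python
import copyreg
import re

def save_regex(pattern):
    # Reduce a compiled regex to (re.compile, (pattern_string, flags)):
    # the same pattern then always pickles to the same bytes, no matter
    # which session compiled it.
    return re.compile, (pattern.pattern, pattern.flags)

# Register the reducer for plain pickle (the PR does the equivalent for
# the pickler used to compute cache fingerprints).
copyreg.pickle(re.Pattern, save_regex)
```

For point 3, my reading of "recursive dump" is the kind of behavior dill exposes via dill.dumps(obj, recurse=True), which also serializes the globals a function refers to; treat that mapping as an assumption rather than the PR's exact code.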

@lhoestq (Member, Author) commented on Aug 13, 2020:

This should fix #501 and also the issue you sent me on Slack, @sgugger.

@thomwolf (Member) left a comment:

Very nice! This is really important for us!

if _transformers_available:
    import transformers as tr

    if isinstance(obj, (tr.CTRLTokenizer, tr.GPT2Tokenizer, tr.OpenAIGPTTokenizer, tr.XLMTokenizer)):
Review comment (Member):
This list might be a bit cumbersome to maintain in the future.
Should we instead just check that the cache attribute exists and is a dict?

@lhoestq (Member, Author) replied:
I changed it so that it checks whether cache exists and is a dict.
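A minimal sketch of that kind of check (hypothetical names, not the PR's actual code): before dumping, an object that carries a dict-valued cache attribute has it emptied so later cache growth doesn't change the fingerprint.

```python
def strip_tokenizer_cache(obj):
    # Hypothetical helper: if the object exposes a dict-valued `cache`
    # attribute (as GPT2/CTRL-style tokenizers do), pickle a copy of its
    # state with an empty cache so the hash stays stable across calls.
    if hasattr(obj, "cache") and isinstance(obj.cache, dict):
        state = obj.__dict__.copy()
        state["cache"] = {}
        return state
    return obj.__dict__
```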

co_filename = "" if obj.co_filename.startswith("<") else obj.co_filename
co_firstlineno = 1 if obj.co_filename.startswith("<") else obj.co_firstlineno
co_filename = "" if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_filename
co_firstlineno = 1 if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_firstlineno
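With the added co_name == "<lambda>" condition, two identical lambdas defined at different source locations now dump (and hash) the same, since their file/line fields are blanked before pickling. A quick look at the underlying difference this papers over (plain Python, not the PR's test code):

```python
f = lambda x: x + 1
g = lambda x: x + 1

# Identical bytecode:
assert f.__code__.co_code == g.__code__.co_code
# Different source locations (when run as a script), which would
# otherwise leak into the pickled dump and break cache hits:
print(f.__code__.co_firstlineno, g.__code__.co_firstlineno)
```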
Review comment (Member):
Good!

@thomwolf (Member) left a comment:

Good to merge for me when you are happy with it :-)

lhoestq merged commit 0a45189 into master on Aug 19, 2020, and deleted the fix-tokenizers-caching branch (Aug 19, 2020, 13:37).