Fix tokenizers caching #502
Conversation
Very nice! This is really important for us!
src/nlp/utils/py_utils.py (outdated)
if _transformers_available:
    import transformers as tr

    if isinstance(obj, (tr.CTRLTokenizer, tr.GPT2Tokenizer, tr.OpenAIGPTTokenizer, tr.XLMTokenizer)):
This list might be a bit cumbersome to maintain in the future.
Should we just do a check that the `cache` attribute exists and is a dict maybe?
I changed it so that it checks if `cache` exists and is a dict.
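The attribute-based check the discussion converges on can be sketched as follows. This is a minimal illustration, not the PR's actual code: the helper name and the `FakeTokenizer` class are hypothetical.

```python
def has_tokenizer_style_cache(obj) -> bool:
    # Instead of maintaining an explicit list of tokenizer classes,
    # detect the thing that actually breaks hashing: a dict-valued
    # `cache` attribute that fills up as the tokenizer is used.
    return hasattr(obj, "cache") and isinstance(obj.cache, dict)


class FakeTokenizer:
    def __init__(self):
        self.cache = {}  # filled lazily during tokenization


print(has_tokenizer_style_cache(FakeTokenizer()))  # True
print(has_tokenizer_style_cache(object()))         # False
```

This way new tokenizer classes with the same caching behavior are covered automatically, without updating an isinstance list.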
- co_filename = "" if obj.co_filename.startswith("<") else obj.co_filename
- co_firstlineno = 1 if obj.co_filename.startswith("<") else obj.co_firstlineno
+ co_filename = "" if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_filename
+ co_firstlineno = 1 if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_firstlineno
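What the diff above does, sketched in isolation (the helper name `normalized_code_fields` is hypothetical, not the library's actual function): code objects from interactive sessions (`"<stdin>"`, notebook cells) and lambdas carry filenames and line numbers that change between sessions, so they are blanked out before hashing.

```python
def normalized_code_fields(code):
    # Interactive code objects and lambdas get unstable filenames /
    # line numbers across sessions, so normalize them away.
    unstable = code.co_filename.startswith("<") or code.co_name == "<lambda>"
    co_filename = "" if unstable else code.co_filename
    co_firstlineno = 1 if unstable else code.co_firstlineno
    return co_filename, co_firstlineno


f = lambda x: x + 1
print(normalized_code_fields(f.__code__))  # ('', 1), wherever the lambda lives
```

Without this normalization, the same lambda defined in two different files (or at two different line numbers) would hash differently and defeat the cache.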
Good!
Good to merge for me when you are happy with it :-)
I've found some cases where the caching didn't work properly for tokenizers:
- if the `unique_no_split_tokens` attribute is not the same across sessions (after loading a tokenizer), then the caching could be inconsistent

To fix that, this is what I did:
- added a `save_regex` function for pickle that makes regex dumps deterministic
- made `unique_no_split_tokens` deterministic in "Sort unique_no_split_tokens to make it deterministic" (transformers#6461)

I also added tests to make sure that tokenizers hashing works as expected.
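A minimal sketch of the deterministic-regex-pickling idea, using the stdlib `copyreg` and `re` modules (the PR concerns compiled patterns such as those from the third-party `regex` module, and `_save_regex` here is an illustrative stand-in, not the PR's actual implementation). Reducing a compiled pattern to its pattern string and flags means identical patterns always pickle to identical bytes.

```python
import copyreg
import pickle
import re


def _save_regex(pattern):
    # Reduce a compiled pattern to (constructor, args): the pattern
    # string and flags pickle deterministically.
    return re.compile, (pattern.pattern, pattern.flags)


# Register the reducer for compiled pattern objects.
copyreg.pickle(type(re.compile("")), _save_regex)

p = re.compile(r"\w+", re.IGNORECASE)
dump1 = pickle.dumps(p)
dump2 = pickle.dumps(re.compile(r"\w+", re.IGNORECASE))
print(dump1 == dump2)  # identical bytes for identical patterns
```

With stable bytes for the pickled pattern, the hash computed over a tokenizer's state no longer changes from one session to the next just because it holds a compiled regex.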
In the future we should find a way to test if hashing also works across sessions (maybe using two CI jobs? or by hardcoding a tokenizer's hash?)
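One way the hardcoded-hash idea could look, as a rough sketch: `fingerprint` below is a hypothetical stand-in for the library's real hasher (which may serialize objects differently), and the idea is that a hardcoded expected digest lets a later run, or a second CI job, detect non-determinism across sessions.

```python
import hashlib
import pickle


def fingerprint(obj) -> str:
    # Hypothetical stand-in for the library's hasher: digest the
    # pickled bytes of the object's state.
    return hashlib.md5(pickle.dumps(obj)).hexdigest()


# Sorting unique_no_split_tokens (as in transformers#6461) is what makes
# this state, and hence the digest, reproducible.
state = {"vocab": ["a", "b"], "unique_no_split_tokens": sorted(["<s>", "</s>"])}
h1 = fingerprint(state)
h2 = fingerprint({"vocab": ["a", "b"], "unique_no_split_tokens": sorted(["<s>", "</s>"])})
assert h1 == h2  # deterministic within a session; hardcode h1 to check across sessions
```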