
Fix tokenizers caching #502

Merged: 7 commits merged into master on Aug 19, 2020
Conversation

@lhoestq (Member) commented on Aug 13, 2020:

I've found some cases where the caching didn't work properly for tokenizers:

  1. if a tokenizer has a regex pattern, then the caching would be inconsistent across sessions
  2. if a tokenizer has a cache attribute that changes after some calls, then the caching would not work after cache updates
  3. if a tokenizer is used inside a function, the caching of this function would result in the same cache file for different tokenizers
  4. if the unique_no_split_tokens attribute is not the same across sessions (after loading a tokenizer), then the caching could be inconsistent

To fix that, this is what I did:

  1. register a specific save_regex function for pickle that makes regex dumps deterministic (a rough sketch of the idea is included after this comment)
  2. ignore cache attribute of some tokenizers before dumping
  3. enable recursive dump by default for all dumps
  4. make unique_no_split_tokens deterministic via transformers#6461 ("Sort unique_no_split_tokens to make it deterministic")

I also added tests to make sure that tokenizer hashing works as expected.
In the future we should find a way to test that hashing also works across sessions (maybe using two CI jobs? or by hardcoding a tokenizer's hash?).
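To illustrate point 1, here is a minimal sketch of a deterministic regex reducer built on the standard copyreg module; the PR itself registers a save_regex handler for the library's own pickler, so the registration call below is only an assumption for illustration:

```python
import copyreg
import re

def save_regex(pattern):
    # Reduce a compiled regex to (re.compile, (pattern_string, flags)):
    # the same pattern then always pickles to the same bytes, no matter
    # which session compiled it.
    return re.compile, (pattern.pattern, pattern.flags)

# Register the reducer for plain pickle (the PR does the equivalent for
# the pickler used to compute cache fingerprints).
copyreg.pickle(re.Pattern, save_regex)
```

For point 3, my reading of "recursive dump" is the kind of behavior dill exposes via dill.dumps(obj, recurse=True), which also serializes the globals a function refers to; treat that mapping as an assumption rather than the PR's exact code.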

@lhoestq (Member, Author) commented on Aug 13, 2020:

This should fix #501 and also the issue you sent me on Slack, @sgugger.

@thomwolf (Member) left a comment:

Very nice! This is really important for us!

if _transformers_available:
    import transformers as tr

    if isinstance(obj, (tr.CTRLTokenizer, tr.GPT2Tokenizer, tr.OpenAIGPTTokenizer, tr.XLMTokenizer)):
Review comment (Member):
This list might be a bit cumbersome to maintain in the future.
Should we instead just check that the cache attribute exists and is a dict?

@lhoestq (Member, Author) replied:
I changed it so that it checks whether cache exists and is a dict.
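A minimal sketch of that kind of check (hypothetical names, not the PR's actual code): before dumping, an object that carries a dict-valued cache attribute has it emptied so later cache growth doesn't change the fingerprint.

```python
def strip_tokenizer_cache(obj):
    # Hypothetical helper: if the object exposes a dict-valued `cache`
    # attribute (as GPT2/CTRL-style tokenizers do), pickle a copy of its
    # state with an empty cache so the hash stays stable across calls.
    if hasattr(obj, "cache") and isinstance(obj.cache, dict):
        state = obj.__dict__.copy()
        state["cache"] = {}
        return state
    return obj.__dict__
```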

co_filename = "" if obj.co_filename.startswith("<") else obj.co_filename
co_firstlineno = 1 if obj.co_filename.startswith("<") else obj.co_firstlineno
co_filename = "" if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_filename
co_firstlineno = 1 if obj.co_filename.startswith("<") or obj.co_name == "<lambda>" else obj.co_firstlineno
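With the added co_name == "<lambda>" condition, two identical lambdas defined at different source locations now dump (and hash) the same, since their file/line fields are blanked before pickling. A quick look at the underlying difference this papers over (plain Python, not the PR's test code):

```python
f = lambda x: x + 1
g = lambda x: x + 1

# Identical bytecode:
assert f.__code__.co_code == g.__code__.co_code
# Different source locations (when run as a script), which would
# otherwise leak into the pickled dump and break cache hits:
print(f.__code__.co_firstlineno, g.__code__.co_firstlineno)
```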
Review comment (Member):
Good!

@thomwolf (Member) left a comment:

Good to merge for me when you are happy with it :-)

lhoestq merged commit 0a45189 into master on Aug 19, 2020, and deleted the fix-tokenizers-caching branch (Aug 19, 2020, 13:37).