Saved slow tokenizers cannot be loaded in AutoTokenizer after environment change #15283

Closed
LysandreJik opened this issue Jan 21, 2022 · 4 comments

Comments

@LysandreJik
Member

Environment info

  • transformers version: 4.16.0.dev0
  • Platform: Linux-5.16.1-arch1-1-x86_64-with-arch
  • Python version: 3.6.15
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): 0.3.5 (cpu)
  • Jax version: 0.2.17
  • JaxLib version: 0.1.69

Who can help

@SaulLu, @LysandreJik

Information

After saving a slow tokenizer locally, this tokenizer cannot be used with AutoTokenizer after changing environments. The reason is that the tokenizer stores the path to a local file in the tokenizer_file key of its init_kwargs, which then gets saved in the tokenizer_config.json.

The AutoTokenizer inspects that field in order to load the file, but if the environment has changed (for example, the tokenizer is pushed to the hub and re-used on a different computer), it is unable to do so and crashes.

To reproduce

In [2]: from transformers import AutoTokenizer, BertTokenizer
In [3]: tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
In [4]: tokenizer.save_pretrained("local_folder")
Out[4]: 
('local_folder/tokenizer_config.json',
 'local_folder/special_tokens_map.json',
 'local_folder/vocab.txt',
 'local_folder/added_tokens.json')

The tokenizer_config.json looks like this (see tokenizer_file):

{"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "tokenizer_file": "/home/user/.cache/huggingface/transformers/226a307193a9f4344264cdc76a12988448a25345ba172f2c7421f3b6810fddad.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6", "name_or_path": "bert-base-cased", "tokenizer_class": "BertTokenizer"}

If I update this value to something different to simulate a path saved on a different machine, I end up with the following:

In [5]: AutoTokenizer.from_pretrained("local_folder")
Traceback (most recent call last):
  File "/home/lysandre/transformers/.env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-12add8db4ef2>", line 1, in <module>
    AutoTokenizer.from_pretrained("local_folder")
  File "/home/lysandre/transformers/src/transformers/models/auto/tokenization_auto.py", line 545, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1877, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lysandre/transformers/src/transformers/models/bert/tokenization_bert_fast.py", line 188, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_fast.py", line 108, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
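
As a side note, a possible manual workaround (just a sketch, assuming the slow vocabulary files are present in the folder; not a proper fix) is to drop the stale tokenizer_file entry from the saved tokenizer_config.json before loading, so that the fast tokenizer is rebuilt from vocab.txt instead:

import json
import os

# Hypothetical workaround: remove the stale "tokenizer_file" entry from the saved
# config so that AutoTokenizer falls back to converting the slow tokenizer files.
config_path = os.path.join("local_folder", "tokenizer_config.json")
with open(config_path) as f:
    tokenizer_config = json.load(f)
tokenizer_config.pop("tokenizer_file", None)
with open(config_path, "w") as f:
    json.dump(tokenizer_config, f)

# After this, AutoTokenizer.from_pretrained("local_folder") should load again.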

An example of this happening in production is available here, with the TrOCR model (@NielsRogge):

In [2]: from transformers import TrOCRProcessor
In [3]: processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten", revision="554a6621f60cba4f756f4bed2caaa7e6e5b0a2e3")
Downloading: 100%|██████████| 4.03k/4.03k [00:00<00:00, 5.35MB/s]
Downloading: 100%|██████████| 228/228 [00:00<00:00, 184kB/s]
Downloading: 100%|██████████| 1.28k/1.28k [00:00<00:00, 1.04MB/s]
Downloading: 100%|██████████| 878k/878k [00:00<00:00, 6.99MB/s]
Downloading: 100%|██████████| 446k/446k [00:00<00:00, 6.06MB/s]
Downloading: 100%|██████████| 772/772 [00:00<00:00, 1.02MB/s]
Traceback (most recent call last):
  File "/home/lysandre/transformers/.env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-2e8a6ceb8f5c>", line 1, in <module>
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten", revision="554a6621f60cba4f756f4bed2caaa7e6e5b0a2e3")
  File "/home/lysandre/transformers/src/transformers/models/trocr/processing_trocr.py", line 110, in from_pretrained
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/lysandre/transformers/src/transformers/models/auto/tokenization_auto.py", line 545, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1877, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lysandre/transformers/src/transformers/models/roberta/tokenization_roberta_fast.py", line 184, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/models/gpt2/tokenization_gpt2_fast.py", line 146, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_fast.py", line 108, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: Permission denied (os error 13)

I get a permission denied error because the tokenizer_config.json points to the following:

"tokenizer_file": "/root/.cache/huggingface/transformers/e16a2590deb9e6d73711d6e05bf27d832fa8c1162d807222e043ca650a556964.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730"

And here's how you can recreate the same issue:

from transformers import RobertaTokenizer, TrOCRProcessor, ViTFeatureExtractor

# encoder_config is the vision encoder config of the model being built (defined elsewhere)
feature_extractor = ViTFeatureExtractor(size=encoder_config.image_size)
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")  # slow tokenizer
processor = TrOCRProcessor(feature_extractor, tokenizer)

processor.save_pretrained("local_folder")
@SaulLu
Contributor

SaulLu commented Jan 26, 2022

Intriguing behavior indeed! 🕵️‍♂️

By digging a little, I think the problem comes from the fact that the slow version of the tokenizer can save a "tokenizer_file" key in the tokenizer_config.json file, whereas a fast tokenizer never saves that key there. This is because "tokenizer_file" is part of the dictionary attribute vocab_files_names of an XxxTokenizerFast instance but not of an XxxTokenizer instance (a quick check of this is shown below), and any key present in vocab_files_names is popped from the config before it is saved:

tokenizer_config = copy.deepcopy(self.init_kwargs)
if len(self.init_inputs) > 0:
    tokenizer_config["init_inputs"] = copy.deepcopy(self.init_inputs)
for file_id in self.vocab_files_names.keys():
    tokenizer_config.pop(file_id, None)
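
To make the difference concrete, here is a quick check using the Bert tokenizer classes (the exact contents of vocab_files_names may vary slightly across versions):

from transformers import BertTokenizer, BertTokenizerFast

# The fast class declares a "tokenizer_file" entry in vocab_files_names, the slow
# class does not, so only the fast class pops that key out of the saved config.
print("tokenizer_file" in BertTokenizer.vocab_files_names)      # False
print("tokenizer_file" in BertTokenizerFast.vocab_files_names)  # True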

If we look closer, the tokenizer_file is added to the init_kwargs attribute at this point in the code:

additional_files_names = {
    "added_tokens_file": ADDED_TOKENS_FILE,
    "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
    "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
    "tokenizer_file": fast_tokenizer_file,
}
# Look for the tokenizer files
for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():
    if os.path.isdir(pretrained_model_name_or_path):
        if subfolder is not None:
            full_file_name = os.path.join(pretrained_model_name_or_path, subfolder, file_name)
        else:
            full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
        if not os.path.exists(full_file_name):
            logger.info(f"Didn't find file {full_file_name}. We won't load it.")
            full_file_name = None
    else:
        full_file_name = hf_bucket_url(
            pretrained_model_name_or_path,
            filename=file_name,
            subfolder=subfolder,
            revision=revision,
            mirror=None,
        )
    vocab_files[file_id] = full_file_name

So, this leads to another question: why does a slow tokenizer need to know about a tokenizer_file at all? Personally, I think it's just a historical legacy from when tokenizers were not separated into slow and fast versions (see related PRs #5056 and #7659) - but I could be wrong, or I could be missing a use of the tokenizer_file for slow tokenizers. After looking into it, though, I don't think knowing the location of the tokenizer_file is useful for the slow version of a tokenizer.

So I would propose removing the retrieval of this file when the calling class is a slow tokenizer: I have started working on this change in PR #15319 (which needs PR #15328 to be merged first in order to be functional).
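
Roughly, the idea would be to only register "tokenizer_file" in additional_files_names when the calling class is a fast tokenizer, something along these lines (a sketch only, reusing the names from the snippet above; the actual change lives in PR #15319):

additional_files_names = {
    "added_tokens_file": ADDED_TOKENS_FILE,
    "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
    "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
}
# Assumption for illustration: fast tokenizer classes follow the XxxTokenizerFast naming.
if cls.__name__.endswith("Fast"):
    additional_files_names["tokenizer_file"] = fast_tokenizer_file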

This fix will avoid creating configuration files with a tokenizer_file key that is not informative (and worse, as you have shown, a source of errors). What do you think? Does it address the problem you were pointing out?

@LysandreJik
Member Author

I think you're correct, and this should definitely address the problem pointed out above. Thank you, @SaulLu!

@SaulLu
Contributor

SaulLu commented Feb 1, 2022

I'm closing this issue as this should be fixed by #15319 🙂

@Mohammed20201991

+1
