Saved slow tokenizers cannot be loaded in AutoTokenizer after environment change #15283

Closed
LysandreJik opened this issue Jan 21, 2022 · 4 comments

Comments

@LysandreJik
Member

Environment info

  • transformers version: 4.16.0.dev0
  • Platform: Linux-5.16.1-arch1-1-x86_64-with-arch
  • Python version: 3.6.15
  • PyTorch version (GPU?): not installed (NA)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): 0.3.5 (cpu)
  • Jax version: 0.2.17
  • JaxLib version: 0.1.69

Who can help

@SaulLu, @LysandreJik

Information

After saving a slow tokenizer locally, this tokenizer cannot be used with AutoTokenizer after changing environments. The reason is that the tokenizer stores the path to a local file in the tokenizer_file key of its init_kwargs, which then gets saved in the tokenizer_config.json.

The AutoTokenizer inspects that field in order to load the file, but if the environment has changed (for example, the tokenizer is pushed to the hub and re-used on a different computer), it is unable to do so and crashes.

To reproduce

In [2]: from transformers import AutoTokenizer, BertTokenizer
In [3]: tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
In [4]: tokenizer.save_pretrained("local_folder")
Out[4]: 
('local_folder/tokenizer_config.json',
 'local_folder/special_tokens_map.json',
 'local_folder/vocab.txt',
 'local_folder/added_tokens.json')

The tokenizer_config.json looks like this (see tokenizer_file):

{"do_lower_case": false, "do_basic_tokenize": true, "never_split": null, "unk_token": "[UNK]", "sep_token": "[SEP]", "pad_token": "[PAD]", "cls_token": "[CLS]", "mask_token": "[MASK]", "tokenize_chinese_chars": true, "strip_accents": null, "model_max_length": 512, "special_tokens_map_file": null, "tokenizer_file": "/home/user/.cache/huggingface/transformers/226a307193a9f4344264cdc76a12988448a25345ba172f2c7421f3b6810fddad.3dab63143af66769bbb35e3811f75f7e16b2320e12b7935e216bd6159ce6d9a6", "name_or_path": "bert-base-cased", "tokenizer_class": "BertTokenizer"}

If I update this value to something different to simulate a path saved on a different machine, I end up with the following:

In [5]: AutoTokenizer.from_pretrained("local_folder")
Traceback (most recent call last):
  File "/home/lysandre/transformers/.env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-12add8db4ef2>", line 1, in <module>
    AutoTokenizer.from_pretrained("local_folder")
  File "/home/lysandre/transformers/src/transformers/models/auto/tokenization_auto.py", line 545, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1877, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lysandre/transformers/src/transformers/models/bert/tokenization_bert_fast.py", line 188, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_fast.py", line 108, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: No such file or directory (os error 2)
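
As a side note, a possible manual workaround (just a sketch, assuming the slow vocabulary files are present in the folder; not a proper fix) is to drop the stale tokenizer_file entry from the saved tokenizer_config.json before loading, so that the fast tokenizer is rebuilt from vocab.txt instead:

import json
import os

# Hypothetical workaround: remove the stale "tokenizer_file" entry from the saved
# config so that AutoTokenizer falls back to converting the slow tokenizer files.
config_path = os.path.join("local_folder", "tokenizer_config.json")
with open(config_path) as f:
    tokenizer_config = json.load(f)
tokenizer_config.pop("tokenizer_file", None)
with open(config_path, "w") as f:
    json.dump(tokenizer_config, f)

# After this, AutoTokenizer.from_pretrained("local_folder") should load again.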

An example of this happening in production is available here, with the TrOCR model (@NielsRogge):

In [2]: from transformers import TrOCRProcessor
In [3]: processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten", revision="554a6621f60cba4f756f4bed2caaa7e6e5b0a2e3")
Downloading: 100%|██████████| 4.03k/4.03k [00:00<00:00, 5.35MB/s]
Downloading: 100%|██████████| 228/228 [00:00<00:00, 184kB/s]
Downloading: 100%|██████████| 1.28k/1.28k [00:00<00:00, 1.04MB/s]
Downloading: 100%|██████████| 878k/878k [00:00<00:00, 6.99MB/s]
Downloading: 100%|██████████| 446k/446k [00:00<00:00, 6.06MB/s]
Downloading: 100%|██████████| 772/772 [00:00<00:00, 1.02MB/s]
Traceback (most recent call last):
  File "/home/lysandre/transformers/.env/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 3343, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-3-2e8a6ceb8f5c>", line 1, in <module>
    processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten", revision="554a6621f60cba4f756f4bed2caaa7e6e5b0a2e3")
  File "/home/lysandre/transformers/src/transformers/models/trocr/processing_trocr.py", line 110, in from_pretrained
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "/home/lysandre/transformers/src/transformers/models/auto/tokenization_auto.py", line 545, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1749, in from_pretrained
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_base.py", line 1877, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/lysandre/transformers/src/transformers/models/roberta/tokenization_roberta_fast.py", line 184, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/models/gpt2/tokenization_gpt2_fast.py", line 146, in __init__
    **kwargs,
  File "/home/lysandre/transformers/src/transformers/tokenization_utils_fast.py", line 108, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: Permission denied (os error 13)

I get a permission denied error because the tokenizer_config.json points to the following:

"tokenizer_file": "/root/.cache/huggingface/transformers/e16a2590deb9e6d73711d6e05bf27d832fa8c1162d807222e043ca650a556964.fc9576039592f026ad76a1c231b89aee8668488c671dfbe6616bab2ed298d730"

And here's how you can recreate the same issue:

from transformers import RobertaTokenizer, TrOCRProcessor, ViTFeatureExtractor

# encoder_config is the vision encoder config of the model being built (defined elsewhere)
feature_extractor = ViTFeatureExtractor(size=encoder_config.image_size)
tokenizer = RobertaTokenizer.from_pretrained("roberta-large")  # slow tokenizer
processor = TrOCRProcessor(feature_extractor, tokenizer)

processor.save_pretrained("local_folder")
@SaulLu
Contributor

SaulLu commented Jan 26, 2022

Intriguing behavior indeed! 🕵️‍♂️

By digging a little, I think the problem comes from the fact that the slow version of the tokenizer can save a "tokenizer_file" key in the tokenizer_config.json file, whereas a fast tokenizer never saves that key there. This is because "tokenizer_file" is part of the dictionary attribute vocab_files_names of an XxxTokenizerFast instance but not of an XxxTokenizer instance (a quick check of this is shown below), and any key present in vocab_files_names is popped from the config before it is saved:

tokenizer_config = copy.deepcopy(self.init_kwargs)
if len(self.init_inputs) > 0:
    tokenizer_config["init_inputs"] = copy.deepcopy(self.init_inputs)
for file_id in self.vocab_files_names.keys():
    tokenizer_config.pop(file_id, None)
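
To make the difference concrete, here is a quick check using the Bert tokenizer classes (the exact contents of vocab_files_names may vary slightly across versions):

from transformers import BertTokenizer, BertTokenizerFast

# The fast class declares a "tokenizer_file" entry in vocab_files_names, the slow
# class does not, so only the fast class pops that key out of the saved config.
print("tokenizer_file" in BertTokenizer.vocab_files_names)      # False
print("tokenizer_file" in BertTokenizerFast.vocab_files_names)  # True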

If we look closer, the tokenizer_file is added to the init_kwargs attribute at this point in the code:

additional_files_names = {
    "added_tokens_file": ADDED_TOKENS_FILE,
    "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
    "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
    "tokenizer_file": fast_tokenizer_file,
}
# Look for the tokenizer files
for file_id, file_name in {**cls.vocab_files_names, **additional_files_names}.items():
    if os.path.isdir(pretrained_model_name_or_path):
        if subfolder is not None:
            full_file_name = os.path.join(pretrained_model_name_or_path, subfolder, file_name)
        else:
            full_file_name = os.path.join(pretrained_model_name_or_path, file_name)
        if not os.path.exists(full_file_name):
            logger.info(f"Didn't find file {full_file_name}. We won't load it.")
            full_file_name = None
    else:
        full_file_name = hf_bucket_url(
            pretrained_model_name_or_path,
            filename=file_name,
            subfolder=subfolder,
            revision=revision,
            mirror=None,
        )
    vocab_files[file_id] = full_file_name

So, this leads to another question: why does a slow tokenizer need to know about a tokenizer_file at all? Personally, I think it's just a historical legacy from when tokenizers were not separated into slow and fast versions (see related PRs #5056 and #7659) - but I could be wrong, or I could be missing a use of the tokenizer_file for slow tokenizers. After looking into it, though, I don't think knowing the location of the tokenizer_file is useful for the slow version of a tokenizer.

So I would propose removing the retrieval of this file when the calling class is a slow tokenizer: I have started working on this change in PR #15319 (which needs PR #15328 to be merged first in order to be functional).
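
Roughly, the idea would be to only register "tokenizer_file" in additional_files_names when the calling class is a fast tokenizer, something along these lines (a sketch only, reusing the names from the snippet above; the actual change lives in PR #15319):

additional_files_names = {
    "added_tokens_file": ADDED_TOKENS_FILE,
    "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
    "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
}
# Assumption for illustration: fast tokenizer classes follow the XxxTokenizerFast naming.
if cls.__name__.endswith("Fast"):
    additional_files_names["tokenizer_file"] = fast_tokenizer_file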

This fix will avoid creating configuration files with a tokenizer_file key that is not informative (and worse, as you have shown, a source of errors). What do you think? Does it address the problem you were pointing out?

@LysandreJik
Member Author

I think you're correct, and this should definitely address the problem pointed out above. Thank you, @SaulLu!

@SaulLu
Contributor

SaulLu commented Feb 1, 2022

I'm closing this issue as this should be fixed by #15319 🙂

@Mohammed20201991

+1
