Saved slow tokenizers cannot be loaded in AutoTokenizer after environment change #15283
Comments
Intriguing behavior indeed! 🕵️♂️ By digging a little, I personally think that the problem comes from the fact that the slow version of the tokenizer can save a `tokenizer_file` entry in its `tokenizer_config.json` (see `transformers/src/transformers/tokenization_utils_base.py`, lines 2039 to 2043 at 4df6950).
If we look closer, the `tokenizer_file` is also retrieved in `from_pretrained` even when the calling class is a slow tokenizer (see `transformers/src/transformers/tokenization_utils_base.py`, lines 1672 to 1697 at 4df6950).
So this leads to another question: why does a slow tokenizer need to know about a `tokenizer_file` at all? I would therefore propose to remove the retrieval of this file when the calling class is a slow version. I have started to work on this change in PR #15319 (which needs another PR, #15328, to be merged first to be functional). This fix will avoid creating configuration files with a `tokenizer_file` entry pointing to a machine-specific path.
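As an illustration, here is a minimal sketch of the idea (illustrative only, not the actual patch in PR #15319; the helper name and file list below are assumptions): only resolve `tokenizer.json` when the class being instantiated is a fast tokenizer, so a slow tokenizer never records a local `tokenizer_file` path in its `init_kwargs`.

```python
from transformers import PreTrainedTokenizerFast


def files_to_resolve(tokenizer_cls):
    """Return the additional file names that from_pretrained should look for (sketch)."""
    files = {
        "added_tokens_file": "added_tokens.json",
        "special_tokens_map_file": "special_tokens_map.json",
        "tokenizer_config_file": "tokenizer_config.json",
    }
    # Only fast tokenizers actually consume the serialized tokenizer.json, so a
    # slow tokenizer should neither request it nor remember its local path.
    if issubclass(tokenizer_cls, PreTrainedTokenizerFast):
        files["tokenizer_file"] = "tokenizer.json"
    return files
```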
I think you're correct, and this should definitely address the problem pointed out above. Thank you, @SaulLu!
I'm closing this issue as this should be fixed by #15319 🙂
+1
Environment info

- `transformers` version: 4.16.0.dev0

Who can help
@SaulLu, @LysandreJik
Information

After saving a slow tokenizer locally, this tokenizer cannot be used with `AutoTokenizer` after changing environments. The reason is that the tokenizer saves a link to a local file in the `tokenizer_file` attribute of its `init_kwargs`, which then gets saved in the `tokenizer_config.json`. The `AutoTokenizer` inspects that field in order to load the file, but if the environment has changed (for example, the tokenizer was pushed to the Hub and re-used on a different computer), then it is unable to do so and crashes.
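The following is a minimal sketch of the behaviour described above (the checkpoint name `bert-base-uncased` and the output directory are assumptions; any slow tokenizer whose Hub repository also contains a `tokenizer.json` should behave similarly):

```python
import json

from transformers import BertTokenizer  # BertTokenizer is the slow (pure Python) class

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.save_pretrained("./saved_slow_tokenizer")

with open("./saved_slow_tokenizer/tokenizer_config.json") as f:
    config = json.load(f)

# On the affected version this prints an absolute path into the local
# Hugging Face cache of the machine that saved the tokenizer.
print(config.get("tokenizer_file"))
```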
To reproduce

The `tokenizer_config.json` looks like this (see `tokenizer_file`):

If I update this value to something different to simulate a path saved on a different machine, I end up with the following:
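A minimal sketch of simulating that environment change, assuming the tokenizer was saved to `./saved_slow_tokenizer` as in the sketch above (the replacement path below is hypothetical):

```python
import json

from transformers import AutoTokenizer

config_path = "./saved_slow_tokenizer/tokenizer_config.json"
with open(config_path) as f:
    config = json.load(f)

# Hypothetical absolute path that only existed on the machine that saved the tokenizer.
config["tokenizer_file"] = "/home/someone_else/.cache/huggingface/transformers/tokenizer.json"
with open(config_path, "w") as f:
    json.dump(config, f)

# On the affected version this fails instead of falling back to the
# vocabulary files that are actually present in the directory.
tokenizer = AutoTokenizer.from_pretrained("./saved_slow_tokenizer")
```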
An example of this happening in production is available here, with the TrOCR model (@NielsRogge):

I get a permission denied error because the `tokenizer_config.json` points to the following:

And here's how you can recreate the same issue:
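A hypothetical reconstruction of such a reproduction (the checkpoint name `microsoft/trocr-base-handwritten` is an assumption; any TrOCR checkpoint whose `tokenizer_config.json` records a machine-specific `tokenizer_file` path should behave the same way):

```python
from transformers import AutoTokenizer

# On the affected version this can fail (e.g. with a permission or
# file-not-found error) while trying to open the local path recorded
# in the checkpoint's "tokenizer_file" entry.
tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-base-handwritten")
```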