Add tokenizers class mismatch detection between `cls` and checkpoint #12619

europeanplaice · 2021-07-09T21:35:25Z

What does this PR do?

Fixes #12416
This PR detects a mismatch between `cls` and the checkpoint a user intends to load.
However, it cannot detect a mismatch when the config does not contain the tokenizer's information.
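
As a rough illustration of the check described above, the sketch below reads a `tokenizer_class` entry from a checkpoint directory and compares it with the name of the calling class. It is not the PR's code: the helper names `read_tokenizer_class`, `_strip_fast` and `is_mismatch` are hypothetical, and treating slow and fast variants as equivalent is an assumption of this sketch.

```python
import json
from pathlib import Path
from typing import Optional


def read_tokenizer_class(checkpoint_dir: str) -> Optional[str]:
    """Return `tokenizer_class` from tokenizer_config.json, or None if absent."""
    config_path = Path(checkpoint_dir) / "tokenizer_config.json"
    if not config_path.is_file():
        return None
    with open(config_path, encoding="utf-8") as handle:
        return json.load(handle).get("tokenizer_class")


def _strip_fast(name: str) -> str:
    # Assumption of this sketch: slow and fast variants count as the same tokenizer.
    return name[: -len("Fast")] if name.endswith("Fast") else name


def is_mismatch(cls_name: str, checkpoint_class: Optional[str]) -> bool:
    """True only when the checkpoint names a class that differs from `cls_name`."""
    if checkpoint_class is None:
        # Nothing recorded in the checkpoint, so no mismatch can be detected.
        return False
    return _strip_fast(cls_name) != _strip_fast(checkpoint_class)
```

The `None` branch corresponds to the limitation mentioned in the description: when the checkpoint records no tokenizer class, the check stays silent.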

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

SaulLu

Thank you very much for working on the implementation of this check. 😄

I mainly left comments to 1) allow retrieving the tokenizer type from the config type and 2) extend the tests.

Feel free to let me know if I missed anything or if you want me to take over the extension of the tests!

SaulLu · 2021-07-13T10:22:41Z

src/transformers/tokenization_utils_base.py

+                with open(config_file, encoding="utf-8") as config_handle:
+                    config_dict = json.load(config_handle)
+                config_tokenizer_class = config_dict.get("tokenizer_class")
+


-------- EDIT:--------
Reading @sgugger's answer, I agree with him that we can simplify this part and use AutoConfig directly.

-------- Old comment:--------
The addition of the snippet below could therefore solve the limitation that you have shown in the test that you named test_limit_of_match_validation.

It would have to be checked by running all the tests, but I have the impression that by doing the imports at this level we don't have a circular import problem.

# If we have not yet found the original type of the tokenizer we are loading we see if we can infer it from the
# type of the configuration file
if config_dict is not None and config_tokenizer_class is None:
    from .models.auto.configuration_auto import CONFIG_MAPPING
    from .models.auto.tokenization_auto import TOKENIZER_MAPPING

    if "model_type" in config_dict:
        config_class = CONFIG_MAPPING[config_dict["model_type"]]
    else:
        # Fallback: use pattern matching on the string.
        for pattern, config_class_tmp in CONFIG_MAPPING.items():
            if pattern in str(pretrained_model_name_or_path):
                config_class = config_class_tmp
                break

    if config_class in TOKENIZER_MAPPING.keys():
        config_tokenizer_class, config_tokenizer_class_fast = TOKENIZER_MAPPING[config_class]
        if config_tokenizer_class is not None:
            config_tokenizer_class = config_tokenizer_class.__name__
        else:
            config_tokenizer_class = config_tokenizer_class_fast.__name__
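
The lookup order in this suggestion (explicit `model_type` first, then a substring match on the checkpoint name) can be tried out in isolation. The sketch below uses two small hypothetical dictionaries, `MODEL_TYPE_TO_CONFIG` and `CONFIG_TO_TOKENIZERS`, as stand-ins for the library's mappings, with class names as plain strings. Unlike the snippet above, it initializes `config_class` to `None`, so a checkpoint name that matches no pattern returns `None` instead of hitting an unbound variable.

```python
from typing import Dict, Optional, Tuple

# Hypothetical stand-ins for the library's mappings (strings instead of classes).
# More specific patterns come first so that "albert" wins over "bert".
MODEL_TYPE_TO_CONFIG: Dict[str, str] = {
    "albert": "AlbertConfig",
    "bert": "BertConfig",
}
CONFIG_TO_TOKENIZERS: Dict[str, Tuple[Optional[str], Optional[str]]] = {
    "AlbertConfig": ("AlbertTokenizer", "AlbertTokenizerFast"),
    "BertConfig": ("BertTokenizer", "BertTokenizerFast"),
}


def infer_tokenizer_class(config_dict: dict, name_or_path: str) -> Optional[str]:
    """Infer a tokenizer class name from the model config, or return None."""
    config_class = None
    if "model_type" in config_dict:
        config_class = MODEL_TYPE_TO_CONFIG.get(config_dict["model_type"])
    else:
        # Fallback: substring match on the checkpoint name.
        for pattern, candidate in MODEL_TYPE_TO_CONFIG.items():
            if pattern in name_or_path:
                config_class = candidate
                break
    if config_class is None or config_class not in CONFIG_TO_TOKENIZERS:
        return None
    slow, fast = CONFIG_TO_TOKENIZERS[config_class]
    # Prefer the slow class name, fall back to the fast one.
    return slow if slow is not None else fast
```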

Thank you for an excellent suggestion!


SaulLu · 2021-07-13T10:30:54Z

src/transformers/tokenization_utils_base.py

+                    "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. "
+                    "It may result in unexpected tokenization. \n"
+                    f"The tokenizer class you load from this checkpoint is '{config_tokenizer_class}'. \n"
+                    f"The class this function is called from is '{cls.__name__}'."


Great 👍 ! This will really help future users
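
For readers who want to see the message end to end, here is a minimal sketch of how such a warning could be assembled and logged. The wording follows the diff above; the logger name and the helpers `build_mismatch_message` and `warn_if_mismatch` are hypothetical and only serve the illustration.

```python
import logging
from typing import Optional

logger = logging.getLogger("tokenizer_mismatch_sketch")  # hypothetical logger name


def build_mismatch_message(checkpoint_class: str, cls_name: str) -> str:
    """Assemble the warning text, mirroring the wording used in the diff."""
    return (
        "The tokenizer class you load from this checkpoint is not the same type as the class "
        "this function is called from. It may result in unexpected tokenization. \n"
        f"The tokenizer class you load from this checkpoint is '{checkpoint_class}'. \n"
        f"The class this function is called from is '{cls_name}'."
    )


def warn_if_mismatch(checkpoint_class: Optional[str], cls_name: str) -> bool:
    """Log the warning and return True when the two class names differ."""
    if checkpoint_class is not None and checkpoint_class != cls_name:
        logger.warning(build_mismatch_message(checkpoint_class, cls_name))
        return True
    return False
```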

SaulLu · 2021-07-13T10:38:48Z

tests/test_tokenization_base.py

+import unittest
+
+from transformers.models.bert.tokenization_bert import BertTokenizer
+from transformers.models.bert_japanese.tokenization_bert_japanese import BertJapaneseTokenizer
+
+
+class ClassMismatchTest(unittest.TestCase):
+    def test_mismatch_error(self):
+        PRETRAINED_MODEL = "cl-tohoku/bert-base-japanese"
+        with self.assertRaises(ValueError):
+            BertTokenizer.from_pretrained(PRETRAINED_MODEL)
+
+    def test_limit_of_match_validation(self):
+        # Can't detect mismatch because this model's config
+        # doesn't have information about the tokenizer model.
+        PRETRAINED_MODEL = "bert-base-uncased"
+        BertJapaneseTokenizer.from_pretrained(PRETRAINED_MODEL)


Thank you very much for writing this test: we immediately understand the new feature!

As the added changes concern all tokenizers, not only BertTokenizer and BertJapaneseTokenizer, I think it would be interesting to test the warning logged on all tokenizers by adding a new test to TokenizerTesterMixin in the test_tokenization_common.py file. This new test could, for example, look something like this:

def test_tokenizer_mismatch_warning(self):
    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            if self.tokenizer_class != BertTokenizer:
                with self.assertLogs("transformers", level="WARNING") as cm:
                    try:
                        BertTokenizer.from_pretrained(pretrained_name)
                    except (TypeError, AttributeError):
                        # Some tokenizers cannot be loaded into `BertTokenizer` at all and errors are returned,
                        # here we just check that the warning has been logged before the error is raised
                        pass
                    finally:
                        self.assertTrue(
                            cm.records[0].message.startswith(
                                "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from."
                            )
                        )
            if self.rust_tokenizer_class != BertTokenizerFast:
                with self.assertLogs("transformers", level="WARNING") as cm:
                    try:
                        BertTokenizerFast.from_pretrained(pretrained_name)
                    except (TypeError, AttributeError):
                        # Some tokenizers cannot be loaded into `BertTokenizerFast` at all and errors are returned,
                        # here we just check that the warning has been logged before the error is raised
                        pass
                    finally:
                        self.assertTrue(
                            cm.records[0].message.startswith(
                                "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from."
                            )
                        )

What do you think?

PS: I can of course help make this change if needed, especially as an adaptation will have to be made for PreTrainedTokenizerFast.

It's an excellent idea, and I'd like this test to cover all tokenizers, including BertTokenizer and BertJapaneseTokenizer. I changed your suggestion as follows. Is anything missing from the test?

def test_tokenizer_mismatch_warning(self):
    for tokenizer, pretrained_name, kwargs in self.tokenizers_list:
        with self.subTest(f"{tokenizer.__class__.__name__} ({pretrained_name})"):
            with self.assertLogs("transformers", level="WARNING") as cm:
                try:
                    if self.tokenizer_class == BertTokenizer:
                        AlbertTokenizer.from_pretrained(pretrained_name)
                    else:
                        BertTokenizer.from_pretrained(pretrained_name)
                except (TypeError, AttributeError):
                    # Some tokenizers cannot be loaded into the target tokenizer at all and errors are returned,
                    # here we just check that the warning has been logged before the error is raised
                    pass
                finally:
                    self.assertTrue(
                        cm.records[0].message.startswith(
                            "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from."
                        )
                    )
                try:
                    if self.rust_tokenizer_class == BertTokenizerFast:
                        AlbertTokenizerFast.from_pretrained(pretrained_name)
                    else:
                        BertTokenizerFast.from_pretrained(pretrained_name)
                except (TypeError, AttributeError):
                    # Some tokenizers cannot be loaded into the target tokenizer at all and errors are returned,
                    # here we just check that the warning has been logged before the error is raised
                    pass
                finally:
                    self.assertTrue(
                        cm.records[0].message.startswith(
                            "The tokenizer class you load from this checkpoint is not the same type as the class this function is called from."
                        )
                    )

Looks great to me! 🙂
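
The pattern discussed in this thread (assert that the warning was logged even if loading then fails) can be reproduced without any tokenizer. The sketch below is a self-contained approximation: `load_with_wrong_class`, the logger name `mismatch_demo` and `run_demo` are hypothetical, and the loader simply warns and then raises to mimic a tokenizer that cannot be built from an incompatible checkpoint.

```python
import logging
import unittest

logger = logging.getLogger("mismatch_demo")  # hypothetical logger name


def load_with_wrong_class() -> None:
    """Hypothetical loader: warns about the mismatch, then fails to build."""
    logger.warning("The tokenizer class you load from this checkpoint is not the same type ...")
    raise TypeError("vocabulary format not understood")


class MismatchWarningTest(unittest.TestCase):
    def test_warning_logged_before_error(self):
        with self.assertLogs("mismatch_demo", level="WARNING") as cm:
            try:
                load_with_wrong_class()
            except (TypeError, AttributeError):
                # Loading may fail; only the warning matters for this test.
                pass
            finally:
                self.assertTrue(
                    cm.records[0].getMessage().startswith("The tokenizer class you load")
                )


def run_demo() -> unittest.TestResult:
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MismatchWarningTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because the exception is swallowed inside the `assertLogs` block, the context manager still sees the captured record and the `finally` clause can inspect it whether or not the load succeeded.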


SaulLu · 2021-07-13T12:05:16Z

src/transformers/models/bert_japanese/tokenization_bert_japanese.py

@@ -132,7 +132,7 @@ def __init__(
        if not os.path.isfile(vocab_file):
            raise ValueError(
                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
-                "model use `tokenizer = BertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
+                "model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"


It seems to me that the other tokenizers in the transformers library specify the specific class of tokenizer here instead of the generic AutoTokenizer. Was there any particular reason to prefer AutoTokenizer to BertJapaneseTokenizer? 🙂

Suggested change

-                "model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
+                "model use `tokenizer = BertJapaneseTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"

I don't have a strong opinion about it either. I chose AutoTokenizer because I thought pointing users to AutoTokenizer would avoid a problem like the one in this issue.

I think it's better to encourage users to use the AutoTokenizer class.

Thanks a lot for your feedback @sgugger ! In that case, @europeanplaice your proposal is great - you can ignore my previous comment.

@sgugger, should we take this opportunity to make the same change in other tokenizers that log the same type of message (cf. PR #12745)?

Yes, that was a great idea!

sgugger

Thanks for the PR. I think it needs to be adjusted a bit to keep model configuration independent from tokenizers, as much as possible.

sgugger · 2021-07-13T12:07:48Z

src/transformers/tokenization_utils_base.py

@@ -111,6 +111,7 @@ class EncodingFast:
 SPECIAL_TOKENS_MAP_FILE = "special_tokens_map.json"
 ADDED_TOKENS_FILE = "added_tokens.json"
 TOKENIZER_CONFIG_FILE = "tokenizer_config.json"
+CONFIG_FILE = "config.json"


No, this file is the model configuration. It has nothing to do with the tokenizer and should not be put here.

I agree. AutoConfig.from_pretrained makes this line unnecessary.

sgugger · 2021-07-13T12:07:54Z

src/transformers/tokenization_utils_base.py

@@ -1639,6 +1640,7 @@ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike],
                "special_tokens_map_file": SPECIAL_TOKENS_MAP_FILE,
                "tokenizer_config_file": TOKENIZER_CONFIG_FILE,
                "tokenizer_file": FULL_TOKENIZER_FILE,
+                "config_file": CONFIG_FILE,


I agree too.

sgugger · 2021-07-13T12:08:35Z

src/transformers/tokenization_utils_base.py

+        if tokenizer_config_file is None or config_tokenizer_class is None:
+            config_file = resolved_vocab_files.pop("config_file", None)
+            if config_file is not None:
+                with open(config_file, encoding="utf-8") as config_handle:
+                    config_dict = json.load(config_handle)
+                config_tokenizer_class = config_dict.get("tokenizer_class")


We should rely on AutoConfig.from_pretrained for this blob (inside a try block).

Thank you for your review. It is better than my code. I also avoided a circular import by importing AutoConfig inside _from_pretrained rather than at the top level (thanks to #12619 (comment)).
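
A minimal sketch of the two ideas in this exchange, under stated assumptions: the config lookup is wrapped in a `try` block so that a missing or unreadable config simply yields `None`, and the import of `AutoConfig` is deferred to call time. The function names and the injectable `load_config` parameter are hypothetical, added so the fallback can be exercised without downloading anything; the set of caught exceptions is also an assumption.

```python
from typing import Callable, Optional


def _default_loader(name_or_path: str) -> object:
    # Deferred import: resolved when the function is called rather than when the
    # module is imported, which is what sidesteps an import cycle between modules.
    from transformers import AutoConfig  # only needed if this loader is used

    return AutoConfig.from_pretrained(name_or_path)


def tokenizer_class_from_config(
    name_or_path: str,
    load_config: Callable[[str], object] = _default_loader,
) -> Optional[str]:
    """Second attempt: ask the model config which tokenizer class it names."""
    try:
        config = load_config(name_or_path)
    except (OSError, ValueError, KeyError):
        # Assumption: a missing or unrecognized config is not an error here.
        return None
    return getattr(config, "tokenizer_class", None)
```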

sgugger · 2021-07-13T12:10:13Z

tests/test_tokenization_base.py

@@ -0,0 +1,17 @@
+import unittest
+
+from transformers.models.bert.tokenization_bert import BertTokenizer


This test should go in an existing test file, for instance the one already testing BertJapaneseTokenizer or common tokenizer test file.

I'll remove test_tokenization_base.py and use the test from #12619 (comment) instead.


europeanplaice · 2021-07-16T14:23:03Z

I revised the code based on your reviews.

sgugger

There is still one last failure in the tests; I left a pointer below to where it originates.

I'm thinking this logic of getting the tokenizer class (which is used here and in the AutoClass) could probably be refactored into a function. I can do that in a follow-up if you prefer. Let me know!

sgugger · 2021-07-16T14:52:11Z

src/transformers/tokenization_utils_base.py

+
+            # Second attempt. If we have not yet found tokenizer_class, let's try to use the config.
+            try:
+                config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)


Passing the kwargs along here seems to break the tests. Also I don't think we need them?

europeanplaice · 2021-07-16T15:01:11Z

I'd like to ask you to handle the refactoring of that logic.
Thank you for offering!

sgugger · 2021-07-17T08:12:09Z

@SaulLu could you confirm you're happy with the changes? I think this is good to be merged on my side, thanks for the adjustments @europeanplaice.

SaulLu

I also share your opinion @sgugger ! Thanks a lot for the addition @europeanplaice , it's great to have this warning logged!

SaulLu · 2021-07-17T09:04:53Z

tests/test_tokenization_fast.py

+    def test_tokenizer_mismatch_warning(self):
+        # We disable this test for PreTrainedTokenizerFast because it is the only tokenizer that is not linked to any
+        # model
+        pass
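
The override above relies on a common `unittest` arrangement: a mixin defines the shared test, and a subclass that cannot meaningfully run it redefines the method as a no-op. The sketch below shows that arrangement with hypothetical class names and a hypothetical `linked_to_model` attribute; it does not reproduce the library's test classes.

```python
import unittest


class CommonTesterMixin:
    """Hypothetical mixin holding a test shared by every tokenizer test class."""

    linked_to_model = True

    def test_mismatch_warning(self):
        # Stand-in for the real check, which needs a model-specific tokenizer.
        self.assertTrue(self.linked_to_model)


class ModelTokenizerTest(CommonTesterMixin, unittest.TestCase):
    """Inherits the shared test unchanged."""


class GenericFastTokenizerTest(CommonTesterMixin, unittest.TestCase):
    linked_to_model = False

    def test_mismatch_warning(self):
        # Overridden as a no-op: this class is not tied to any model.
        pass


def run_demo() -> unittest.TestResult:
    """Run both test classes and return the combined result."""
    loader = unittest.defaultTestLoader
    suite = unittest.TestSuite(
        [
            loader.loadTestsFromTestCase(ModelTokenizerTest),
            loader.loadTestsFromTestCase(GenericFastTokenizerTest),
        ]
    )
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

One trade-off of a `pass` override is that the test is reported as passed rather than skipped; `self.skipTest(...)` would make the omission visible in the test report.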


europeanplaice · 2021-07-17T13:34:39Z

@SaulLu @sgugger
We did an excellent job! Thank you very much for your help!

europeanplaice added 3 commits July 10, 2021 06:20

Detect mismatch by analyzing config

de3725f

Fix comment

4fc28ed

Fix import

faa4fa3

LysandreJik requested review from sgugger, LysandreJik and SaulLu July 13, 2021 07:50

SaulLu suggested changes Jul 13, 2021


sgugger reviewed Jul 13, 2021


Update src/transformers/tokenization_utils_base.py

05b43e8

Co-authored-by: SaulLu <55560583+SaulLu@users.noreply.github.com>

SaulLu mentioned this pull request Jul 15, 2021

Replace specific tokenizer in log message by AutoTokenizer #12745

Merged

europeanplaice added 4 commits July 16, 2021 22:05

Revise based on reviews

b998ac5

Merge remote-tracking branch 'origin/master' into tokenizer_class_check

b2db7c5

Merge branch 'huggingface:master' into tokenizer_class_check

a2050ba

Merge branch 'tokenizer_class_check' of https://github.com/europeanpl…

7d33c28

…aice/transformers into tokenizer_class_check

sgugger reviewed Jul 16, 2021


europeanplaice added 4 commits July 17, 2021 13:57

remove kwargs

c440af2

Fix exception

634e31d

Fix handling exception again

f36c44f

Disable mismatch test in PreTrainedTokenizerFast

2ac9387

SaulLu self-requested a review July 17, 2021 09:03

SaulLu approved these changes Jul 17, 2021


sgugger merged commit 08d609b into huggingface:master Jul 17, 2021
