Add more tests on tokenizers serialization - fix bugs #5056
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #5056      +/-   ##
==========================================
+ Coverage   77.96%   78.02%   +0.05%
==========================================
  Files         138      138
  Lines       23838    23847       +9
==========================================
+ Hits        18585    18606      +21
+ Misses       5253     5241      -12

Continue to review full report at Codecov.
# until the serialization of Fast tokenizers is updated
self.added_tokens_encoder: Dict[str, int] = {}
self.added_tokens_decoder: Dict[int, str] = {}
self.unique_no_split_tokens: List[str] = []
Some of the tokens we want to avoid splitting on are actually not added tokens but tokens already in the base vocabulary (e.g. `[MASK]` is in the Albert vocab, but if we don't take special care of it, it will be split by SentencePiece magic into `[`, `MASK`, `]` 🙃).
I renamed this internal variable to make this clearer.
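To illustrate the idea, here is a hedged sketch (not the actual transformers implementation; `tokenize_with_no_split` and `sp_tokenize` are hypothetical names): protected tokens are cut out of the text before the SentencePiece model ever sees them.

```python
import re
from typing import Callable, List

def tokenize_with_no_split(
    text: str, no_split_tokens: List[str], sp_tokenize: Callable[[str], List[str]]
) -> List[str]:
    # Capture the protected tokens in the split pattern so they survive as chunks.
    pattern = "(" + "|".join(re.escape(t) for t in no_split_tokens) + ")"
    tokens: List[str] = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if chunk in no_split_tokens:
            tokens.append(chunk)  # emit e.g. "[MASK]" untouched
        else:
            tokens.extend(sp_tokenize(chunk))  # everything else goes through SentencePiece
    return tokens

# Example with a stand-in for the SentencePiece tokenizer:
print(tokenize_with_no_split("hello [MASK] world", ["[MASK]"], str.split))
# ['hello', '[MASK]', 'world']
```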
That's clean!
assert index == len(tokenizer), (
    f"Non-consecutive added token '{token}' found. "
    f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
)
This will now raise an error if non-consecutive tokens are provided in the serialized vocabulary.
I contemplated making this a warning only, but I think it's better to enforce good practices here than to keep backward compatibility. If your vocabulary has "holes" in it, something went wrong somewhere, and reassigning the token to a new index will be a source of silent errors.
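For context, a minimal sketch of the loading loop this assert lives in (the file handling and names such as `load_added_tokens` and `added_tok_encoder` are illustrative, not the exact transformers internals):

```python
import json

def load_added_tokens(tokenizer, added_tokens_file: str) -> None:
    with open(added_tokens_file, encoding="utf-8") as f:
        added_tok_encoder = json.load(f)  # maps token string -> saved index
    # Re-add tokens in index order so each one lands on the index it was saved with.
    for token, index in sorted(added_tok_encoder.items(), key=lambda item: item[1]):
        assert index == len(tokenizer), (
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
        )
        tokenizer.add_tokens([token])
```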
Won't this fail with CTRL? I recall hearing that CTRL had such an issue
It works the same in `tokenizers`, except that I went with warnings. My thinking was: if this is happening, it means the user just modified the file manually or got it from someone else, so she will try to load it and see the warnings right away, since this happens at the very beginning.
Could make this an error too quite easily though, as I agree that this is probably better!
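If one wanted the warning behaviour described here instead of a hard error, the check could look roughly like this (a hypothetical variant, not what the PR merged):

```python
import logging

logger = logging.getLogger(__name__)

def check_added_token_index(token: str, index: int, expected: int) -> None:
    # Warn instead of raising, so a vocabulary with "holes" still loads,
    # at the cost of silently reassigning the token to a new index.
    if index != expected:
        logger.warning(
            "Non-consecutive added token '%s' found: expected index %d but got %d.",
            token, expected, index,
        )
```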
This is very cool. Great to have some new tests!
@@ -156,28 +156,62 @@ def test_tokenizers_common_properties(self):

    def test_save_and_load_tokenizer(self):
Maybe this test could be split into a few different tests later on? It's starting to get a bit thick.
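One hypothetical way it could be split later (test names are illustrative only, not from the PR):

```python
import unittest

class TokenizerSerializationTest(unittest.TestCase):
    def test_save_and_load_plain_tokenizer(self):
        ...

    def test_save_and_load_with_added_tokens(self):
        ...

    def test_save_and_load_with_added_special_tokens(self):
        ...
```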
LGTM!
* update tests for fast tokenizers + fix small bug in saving/loading
* better tests on serialization
* fixing serialization
* comment cleanup
Adds more tests on tokenizer serialization (tests when adding tokens, special tokens, etc.).
Tokenizer serialization was not thoroughly tested and actually had quite a few holes and bugs; this PR fixes the related issues.
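As a rough illustration of the kind of round-trip check these tests add (the exact test helpers and model names used in the PR may differ), saving and reloading a tokenizer with added tokens should give identical encodings before and after:

```python
import tempfile

from transformers import BertTokenizer

def check_save_and_load_with_added_tokens() -> None:
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_tokens(["new_token_1", "new_token_2"])
    tokenizer.add_special_tokens({"additional_special_tokens": ["<special>"]})

    sample = "This is a new_token_1 with a <special> marker."
    before = tokenizer.encode(sample)

    with tempfile.TemporaryDirectory() as tmpdirname:
        tokenizer.save_pretrained(tmpdirname)
        reloaded = BertTokenizer.from_pretrained(tmpdirname)

    after = reloaded.encode(sample)
    assert before == after, "Encoding changed after a save/load round-trip."
```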