
Add more tests on tokenizers serialization - fix bugs #5056

Merged: 5 commits merged into master from serial on Jun 24, 2020

Conversation

thomwolf (Member) commented Jun 16, 2020

Adds more tests on tokenizer serialization (tests for adding tokens, special tokens, etc.).

Tokenizer serialization was not thoroughly tested and actually had quite a few holes and bugs. This PR fixes the related issues.
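For context, a minimal sketch of the kind of round-trip check these tests perform (illustrative only, using BertTokenizer as an example; the actual tests are parametrized over all tokenizer classes in the common test suite):

```python
# Illustrative round-trip check: added tokens and special tokens must survive
# save_pretrained / from_pretrained unchanged.
import tempfile

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenizer.add_tokens(["new_token_1", "new_token_2"])
tokenizer.add_special_tokens({"additional_special_tokens": ["<new_special>"]})

sample = "Testing <new_special> serialization with new_token_1."
ids_before = tokenizer.encode(sample)

with tempfile.TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)
    reloaded = BertTokenizer.from_pretrained(tmp_dir)

# The reloaded tokenizer should have the same size and produce the same ids.
assert len(reloaded) == len(tokenizer)
assert reloaded.encode(sample) == ids_before
```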

codecov bot commented Jun 24, 2020

Codecov Report

Merging #5056 into master will increase coverage by 0.05%.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #5056      +/-   ##
==========================================
+ Coverage   77.96%   78.02%   +0.05%     
==========================================
  Files         138      138              
  Lines       23838    23847       +9     
==========================================
+ Hits        18585    18606      +21     
+ Misses       5253     5241      -12     
| Impacted Files | Coverage Δ |
|---|---|
| src/transformers/tokenization_utils.py | 92.47% <100.00%> (+0.96%) ⬆️ |
| src/transformers/tokenization_utils_base.py | 92.82% <100.00%> (+1.95%) ⬆️ |
| src/transformers/tokenization_utils_fast.py | 94.28% <100.00%> (+2.31%) ⬆️ |
| src/transformers/trainer.py | 38.38% <0.00%> (-1.19%) ⬇️ |
| src/transformers/modeling_tf_utils.py | 86.00% <0.00%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b28b537...8f87f25.

# until the serialization of Fast tokenizers is updated
self.added_tokens_encoder: Dict[str, int] = {}
self.added_tokens_decoder: Dict[int, str] = {}
self.unique_no_split_tokens: List[str] = []
thomwolf (Member, Author) commented Jun 24, 2020

Some of the tokens we want to avoid splitting on are actually not added tokens but tokens already in the base vocabulary (e.g. [MASK] is in the Albert vocab, but if we don't take special care of it, it will be split by SentencePiece magic into [, MASK, ] 🙃).

I renamed this internal variable to make this clearer.
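To illustrate the idea (a hedged sketch, not the library's actual implementation): protecting a no-split token essentially means chunking the text around it first, so the underlying model (e.g. SentencePiece) never sees it as plain text.

```python
# Sketch of the "no-split" idea: split the text around protected tokens so the
# underlying tokenizer only ever sees the remaining plain-text pieces.
import re
from typing import List


def split_on_no_split_tokens(text: str, no_split_tokens: List[str]) -> List[str]:
    # Hypothetical helper for illustration; not the transformers implementation.
    if not no_split_tokens:
        return [text]
    pattern = "(" + "|".join(re.escape(tok) for tok in no_split_tokens) + ")"
    return [chunk for chunk in re.split(pattern, text) if chunk]


print(split_on_no_split_tokens("Paris is the [MASK] of France.", ["[MASK]"]))
# ['Paris is the ', '[MASK]', ' of France.']
```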

LysandreJik (Member)

That's clean!

Comment on lines +1298 to +1301
assert index == len(tokenizer), (
f"Non-consecutive added token '{token}' found. "
f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
)
thomwolf (Member, Author)

This will now raise an error if non-consecutive tokens are provided in the serialized vocabulary.

I contemplated making this a warning only, but I think it's better to enforce good practices here than to keep backward compatibility. If your vocabulary has "holes" in it, something went wrong somewhere, and reassigning the token to a new index would be a source of silent errors.
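A hedged sketch of where this check sits when re-loading a saved added-tokens mapping (hypothetical helper and file names; the real logic lives in the from_pretrained loading path):

```python
# Sketch: when restoring a saved {token: index} mapping, every token must land
# exactly on the next free id; a "hole" means the saved vocabulary is corrupt.
import json


def load_added_tokens(tokenizer, added_tokens_file: str) -> None:
    with open(added_tokens_file, encoding="utf-8") as f:
        added_tok_encoder = json.load(f)  # {token: index}
    # Restore tokens in index order so ids are reassigned deterministically.
    for token, index in sorted(added_tok_encoder.items(), key=lambda kv: kv[1]):
        assert index == len(tokenizer), (
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
        )
        tokenizer.add_tokens([token])
```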

LysandreJik (Member)

Won't this fail with CTRL? I recall hearing that CTRL had such an issue

n1t0 (Member)

It works the same in tokenizers, except that I went with warnings. My thinking was: if this is happening, it means the user either modified the file manually or got it from someone else, so she will try to load it and will see the warnings right away, since this happens at the very beginning. I could make this an error too quite easily though, as I agree that this is probably better!
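For comparison, a hypothetical sketch of the warning-based variant described above (illustrative names; not the tokenizers library's code):

```python
# Sketch: report non-consecutive indices as warnings instead of failing hard.
import warnings
from typing import Dict


def check_consecutive(added_tok_encoder: Dict[str, int], vocab_size: int) -> None:
    expected = vocab_size
    for token, index in sorted(added_tok_encoder.items(), key=lambda kv: kv[1]):
        if index != expected:
            warnings.warn(
                f"Non-consecutive added token '{token}': expected index "
                f"{expected} but found {index} in the saved vocabulary."
            )
        expected = index + 1
```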

thomwolf requested review from LysandreJik and n1t0 June 24, 2020 11:30
thomwolf marked this pull request as ready for review June 24, 2020 11:31
LysandreJik (Member) left a comment

This is very cool. Great to have some new tests!


@@ -156,28 +156,62 @@ def test_tokenizers_common_properties(self):

def test_save_and_load_tokenizer(self):
LysandreJik (Member)

Maybe this test could be split into a few different tests later on? It's starting to get a bit thick.

n1t0 (Member) left a comment

LGTM!


thomwolf merged commit 7ac9110 into master Jun 24, 2020
thomwolf deleted the serial branch June 24, 2020 19:53
jplu pushed a commit to jplu/transformers that referenced this pull request Jun 29, 2020
* update tests for fast tokenizers + fix small bug in saving/loading

* better tests on serialization

* fixing serialization

* comment cleanup