Add more tests on tokenizers serialization - fix bugs #5056
Conversation
Codecov Report

@@            Coverage Diff             @@
##           master    #5056      +/-   ##
==========================================
+ Coverage   77.96%   78.02%   +0.05%
==========================================
  Files         138      138
  Lines       23838    23847       +9
==========================================
+ Hits        18585    18606      +21
+ Misses       5253     5241      -12

Continue to review full report at Codecov.
# until the serialization of Fast tokenizers is updated
self.added_tokens_encoder: Dict[str, int] = {}
self.added_tokens_decoder: Dict[int, str] = {}
self.unique_no_split_tokens: List[str] = []
Some of the tokens we want to avoid splitting on are actually not added tokens but tokens already in the base vocabulary (e.g. `[MASK]` is in the Albert vocab, but if we don't take special care of it, it will be split by SentencePiece magic into `[`, `MASK`, `]` 🙃).
I renamed this internal variable to make this clearer.
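To illustrate the idea, here is a hedged sketch (not the actual transformers implementation; `tokenize_with_no_split` and `sp_tokenize` are hypothetical names): protected tokens are cut out of the text before the SentencePiece model ever sees them.

```python
import re
from typing import Callable, List

def tokenize_with_no_split(
    text: str, no_split_tokens: List[str], sp_tokenize: Callable[[str], List[str]]
) -> List[str]:
    # Capture the protected tokens in the split pattern so they survive as chunks.
    pattern = "(" + "|".join(re.escape(t) for t in no_split_tokens) + ")"
    tokens: List[str] = []
    for chunk in re.split(pattern, text):
        if not chunk:
            continue
        if chunk in no_split_tokens:
            tokens.append(chunk)  # emit e.g. "[MASK]" untouched
        else:
            tokens.extend(sp_tokenize(chunk))  # everything else goes through SentencePiece
    return tokens

# Example with a stand-in for the SentencePiece tokenizer:
print(tokenize_with_no_split("hello [MASK] world", ["[MASK]"], str.split))
# ['hello', '[MASK]', 'world']
```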
That's clean!
assert index == len(tokenizer), (
    f"Non-consecutive added token '{token}' found. "
    f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
)
This will now raise an error if non-consecutive tokens are provided in the serialized vocabulary.
I contemplated making this a warning only, but I think it's better to enforce good practices here than to keep backward compatibility. If your vocabulary has "holes" in it, something went wrong somewhere, and reassigning the token to a new index will be a source of silent errors.
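For context, a minimal sketch of the loading loop this assert lives in (the file handling and names such as `load_added_tokens` and `added_tok_encoder` are illustrative, not the exact transformers internals):

```python
import json

def load_added_tokens(tokenizer, added_tokens_file: str) -> None:
    with open(added_tokens_file, encoding="utf-8") as f:
        added_tok_encoder = json.load(f)  # maps token string -> saved index
    # Re-add tokens in index order so each one lands on the index it was saved with.
    for token, index in sorted(added_tok_encoder.items(), key=lambda item: item[1]):
        assert index == len(tokenizer), (
            f"Non-consecutive added token '{token}' found. "
            f"Should have index {len(tokenizer)} but has index {index} in saved vocabulary."
        )
        tokenizer.add_tokens([token])
```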
Won't this fail with CTRL? I recall hearing that CTRL had such an issue
It works the same in `tokenizers`, except that I went with warnings. My thinking was: if this is happening, it means the user just modified the file manually or got it from someone else, so she will try to load it and see the warnings right away, since this happens at the very beginning.
Could make this an error too quite easily though, as I agree that this is probably better!
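If one wanted the warning behaviour described here instead of a hard error, the check could look roughly like this (a hypothetical variant, not what the PR merged):

```python
import logging

logger = logging.getLogger(__name__)

def check_added_token_index(token: str, index: int, expected: int) -> None:
    # Warn instead of raising, so a vocabulary with "holes" still loads,
    # at the cost of silently reassigning the token to a new index.
    if index != expected:
        logger.warning(
            "Non-consecutive added token '%s' found: expected index %d but got %d.",
            token, expected, index,
        )
```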
This is very cool. Great to have some new tests!
@@ -156,28 +156,62 @@ def test_tokenizers_common_properties(self):

    def test_save_and_load_tokenizer(self):
Maybe this test could be split into a few different tests later on? It's starting to get a bit thick.
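One hypothetical way it could be split later (test names are illustrative only, not from the PR):

```python
import unittest

class TokenizerSerializationTest(unittest.TestCase):
    def test_save_and_load_plain_tokenizer(self):
        ...

    def test_save_and_load_with_added_tokens(self):
        ...

    def test_save_and_load_with_added_special_tokens(self):
        ...
```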
LGTM!
* update tests for fast tokenizers + fix small bug in saving/loading
* better tests on serialization
* fixing serialization
* comment cleanup
Adds more tests on tokenizer serialization (tests when adding tokens, special tokens, etc.).
Tokenizer serialization was not thoroughly tested and actually had quite a few holes and bugs; this PR fixes the related issues.
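As a rough illustration of the kind of round-trip check these tests add (the exact test helpers and model names used in the PR may differ), saving and reloading a tokenizer with added tokens should give identical encodings before and after:

```python
import tempfile

from transformers import BertTokenizer

def check_save_and_load_with_added_tokens() -> None:
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    tokenizer.add_tokens(["new_token_1", "new_token_2"])
    tokenizer.add_special_tokens({"additional_special_tokens": ["<special>"]})

    sample = "This is a new_token_1 with a <special> marker."
    before = tokenizer.encode(sample)

    with tempfile.TemporaryDirectory() as tmpdirname:
        tokenizer.save_pretrained(tmpdirname)
        reloaded = BertTokenizer.from_pretrained(tmpdirname)

    after = reloaded.encode(sample)
    assert before == after, "Encoding changed after a save/load round-trip."
```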