[WIP] Fast tokenizer for debertaV2 #14928
Conversation
```diff
- index of the token comprising a given character or the span of characters corresponding to a given token). Currently
- no "Fast" implementation is available for the SentencePiece-based tokenizers (for T5, ALBERT, CamemBERT, XLM-RoBERTa
- and XLNet models).
+ index of the token comprising a given character or the span of characters corresponding to a given token).
```
Note: this change is not technically related to the PR, but I found this bit of the documentation to be outdated. If you want a separate PR, lmk.
I noticed that while I was working on my PR, another was submitted for the same purpose: #14923.
Hey @alcinos, thanks for adding it! I'm currently running comparisons between the slow and fast tokenizers; I just ran tokenization tests with the following script:

```python
import sys

from transformers import DebertaV2Tokenizer, DebertaV2TokenizerFast

model_name = "microsoft/deberta-v2-xlarge"

slow_tokenizer = DebertaV2Tokenizer.from_pretrained(model_name)
fast_tokenizer = DebertaV2TokenizerFast.from_pretrained(model_name)

filename = sys.argv[1]

with open(filename, "rt") as f_p:
    for line in f_p:
        line = line.rstrip()
        if not line:
            continue

        slow_tokens = slow_tokenizer.tokenize(line)
        fast_tokens = fast_tokenizer.tokenize(line)

        if slow_tokens != fast_tokens:
            print("Tokenization mismatch:", line)
            print("Slow tokens:", slow_tokens)
            print("Fast tokens:", fast_tokens)
```

Here are some mismatches:

Original input: …

Another example, original input: …

The original DeBERTa tokenizer outputs the same tokens as the slow tokenizer.
Thank you so much for working to resolve this issue @alcinos! I see that you have already understood how our tokenizers were designed: that's really great! 😄
I took the liberty of leaving a little comment on a particular point of this tokenizer; don't hesitate to ask if you need more information about it.
By the way, I noticed that you said "This is a draft as there are some failing tests (not super clear to me why atm, will have to investigate)". Are all of the failing tests unclear to you, or only some? In any case, we can list the failing tests together in this PR and discuss how to solve them.
PS: I should warn you that I will probably not be very active on GitHub this week, as I'm on vacation.
```python
do_lower_case=False,
split_by_punct=False,
```
It seems to me that these arguments will require changing, respectively, the normalizer and the pre_tokenizer of the backend_tokenizer object.
To start with, I would advise adding a specific test for these arguments to check that the tokenization is identical between the slow tokenizer and the fast tokenizer for all possible values of these arguments.
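A minimal sketch of such a test (the test name and sample sentence are hypothetical; it assumes the checkpoint and the fast class introduced in this PR):

```python
import itertools

from transformers import DebertaV2Tokenizer, DebertaV2TokenizerFast


def test_slow_and_fast_match_for_all_argument_values():
    text = "I was born in 92000, and this is falsé!"
    for do_lower_case, split_by_punct in itertools.product([False, True], repeat=2):
        slow = DebertaV2Tokenizer.from_pretrained(
            "microsoft/deberta-v2-xlarge", do_lower_case=do_lower_case, split_by_punct=split_by_punct
        )
        fast = DebertaV2TokenizerFast.from_pretrained(
            "microsoft/deberta-v2-xlarge", do_lower_case=do_lower_case, split_by_punct=split_by_punct
        )
        # The slow and fast tokenizers should agree for every combination of the two arguments.
        assert slow.tokenize(text) == fast.tokenize(text), (do_lower_case, split_by_punct)
```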
Hello @SaulLu
Thanks a lot for your review.
As for the do_lower_case arg, I followed Albert's tokenizer. As you mentioned, Albert modifies the normalizer in the converter:
transformers/src/transformers/convert_slow_tokenizer.py (lines 502 to 511 in 501307b):

```python
def normalizer(self, proto):
    list_normalizers = [
        normalizers.Replace("``", '"'),
        normalizers.Replace("''", '"'),
    ]
    if not self.original_tokenizer.keep_accents:
        list_normalizers.append(normalizers.NFKD())
        list_normalizers.append(normalizers.StripAccents())
    if self.original_tokenizer.do_lower_case:
        list_normalizers.append(normalizers.Lowercase())
```
which I duplicated in my PR:
transformers/src/transformers/convert_slow_tokenizer.py (lines 825 to 826 in c6f2e1b):

```python
if self.original_tokenizer.do_lower_case:
    list_normalizers.append(normalizers.Lowercase())
```
Eyeballing the init method of PreTrainedTokenizerFast makes me believe the creation process always involves using the said slow->fast conversion method, so that should be covered?
As for split_by_punct, we could take the same approach and overload the pre-tokenizer method of the converter? Would a sequence [MetaSpace, Punct] do the trick? I'm a bit uncertain here, since no other converter seems to deal with punctuation splitting, so maybe I'm misunderstanding this.
EDIT: forgot to mention, tests are indeed a good idea, will see what is the best way to test this behavior.
> Eyeballing the init method of PreTrainedTokenizerFast makes me believe the creation process always involves using the said slow->fast conversion method, so that should be covered?
Indeed, for the first use case we initialize the fast tokenizer by using the conversion script. However, we also want to be able to initialize this fast tokenizer from the fast files only. It will thus be necessary to also modify the backend_tokenizer in the __init__ method. If it is useful, here are the lines where it is done for Bert. Be careful: Bert has a custom normalizer just for it, so we would have to adapt these lines to the normalizer of deberta-v2.
(Note that you made me notice that we should have the same kind of thing for Albert; I'll open a new PR for that.)
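As a rough illustration of the idea (a sketch under my own assumptions, not the Bert code nor the final deberta-v2 normalizer), the backend normalizer of a loaded fast tokenizer can be swapped so that do_lower_case takes effect even when no slow->fast conversion happened:

```python
from tokenizers import normalizers
from transformers import DebertaV2TokenizerFast

tok = DebertaV2TokenizerFast.from_pretrained("microsoft/deberta-v2-xlarge")

# Rebuild the backend normalizer so lowercasing is applied; in the real __init__ this
# would be driven by the do_lower_case argument, and the other normalizers produced by
# the converter would be kept in the Sequence (they are omitted here for brevity).
tok.backend_tokenizer.normalizer = normalizers.Sequence([normalizers.Lowercase()])
```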
> As for split_by_punct, we could take the same approach and overload the pre-tokenizer method of the converter? Would a sequence [MetaSpace, Punct] do the trick? I'm a bit uncertain here, since no other converter seems to deal with punctuation splitting, so maybe I'm misunderstanding this.
It is indeed exactly the same approach that I would have tested first. However, I can't confirm that the pre_tokenizers.Punctuation module behaves exactly like the slow tokenizer feature. But some tests should answer this question 😄
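For split_by_punct, a possible converter override could look roughly like the following sketch (assuming that pre_tokenizers.Punctuation with the "isolated" behavior matches the slow tokenizer, which the tests would have to confirm):

```python
from tokenizers import pre_tokenizers
from transformers.convert_slow_tokenizer import SpmConverter


class DebertaV2ConverterSketch(SpmConverter):
    # Sketch only: overload the pre-tokenizer so punctuation is optionally isolated
    # before the usual SentencePiece Metaspace step.
    def pre_tokenizer(self, replacement, add_prefix_space):
        steps = []
        if self.original_tokenizer.split_by_punct:
            steps.append(pre_tokenizers.Punctuation(behavior="isolated"))
        steps.append(pre_tokenizers.Metaspace(replacement=replacement, add_prefix_space=add_prefix_space))
        return pre_tokenizers.Sequence(steps)
```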
@stefan-it Thanks for looking into this and providing the test cases. It seems that the issues you are reporting are all related to unknown tokens? I don't know the Rust implementation well enough; is there any reason why the fast tokenizer would not respect the vocabulary?
Hey @alcinos, I'm currently trying to figure it out :)
Good news: when using … For the slow tokenizer, this is happening here:

transformers/src/transformers/models/deberta_v2/tokenization_deberta_v2.py (lines 326 to 330 in 501307b)
Also, when using normal T5 (slow and fast), there are no UNKs when using the …
@alcinos I think the issue has something to do with the tokenize function inherited from PreTrainedTokenizerFast, specifically this line:

transformers/src/transformers/tokenization_utils_fast.py (lines 316 to 317 in 501307b)

The .tokens() call at the end of that line doesn't return the [UNK] token but rather the token itself, which causes the discrepancy here. This may not be entirely correct, but I was able to fix the discrepancy in the testing script @stefan-it provided by overriding the tokenize function and adding this method to the DebertaV2TokenizerFast class.
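A hedged guess at what such an override could look like (a sketch of the idea, not necessarily the exact method from that comment): map the ids back through the vocabulary so that out-of-vocabulary pieces come out as [UNK], matching the slow tokenizer.

```python
from typing import List


# Meant to be added to DebertaV2TokenizerFast (sketch only).
def tokenize(self, text: str, **kwargs) -> List[str]:
    # Encode with the Rust backend, then convert ids back to tokens so unknown
    # pieces are reported as the unk token instead of the raw surface piece.
    return self.convert_ids_to_tokens(self.encode(text, add_special_tokens=False))
```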
But the other tests are still failing, and I'm not sure what's causing the issue; I need to investigate further.
Thanks @stefan-it and @mingboiz for looking into the tokenization issue. If I summarize the findings so far:
I'm not sure what the expected behavior should be, nor whether we should be concerned about this in the first place. Input from someone from HF would be appreciated :) (ping @SaulLu)

Aside from that, I pushed some fixes; more tests are passing now.

Some feedback for the HF team on the issue I ran into:

transformers/tests/test_tokenization_common.py (lines 626 to 627 in 10fd4fa)

This is problematic in my opinion since: …

I would suggest one of the following changes to make this more dev-friendly: …

More tests are failing: one most likely has the same root cause as the issue raised by @stefan-it. The others seem to be failing because in some code paths the vocab_file is None, but it's not clear to me why that happens; any help on that would be appreciated.
@alcinos I can't figure out a solution yet, but I think the tests failing because of the missing vocab file do so because legacy_format=False is selected in all of them:

transformers/tests/test_tokenization_common.py (line 3513 in f80775d)

With that setting, only the Rust tokenizer files are saved, without the spm.model vocab file, and this code chunk runs instead, without using the save_vocabulary function:

transformers/src/transformers/tokenization_utils_fast.py (lines 578 to 583 in f80775d)

DebertaV2Tokenizer didn't have this issue because its backend SPMTokenizer class provides its own save_pretrained method.
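For context, this is the save path being discussed; the output directory name here is just a placeholder:

```python
from transformers import DebertaV2TokenizerFast

tok = DebertaV2TokenizerFast.from_pretrained("microsoft/deberta-v2-xlarge")

# With legacy_format=False, only the Rust backend files (e.g. tokenizer.json) are written
# and save_vocabulary is skipped, so the sentencepiece spm.model is not copied alongside.
tok.save_pretrained("deberta-v2-fast-out", legacy_format=False)
```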
I have noted the ping! I'll come back to you as soon as possible on this subject, because the choice to be made here is not obvious: you have highlighted that the …
Thanks again for all your work; I'm sorry I couldn't be very responsive last week.
So, to go over a few points:
> I'm not sure what the expected behavior should be, nor whether we should be concerned about this in the first place. Input from someone from HF would be appreciated :)
@alcinos, @mingboiz, one of the maintainers also shares my opinion: we think it would be better to modify the tokenize method of the slow version of the tokenizer so that it behaves like the one of the other tokenizers.
> Some feedback for the HF team on the issue I ran into:
@alcinos, thanks a lot for sharing your opinion; it's extremely useful to know the difficulties you encounter.
Indeed, when you add a new tokenizer, an error like the one you mention with do_lower_case happens quickly, because not all the tests in test_tokenization_common.py are useful for every tokenizer (and some can pass silently). My advice is to voluntarily introduce an error (raise ValueError("to delete")) in the most used methods (like encode, tokenize and __call__) in order to see which tests are really executed and which are not, and then to check that it is normal that some tests are not executed. A sketch of this trick is shown below.
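For illustration, the trick is just a temporary raise dropped into the tokenizer class under test (to be deleted afterwards):

```python
# Temporary debugging aid: any common test that really exercises tokenize()
# will now fail loudly, which shows which tests are actually executed.
def tokenize(self, text, **kwargs):
    raise ValueError("to delete")
```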
Unfortunately, not all tokenizers in the library have a do_lower_case argument, and that's why this is not a hard requirement.
> Additionally, IMHO this would be better suited as an overridable getter method rather than a direct access to a private attribute
@alcinos , I'm not sure what you are referring to here. What private attribute are you talking about? 🙂
> DebertaV2TokenizationTest.test_saving_tokenizer_trainer test fails
I'll try to find out more about what's going on. From what I have seen, this is due to the fact that on the first save, the key and value 'tokenizer_file': None are saved in the tokenizer_config.json (so they are not overridden afterwards because of this line). What I need to check again is why this key and value are saved with deberta-v2 but not with albert.
```python
if not os.path.isfile(vocab_file):
    raise ValueError(
        f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
        "model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
    )
```
I think these lines should be removed. For all tokenizers that have both a slow and a fast version, we want to keep the possibility of initializing the tokenizer from either type of files: the slow-version files or the fast-version files. It seems to me that these lines would prevent initializing a deberta-v2 fast tokenizer from fast files only.
Suggested change (remove these lines):

```diff
- if not os.path.isfile(vocab_file):
-     raise ValueError(
-         f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
-         "model use `tokenizer = AutoTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`"
-     )
```
I think that removing these lines could solve the current problem with test_training_new_tokenizer_with_special_tokens_change.
```python
do_lower_case=False,
split_by_punct=False,
```
@alcinos and @mingboiz, while investigating the …

With these two changes, the test now passes 😄! I have opened a PR here to show you the changes that should be made to solve these problems. Feel free to merge it if you agree with it. 🙂
```python
)

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
    copyfile(self.vocab_file, out_vocab_file)
```
Can you please allow the tokenizer to also be saved if the file it was loaded from has been removed? (See tokenization_albert.py for an example.)
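For reference, the pattern in the slow Albert tokenizer is roughly the following sketch (paraphrased, not the exact library lines; VOCAB_FILES_NAMES is the tokenizer file's module-level constant): if the original file is gone, the vocabulary is re-serialized from the in-memory sentencepiece model instead of being copied.

```python
import os
from shutil import copyfile


def save_vocabulary(self, save_directory, filename_prefix=None):
    out_vocab_file = os.path.join(
        save_directory,
        (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"],
    )
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
        copyfile(self.vocab_file, out_vocab_file)
    elif not os.path.isfile(self.vocab_file):
        # The file we loaded from was removed: write the serialized sentencepiece model instead.
        with open(out_vocab_file, "wb") as fi:
            fi.write(self.sp_model.serialized_model_proto())
    return (out_vocab_file,)
```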
That's a really good point!
Isn't this already supported in deberta-v2? Specifically lines 481 and 482:
transformers/src/transformers/models/deberta_v2/tokenization_deberta_v2.py (lines 476 to 483 in 5c2d839):

```python
def save_pretrained(self, path: str, filename_prefix: str = None):
    filename = VOCAB_FILES_NAMES[list(VOCAB_FILES_NAMES.keys())[0]]
    if filename_prefix is not None:
        filename = filename_prefix + "-" + filename
    full_path = os.path.join(path, filename)
    with open(full_path, "wb") as fs:
        fs.write(self.spm.serialized_model_proto())
    return (full_path,)
```
Hi @alcinos, thank you very much for your work, the addition seems to be nearly done! Please let me know if you need help with any of it!
Finished in PR #15529. Thanks all again for the contribution 🤗
What does this PR do?
Implements a fast tokenizer for deberta v2. Loosely based on #11387
Fixes #11529
Fixes #14712
This is a draft as there are some failing tests (not super clear to me why atm, will have to investigate)
Who can review?
@LysandreJik