improve saving strategy of sentencepiece tokenizer #15328
Conversation
The documentation is not available anymore as the PR was closed or merged.
elif not os.path.isfile(self.vocab_file):
    with open(out_vocab_file, "wb") as fi:
        content_spiece_model = self.sp_model.serialized_model_proto()
        fi.write(content_spiece_model)
I have "manually" checked that the content was identical between one original file https://huggingface.co/albert-base-v1/blob/main/spiece.model and the file saved with these lines of codes for versions 0.1.91
to 0.1.96
of sentencepiece
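For reference, a minimal sketch of that check (the local file path is a placeholder for a downloaded copy of spiece.model):

import sentencepiece as spm

# Load the sentencepiece model from the original file.
sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")

# serialized_model_proto() returns the model as bytes; compare it
# byte-for-byte with the original file on disk.
with open("spiece.model", "rb") as f:
    original_bytes = f.read()

assert sp_model.serialized_model_proto() == original_bytes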
self.fairseq_tokens_to_ids = {
    token: token_id
    for token_id, token in enumerate(
        dict.fromkeys(
            [str(bos_token), str(pad_token), str(eos_token), str(unk_token), str(sep_token), str(cls_token)]
        ).keys()
    )
}
These are ancillary changes to remove values that were hardcoded in the tokenizer. They are nevertheless relevant for this PR, since they make it possible to re-generate the monolingual_vocab_file file (without risking mistakes) when saving.
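To illustrate why dict.fromkeys is used here (a toy example with hypothetical token strings, not the actual Bartpho defaults): it deduplicates while preserving insertion order, so special tokens that share the same string end up with a single id:

# dict.fromkeys keeps the first occurrence of each key in order (Python 3.7+),
# so duplicate special-token strings collapse to a single entry/id.
tokens = ["<s>", "<pad>", "</s>", "<unk>", "</s>", "<s>"]
unique = list(dict.fromkeys(tokens).keys())
print(unique)  # ['<s>', '<pad>', '</s>', '<unk>']

fairseq_tokens_to_ids = {token: token_id for token_id, token in enumerate(unique)}
print(fairseq_tokens_to_ids)  # {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}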
if str(mask_token) not in self.fairseq_tokens_to_ids:
    self.fairseq_tokens_to_ids[str(mask_token)] = len(self.fairseq_tokens_to_ids)
Same as above.
@@ -278,7 +288,7 @@ def _convert_token_to_id(self, token):
     if token in self.fairseq_tokens_to_ids:
         return self.fairseq_tokens_to_ids[token]
     else:
-        return self.fairseq_tokens_to_ids["<unk>"]
+        return self.unk_token_id
Same as above.
@@ -39,6 +39,7 @@ class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = MBartTokenizer
     rust_tokenizer_class = MBartTokenizerFast
     test_rust_tokenizer = True
+    test_sentencepiece = True
mbart was not flagged as a sentencepiece tokenizer in the tests until now 🙂
Thanks a lot for working on this, it looks great!
Thank you for the fixes, @SaulLu! This looks good to me.
What does this PR do?
Until now, the slow tokenizers based on sentencepiece needed access to the original files (like xxx/spiece.model) that had been used to initialize them in order to save them. Since version 0.1.91 of sentencepiece, there is a new method, serialized_model_proto, which allows creating the sentencepiece model file directly from the Python object used by our tokenizer. This PR proposes to modify all the tokenizers based on sentencepiece to use this new method when the original file(s) are not accessible anymore. A new test also covers this new capability.
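Condensed, the resulting save logic looks roughly like the sketch below. It assumes the usual slow-tokenizer attributes (self.vocab_file, self.sp_model) and the standard save_vocabulary signature; the details vary per tokenizer, so treat this as an illustration rather than the exact merged code:

import os
from shutil import copyfile

def save_vocabulary(self, save_directory, filename_prefix=None):
    out_vocab_file = os.path.join(
        save_directory, (filename_prefix + "-" if filename_prefix else "") + "spiece.model"
    )
    if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
        # The original file still exists: copy it, as before.
        copyfile(self.vocab_file, out_vocab_file)
    elif not os.path.isfile(self.vocab_file):
        # The original file is gone: rebuild it from the in-memory model
        # via sentencepiece >= 0.1.91's serialized_model_proto().
        with open(out_vocab_file, "wb") as fi:
            content_spiece_model = self.sp_model.serialized_model_proto()
            fi.write(content_spiece_model)
    return (out_vocab_file,)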
Additional comments
In this PR I also modified:
- BartphoTokenizer, so that some special tokens are not hardcoded anymore
- M2M100Tokenizer and MarianTokenizer, so that their saving method looks more like the other tokenizers

Motivation
I think that, when possible, it is good to be able to save our objects even if some of the files used during initialization no longer exist.
Moreover, this addition makes it easier to create other tests (like the test added in this PR).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. I would particularly love to read your thoughts, @LysandreJik or @sgugger.