improve saving strategy of sentencepiece tokenizer #15328

Merged

Conversation

@SaulLu SaulLu (Contributor) commented Jan 25, 2022

What does this PR do?

Until now, the slow tokenizers based on sentencepiece needed access to the original files (like xxx/spiece.model) used to initialize them whenever we wanted to save them.

Since version 0.1.91 of sentencepiece, a new method serialized_model_proto is available which allows creating the sentencepiece model file directly from the Python object used by our tokenizer.

This PR proposes to modify all the tokenizers based on sentencepiece to use this new method whenever the original file(s) are no longer accessible. A new test also covers this capability.
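For context, a minimal sketch of how the method behaves on its own (the file paths below are placeholders, not part of this PR):

```python
import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")  # placeholder path to an existing model file

# serialized_model_proto() returns the model file contents as bytes, so the
# .model file can be written back out without access to the original path.
with open("recreated_spiece.model", "wb") as f:
    f.write(sp_model.serialized_model_proto())
```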

Additional comments

In this PR I also modified:

  • BartphoTokenizer so that some special tokens are not hardcoded anymore
  • M2M100Tokenizer and MarianTokenizer, so that their saving methods look more like those of the other tokenizers

Motivation

I think that, whenever possible, it is good to be able to save our objects even if some files used during initialization no longer exist.

Moreover, this addition makes it easier to create other tests (like the test in this PR).

Who can review?

Anyone in the community is free to review the PR once the tests have passed. I would particularly love to read your thoughts, @LysandreJik or @sgugger.

@HuggingFaceDocBuilder commented Jan 25, 2022

The documentation is not available anymore as the PR was closed or merged.

Comment on lines +348 to +351
elif not os.path.isfile(self.vocab_file):
    with open(out_vocab_file, "wb") as fi:
        content_spiece_model = self.sp_model.serialized_model_proto()
        fi.write(content_spiece_model)
@SaulLu SaulLu (Contributor Author) Jan 25, 2022

I have "manually" checked that the content was identical between one original file https://huggingface.co/albert-base-v1/blob/main/spiece.model and the file saved with these lines of codes for versions 0.1.91 to 0.1.96 of sentencepiece

Comment on lines 162 to 169
self.fairseq_tokens_to_ids = {
    token: token_id
    for token_id, token in enumerate(
        dict.fromkeys(
            [str(bos_token), str(pad_token), str(eos_token), str(unk_token), str(sep_token), str(cls_token)]
        ).keys()
    )
}
Contributor Author

These are ancillary changes that remove values which were hardcoded in the tokenizer. They are nevertheless relevant to this PR in order to be able to re-generate the monolingual_vocab_file (without risking mistakes) when saving.
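A minimal illustration of the dict.fromkeys deduplication used above (the token values here are hypothetical, not the tokenizer's actual defaults):

```python
# dict.fromkeys keeps insertion order and drops duplicates, so a token shared
# between roles (e.g. "</s>" used for both eos and sep) gets a single id.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "</s>", "<s>"]
fairseq_tokens_to_ids = {
    token: token_id for token_id, token in enumerate(dict.fromkeys(special_tokens))
}
print(fairseq_tokens_to_ids)  # {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
```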

Comment on lines +174 to +175
if str(mask_token) not in self.fairseq_tokens_to_ids:
    self.fairseq_tokens_to_ids[str(mask_token)] = len(self.fairseq_tokens_to_ids)
Contributor Author

Idem as above

@@ -278,7 +288,7 @@ def _convert_token_to_id(self, token):
         if token in self.fairseq_tokens_to_ids:
             return self.fairseq_tokens_to_ids[token]
         else:
-            return self.fairseq_tokens_to_ids["<unk>"]
+            return self.unk_token_id
Contributor Author

Idem as above

@SaulLu SaulLu requested review from sgugger and LysandreJik January 25, 2022 16:04
@@ -39,6 +39,7 @@ class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = MBartTokenizer
     rust_tokenizer_class = MBartTokenizerFast
     test_rust_tokenizer = True
+    test_sentencepiece = True
Contributor Author

mbart was not flagged as a sentencepiece tokenizer in the tests until now 🙂

@sgugger sgugger (Collaborator) left a comment

Thanks a lot for working on this, it looks great!

src/transformers/models/bartpho/tokenization_bartpho.py (outdated, resolved)
src/transformers/models/bartpho/tokenization_bartpho.py (outdated, resolved)
…ial tokens

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
@SaulLu SaulLu force-pushed the improve-saving-strategy-sentencepiece-tokenizer branch from f4fb2be to ec4ab8c, January 25, 2022 18:54
@LysandreJik LysandreJik (Member) left a comment

Thank you for the fixes, @SaulLu! This looks good to me.

@SaulLu SaulLu merged commit ade7371 into huggingface:master Jan 27, 2022