improve saving strategy of sentencepiece tokenizer #15328

Merged

Conversation

@SaulLu SaulLu (Contributor) commented Jan 25, 2022

What does this PR do?

Until now, the slow tokenizers based on sentencepiece needed access to the original files (like xxx/spiece.model) used to initialize them whenever we wanted to save them.

Since version 0.1.91 of sentencepiece, a new method serialized_model_proto is available which allows creating the sentencepiece model file directly from the Python object used by our tokenizer.

This PR proposes to modify all the tokenizers based on sentencepiece to use this new method whenever the original file(s) are no longer accessible. A new test also covers this capability.
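For context, a minimal sketch of how the method behaves on its own (the file paths below are placeholders, not part of this PR):

```python
import sentencepiece as spm

sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")  # placeholder path to an existing model file

# serialized_model_proto() returns the model file contents as bytes, so the
# .model file can be written back out without access to the original path.
with open("recreated_spiece.model", "wb") as f:
    f.write(sp_model.serialized_model_proto())
```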

Additional comments

In this PR I also modified:

  • BartphoTokenizer so that some special tokens are not hardcoded anymore
  • M2M100Tokenizer and MarianTokenizer, so that their saving methods look more like those of the other tokenizers

Motivation

I think that, whenever possible, it is good to be able to save our objects even if some files used during initialization no longer exist.

Moreover, this addition makes it easier to create other tests (like the test in this PR).

Who can review?

Anyone in the community is free to review the PR once the tests have passed. I would particularly love to read your thoughts, @LysandreJik or @sgugger.

@HuggingFaceDocBuilder commented Jan 25, 2022

The documentation is not available anymore as the PR was closed or merged.

Comment on lines +348 to +351
elif not os.path.isfile(self.vocab_file):
    with open(out_vocab_file, "wb") as fi:
        content_spiece_model = self.sp_model.serialized_model_proto()
        fi.write(content_spiece_model)
@SaulLu SaulLu (Contributor Author) Jan 25, 2022

I have "manually" checked that the content was identical between one original file https://huggingface.co/albert-base-v1/blob/main/spiece.model and the file saved with these lines of codes for versions 0.1.91 to 0.1.96 of sentencepiece

Comment on lines 162 to 169
self.fairseq_tokens_to_ids = {
    token: token_id
    for token_id, token in enumerate(
        dict.fromkeys(
            [str(bos_token), str(pad_token), str(eos_token), str(unk_token), str(sep_token), str(cls_token)]
        ).keys()
    )
}
Contributor Author

These are ancillary changes that remove values which were hardcoded in the tokenizer. They are nevertheless relevant to this PR in order to be able to re-generate the monolingual_vocab_file (without risking mistakes) when saving.
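A minimal illustration of the dict.fromkeys deduplication used above (the token values here are hypothetical, not the tokenizer's actual defaults):

```python
# dict.fromkeys keeps insertion order and drops duplicates, so a token shared
# between roles (e.g. "</s>" used for both eos and sep) gets a single id.
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "</s>", "<s>"]
fairseq_tokens_to_ids = {
    token: token_id for token_id, token in enumerate(dict.fromkeys(special_tokens))
}
print(fairseq_tokens_to_ids)  # {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
```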

Comment on lines +174 to +175
if str(mask_token) not in self.fairseq_tokens_to_ids:
    self.fairseq_tokens_to_ids[str(mask_token)] = len(self.fairseq_tokens_to_ids)
Contributor Author

Idem as above

@@ -278,7 +288,7 @@ def _convert_token_to_id(self, token):
         if token in self.fairseq_tokens_to_ids:
             return self.fairseq_tokens_to_ids[token]
         else:
-            return self.fairseq_tokens_to_ids["<unk>"]
+            return self.unk_token_id
Contributor Author

Idem as above

@SaulLu SaulLu requested review from sgugger and LysandreJik January 25, 2022 16:04
@@ -39,6 +39,7 @@ class MBartTokenizationTest(TokenizerTesterMixin, unittest.TestCase):
     tokenizer_class = MBartTokenizer
     rust_tokenizer_class = MBartTokenizerFast
     test_rust_tokenizer = True
+    test_sentencepiece = True
Contributor Author

mbart was not flagged as a sentencepiece tokenizer in the tests until now 🙂

@sgugger sgugger (Collaborator) left a comment

Thanks a lot for working on this, it looks great!

src/transformers/models/bartpho/tokenization_bartpho.py (outdated, resolved)
src/transformers/models/bartpho/tokenization_bartpho.py (outdated, resolved)
…ial tokens

Co-authored-by: Sylvain Gugger <sylvain.gugger@gmail.com>
@SaulLu SaulLu force-pushed the improve-saving-strategy-sentencepiece-tokenizer branch from f4fb2be to ec4ab8c, January 25, 2022 18:54
@LysandreJik LysandreJik (Member) left a comment

Thank you for the fixes, @SaulLu! This looks good to me.

@SaulLu SaulLu merged commit ade7371 into huggingface:master Jan 27, 2022