
Convert SentencePieceTokenizer and associated models to new assets paradigm #1323

Merged
merged 11 commits on Nov 21, 2023

Conversation

nkovela1 (Collaborator)

This PR converts SentencePieceTokenizer to use the save_assets/load_assets logic and converts the following models to defer their associated vocabulary logic to the tokenizer:

  • Albert
  • DebertaV3
  • FNet
  • XLMRoberta
  • T5

@nkovela1 nkovela1 requested a review from mattdangerw November 21, 2023 07:07
@mattdangerw (Member) left a comment

Looks good! Found a few things.

```python
def set_vocabulary(self, proto):
    super().set_vocabulary(proto)
    if proto is not None:
        for token in [self.pad_token]:
```
mattdangerw (Member):

Is this a preexisting bug, that self.end_token was not included here?

nkovela1 (Collaborator, Author):

You're right, this is likely a preexisting bug. Fixed!
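To illustrate the fix, here is a minimal, hypothetical sketch of what the corrected validation looks like: the loop now checks end_token alongside pad_token. The class name, token values, and the plain `vocabulary` list are stand-ins for illustration, not the actual keras-nlp implementation.

```python
class ModelTokenizerSketch:
    """Hypothetical stand-in for a model tokenizer such as AlbertTokenizer."""

    pad_token = "<pad>"
    end_token = "</s>"

    def __init__(self, vocabulary):
        # `vocabulary` stands in for the tokens decoded from the proto.
        self.vocabulary = vocabulary

    def set_proto(self, proto):
        if proto is not None:
            # The fix: validate end_token as well as pad_token.
            for token in [self.pad_token, self.end_token]:
                if token not in self.vocabulary:
                    raise ValueError(
                        f"Cannot find token `{token}` in the provided vocabulary."
                    )
```

With only pad_token in the list, a vocabulary missing `</s>` would pass validation silently; checking both tokens surfaces the error at set time.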

```python
path = os.path.join(dir_path, VOCAB_FILENAME)
self.set_vocabulary(path)

def set_vocabulary(self, proto):
```
mattdangerw (Member):

I think we should just call this set_proto. A little weird, but it matches the init arg naming we have today. Also, this should not be set to the result of get_vocabulary; that would blow up.

```
word_piece_tokenizer.set_vocabulary
sentence_piece_tokenizer.set_proto
byte_pair_tokenizer.set_vocabulary_and_merges
```

nkovela1 (Collaborator, Author):

Got it, I've switched them to set_proto throughout. Thanks!

```python
def save_assets(self, dir_path):
    path = os.path.join(dir_path, VOCAB_FILENAME)
    with open(path, "w") as file:
        for token in self.proto:
```
mattdangerw (Member):

This won't work! We need to write the proto to a file. Not sure how best to do this but we might want to spy on https://github.com/google/sentencepiece

I should also figure out how to enable our large testing on this branch, which includes saving. That would catch this issue.

nkovela1 (Collaborator, Author):

Great catch!

Fixed. I could simply write self.proto to the file directly (since I have set it as a bytes string), and load_assets simply passes the filepath to set_proto (which already has the capability to load a proto from a filepath).

I think our next PR can focus on e2e testing for the save/load assets workflow.
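A minimal, hypothetical sketch of the resolved design: save_assets writes the proto bytes to disk in binary mode, and load_assets defers to set_proto, which accepts either raw bytes or a filepath. The class name and the VOCAB_FILENAME value are assumptions for illustration, not the actual keras-nlp code.

```python
import os

VOCAB_FILENAME = "vocabulary.spm"  # filename assumed for illustration


class SentencePieceTokenizerSketch:
    """Illustrative save_assets/load_assets flow deferring to set_proto."""

    def __init__(self, proto=None):
        self.proto = None
        if proto is not None:
            self.set_proto(proto)

    def set_proto(self, proto):
        # Pre-existing behavior described above: accept raw bytes
        # or a filepath to a serialized SentencePiece model proto.
        if isinstance(proto, bytes):
            self.proto = proto
        else:
            with open(proto, "rb") as file:
                self.proto = file.read()

    def save_assets(self, dir_path):
        # Write the serialized proto bytes directly, in binary mode
        # (not token by token, and not in text mode).
        path = os.path.join(dir_path, VOCAB_FILENAME)
        with open(path, "wb") as file:
            file.write(self.proto)

    def load_assets(self, dir_path):
        # Defer to set_proto, which handles loading from a filepath.
        path = os.path.join(dir_path, VOCAB_FILENAME)
        self.set_proto(path)
```

Saving then loading through a directory should round-trip the proto bytes unchanged.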

@nkovela1 nkovela1 requested a review from mattdangerw November 21, 2023 19:17
@mattdangerw (Member) left a comment

Looks good! One last potential hiccup.

@nkovela1 nkovela1 merged commit 4ca2516 into keras-team:kaggle Nov 21, 2023
mattdangerw pushed a commit that referenced this pull request Nov 29, 2023
…radigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets

* Convert albert to new assets paradigm

* Convert DebertaV3 to new assets paradigm

* Fix formatting issues

* Convert FNet to new assets paradigm

* Convert XLMRoberta to new assets paradigm

* Convert T5 Tokenizer to new assets paradigm

* Fix sentencepiece tokenizer config test

* Change set_vocabulary to set_proto

* Change proto to raw proto

* Change to proto_bytes
mattdangerw pushed a commit that referenced this pull request Dec 7, 2023
mattdangerw pushed a commit that referenced this pull request Dec 7, 2023
mattdangerw pushed a commit that referenced this pull request Dec 8, 2023
sampathweb pushed a commit that referenced this pull request Dec 12, 2023
mattdangerw pushed a commit that referenced this pull request Jan 4, 2024