Convert SentencePieceTokenizer and associated models to new assets paradigm #1323
Conversation
Looks good! Found a few things.
keras_nlp/models/t5/t5_tokenizer.py (outdated diff)
```python
def set_vocabulary(self, proto):
    super().set_vocabulary(proto)
    if proto is not None:
        for token in [self.pad_token]:
```
Is this a preexisting bug, that `self.end_token` was not included here?
You're right, this is likely a preexisting bug. Fixed!
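For context, a minimal sketch of what the corrected check could look like (the method is renamed to `set_proto` later in this thread; the helper names `get_vocabulary` and `token_to_id` and the error message are assumptions based on typical KerasNLP tokenizer code, not taken from this diff):

```python
def set_vocabulary(self, proto):
    super().set_vocabulary(proto)
    if proto is not None:
        # Validate every special token, not just pad_token.
        for token in [self.pad_token, self.end_token]:
            if token not in self.get_vocabulary():
                raise ValueError(
                    f"Cannot find special token `'{token}'` in the provided "
                    "`vocabulary`."
                )
        self.pad_token_id = self.token_to_id(self.pad_token)
        self.end_token_id = self.token_to_id(self.end_token)
    else:
        self.pad_token_id = None
        self.end_token_id = None
```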
```python
    path = os.path.join(dir_path, VOCAB_FILENAME)
    self.set_vocabulary(path)

def set_vocabulary(self, proto):
```
I think we should just call this `set_proto`. A little weird, but matches the init arg naming we have today. And this should not be set on the results of `get_vocabulary`, that would blow up.

- `word_piece_tokenizer.set_vocabulary`
- `sentence_piece_tokenizer.set_proto`
- `byte_pair_tokenizer.set_vocabulary_and_merges`
Got it, I've switched them to `set_proto` throughout. Thanks!
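As a rough illustration of the renamed method, here is a minimal sketch of a `set_proto` that accepts either serialized proto bytes or a path to a SentencePiece model file; the type checks and the omitted tokenizer-rebuild step are assumptions for illustration, not code from this PR:

```python
def set_proto(self, proto):
    if proto is None:
        self.proto = None
        return
    if isinstance(proto, str):
        # Treat a string as a path to a serialized SentencePiece model file.
        with open(proto, "rb") as file:
            proto_bytes = file.read()
    elif isinstance(proto, bytes):
        proto_bytes = proto
    else:
        raise ValueError(
            "`proto` must be a `bytes` object or a path to a SentencePiece "
            f"model file. Received: {proto}"
        )
    self.proto = proto_bytes
    # ...rebuild the underlying SentencePiece tokenizer from proto_bytes...
```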
```python
def save_assets(self, dir_path):
    path = os.path.join(dir_path, VOCAB_FILENAME)
    with open(path, "w") as file:
        for token in self.proto:
```
This won't work! We need to write the proto to a file. Not sure how best to do this, but we might want to spy on https://github.com/google/sentencepiece
I should also figure out how to enable our large testing on this branch, which includes saving. That would catch this issue.
Great catch!
Fixed, I can simply write `self.proto` to the file directly (since I have set this as a bytes string), and `load_assets` will simply pass the filepath to `set_proto` (which already has the ability to load a proto from a filepath).
I think our next PR can focus on e2e testing for the save/load assets workflow.
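For reference, a minimal sketch of the corrected save/load pair under that approach, written as they might appear on the tokenizer class (the asset filename value is an assumption; `VOCAB_FILENAME` itself appears in the diff above):

```python
import os

VOCAB_FILENAME = "vocabulary.spm"  # assumed asset filename

def save_assets(self, dir_path):
    path = os.path.join(dir_path, VOCAB_FILENAME)
    # self.proto is stored as bytes, so write it out in binary mode.
    with open(path, "wb") as file:
        file.write(self.proto)

def load_assets(self, dir_path):
    path = os.path.join(dir_path, VOCAB_FILENAME)
    # set_proto already knows how to load a serialized proto from a filepath.
    self.set_proto(path)
```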
Looks good! One last potential hiccup.
Convert SentencePieceTokenizer and associated models to new assets paradigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets
* Convert albert to new assets paradigm
* Convert DebertaV3 to new assets paradigm
* Fix formatting issues
* Convert FNet to new assets paradigm
* Convert XLMRoberta to new assets paradigm
* Convert T5 Tokenizer to new assets paradigm
* Fix sentencepiece tokenizer config test
* Change set_vocabulary to set_proto
* Change proto to raw proto
* Change to proto_bytes
This PR converts the SentencePieceTokenizer to the new save_assets/load_assets logic and converts the associated models that defer their vocabulary handling to it: albert, DebertaV3, FNet, XLMRoberta, and T5.