
Convert SentencePieceTokenizer and associated models to new assets paradigm #1323

Merged
merged 11 commits on Nov 21, 2023

Conversation

nkovela1 (Collaborator)

This PR converts SentencePieceTokenizer to use the save_assets/load_assets logic and converts the following models to defer their associated vocabulary logic to the tokenizer:

  • Albert
  • DebertaV3
  • FNet
  • XLMRoberta
  • T5

@nkovela1 nkovela1 requested a review from mattdangerw November 21, 2023 07:07
@mattdangerw (Member) left a comment

Looks good! Found a few things.

```python
def set_vocabulary(self, proto):
    super().set_vocabulary(proto)
    if proto is not None:
        for token in [self.pad_token]:
```
mattdangerw (Member):

Is this a preexisting bug, that self.end_token was not included here?

nkovela1 (Collaborator, Author):

You're right, this is likely a preexisting bug. Fixed!
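To illustrate the fix, here is a minimal, hypothetical sketch of what the corrected validation looks like: the loop now checks end_token alongside pad_token. The class name, token values, and the plain `vocabulary` list are stand-ins for illustration, not the actual keras-nlp implementation.

```python
class ModelTokenizerSketch:
    """Hypothetical stand-in for a model tokenizer such as AlbertTokenizer."""

    pad_token = "<pad>"
    end_token = "</s>"

    def __init__(self, vocabulary):
        # `vocabulary` stands in for the tokens decoded from the proto.
        self.vocabulary = vocabulary

    def set_proto(self, proto):
        if proto is not None:
            # The fix: validate end_token as well as pad_token.
            for token in [self.pad_token, self.end_token]:
                if token not in self.vocabulary:
                    raise ValueError(
                        f"Cannot find token `{token}` in the provided vocabulary."
                    )
```

With only pad_token in the list, a vocabulary missing `</s>` would pass validation silently; checking both tokens surfaces the error at set time.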

```python
path = os.path.join(dir_path, VOCAB_FILENAME)
self.set_vocabulary(path)

def set_vocabulary(self, proto):
```
mattdangerw (Member):

I think we should just call this set_proto. A little weird, but it matches the init arg naming we have today. Also, this should not be set to the result of get_vocabulary; that would blow up.

```
word_piece_tokenizer.set_vocabulary
sentence_piece_tokenizer.set_proto
byte_pair_tokenizer.set_vocabulary_and_merges
```

nkovela1 (Collaborator, Author):

Got it, I've switched them to set_proto throughout. Thanks!

```python
def save_assets(self, dir_path):
    path = os.path.join(dir_path, VOCAB_FILENAME)
    with open(path, "w") as file:
        for token in self.proto:
```
mattdangerw (Member):

This won't work! We need to write the proto to a file. Not sure how best to do this but we might want to spy on https://github.com/google/sentencepiece

I should also figure out how to enable our large testing on this branch, which includes saving. That would catch this issue.

nkovela1 (Collaborator, Author):

Great catch!

Fixed. I could simply write self.proto to the file directly (since I have set it as a bytes string), and load_assets simply passes the filepath to set_proto (which already has the capability to load a proto from a filepath).

I think our next PR can focus on e2e testing for the save/load assets workflow.
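A minimal, hypothetical sketch of the resolved design: save_assets writes the proto bytes to disk in binary mode, and load_assets defers to set_proto, which accepts either raw bytes or a filepath. The class name and the VOCAB_FILENAME value are assumptions for illustration, not the actual keras-nlp code.

```python
import os

VOCAB_FILENAME = "vocabulary.spm"  # filename assumed for illustration


class SentencePieceTokenizerSketch:
    """Illustrative save_assets/load_assets flow deferring to set_proto."""

    def __init__(self, proto=None):
        self.proto = None
        if proto is not None:
            self.set_proto(proto)

    def set_proto(self, proto):
        # Pre-existing behavior described above: accept raw bytes
        # or a filepath to a serialized SentencePiece model proto.
        if isinstance(proto, bytes):
            self.proto = proto
        else:
            with open(proto, "rb") as file:
                self.proto = file.read()

    def save_assets(self, dir_path):
        # Write the serialized proto bytes directly, in binary mode
        # (not token by token, and not in text mode).
        path = os.path.join(dir_path, VOCAB_FILENAME)
        with open(path, "wb") as file:
            file.write(self.proto)

    def load_assets(self, dir_path):
        # Defer to set_proto, which handles loading from a filepath.
        path = os.path.join(dir_path, VOCAB_FILENAME)
        self.set_proto(path)
```

Saving then loading through a directory should round-trip the proto bytes unchanged.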

@nkovela1 nkovela1 requested a review from mattdangerw November 21, 2023 19:17
@mattdangerw (Member) left a comment

Looks good! One last potential hiccup.

@nkovela1 nkovela1 merged commit 4ca2516 into keras-team:kaggle Nov 21, 2023
mattdangerw pushed a commit that referenced this pull request Nov 29, 2023
…radigm (#1323)

* Convert SentencePiece tokenizer to save_assets/load_assets

* Convert albert to new assets paradigm

* Convert DebertaV3 to new assets paradigm

* Fix formatting issues

* Convert FNet to new assets paradigm

* Convert XLMRoberta to new assets paradigm

* Convert T5 Tokenizer to new assets paradigm

* Fix sentencepiece tokenizer config test

* Change set_vocabulary to set_proto

* Change proto to raw proto

* Change to proto_bytes
mattdangerw pushed a commit that referenced this pull request Dec 7, 2023
mattdangerw pushed a commit that referenced this pull request Dec 7, 2023
mattdangerw pushed a commit that referenced this pull request Dec 8, 2023
sampathweb pushed a commit that referenced this pull request Dec 12, 2023
mattdangerw pushed a commit that referenced this pull request Jan 4, 2024