Refactor slow sentencepiece tokenizers. #11716
Conversation
See here:
Force-pushed from acf704f to b176ef5
Rebased on upstream/master.
Uses sentencepiece.decode to implement convert_tokens_to_string.
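For context, a minimal sketch of what that change looks like, assuming the slow-tokenizer convention that self.sp_model holds a loaded sentencepiece.SentencePieceProcessor (an illustration, not the exact code from this PR):

```python
from typing import List


def convert_tokens_to_string(self, tokens: List[str]) -> str:
    """Convert a sequence of SentencePiece pieces (strings) back into text.

    Sketch only: the real tokenizers also need to handle added/special
    tokens that are not part of the SentencePiece vocabulary.
    """
    # SentencePieceProcessor.decode accepts a list of pieces and re-joins
    # them, taking care of the "▁" whitespace marker itself.
    return self.sp_model.decode(tokens)
```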
We need to rebase on master after PR #11737 has been merged.
Force-pushed from f121d18 to 8dc9db6
Rebased on master - CI is green again. :-)
Force-pushed from 8dc9db6 to 4000bed
Rebased on master to get integration tests - see #11737
Force-pushed from 4000bed to 9a06625
Force-pushed from 9a06625 to 53c9e49
Rebased on master.
Force-pushed from 622bf36 to 5170463
""" | ||
return self.sp_model.encode(text, out_type=str) | ||
|
||
def _tokenize_special(self, text: str) -> List[str]: |
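For readers less familiar with the SentencePiece Python API, here is a small standalone usage sketch of the encode call shown in the hunk above; the model path is a placeholder:

```python
import sentencepiece as spm

# Load a SentencePiece model file (placeholder path, for illustration only).
sp_model = spm.SentencePieceProcessor()
sp_model.Load("spiece.model")

# out_type=str returns the pieces themselves instead of their ids,
# e.g. something like ['▁Hello', '▁world', '!'] depending on the model.
pieces = sp_model.encode("Hello world!", out_type=str)
print(pieces)
```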
Hey @LysandreJik - I would like to hear your feedback about this function.
Is it cool to refactor it into the base class? Or is it overengineered?
Thanks
Philip
I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.
This is an interesting proposal! I'm not sure I understand everything, so I'm asking a few questions :)
""" | ||
return self.sp_model.encode(text, out_type=str) | ||
|
||
def _tokenize_special(self, text: str) -> List[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.
@@ -770,3 +772,172 @@ def _decode(
            return clean_text
        else:
            return text


class PreTrainedSentencepieceTokenizer(PreTrainedTokenizer):
I'm not too keen on having an additional abstraction layer PreTrainedSentencepieceTokenizer. I thought the original idea was to replace instances of "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip() by self._tokenizer.decode(tokens). Could you explain why the changes proposed here are necessary?
Could you explain why the changes proposed here are necessary?

Well, while doing this refactoring I noticed a lot of duplicate code in the tokenizer implementations. Since wet code is hard to maintain, I tried to refactor it. But if you do not like my refactoring, I can just do the single change and that's it. @LysandreJik
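To make the duplication concrete, an illustrative before/after (a sketch, not the literal diff from this PR) of the pattern repeated across the slow tokenizers and the single call it can be replaced with:

```python
SPIECE_UNDERLINE = "▁"


def convert_tokens_to_string_before(self, tokens):
    # Pattern copy-pasted across many slow tokenizers: join the pieces and
    # strip the SentencePiece whitespace marker by hand.
    return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()


def convert_tokens_to_string_after(self, tokens):
    # Proposed single change: let SentencePiece decode the pieces directly.
    return self.sp_model.decode(tokens)
```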
The general approach of the library is to keep the number of abstractions as low as possible, and to keep implementations as separate as possible from each other, hence the high amount of copy-pasted code. We want users to be able to experiment with single models/tokenizers without their changes impacting other models or tokenizers - and we want them to be able to understand how a model or tokenizer behaves by simply checking a single file, rather than having to hop around multiple files. We are failing at this with tokenizers as there are already two levels of abstraction, but adding a third one isn't really the direction we want to head in. :) Does that make sense?
Yes. Sure. Your project, your call.
Force-pushed from 9a4d6fa to 960a76f
@LysandreJik I have redone the PR. Everything is green and the changes are as simple as planned in the issue. Everything is tested by setting
Thank you @PhilipMay, LGTM!
PR for #11646
ToDo
AlbertTokenizer
BarthezTokenizer
BertGenerationTokenizer
BigBirdTokenizer
CamembertTokenizer
DebertaV2Tokenizer
M2M100Tokenizer
MarianTokenizer
MBart50Tokenizer
PegasusTokenizer
ReformerTokenizer
Speech2TextTokenizer
T5Tokenizer
XLMProphetNetTokenizer
XLMRobertaTokenizer
XLNetTokenizer