
Refactor slow sentencepiece tokenizers. #11716

Conversation

@PhilipMay (Contributor) commented May 13, 2021

PR for #11646

ToDo

  • AlbertTokenizer
  • BarthezTokenizer
  • BertGenerationTokenizer
  • BigBirdTokenizer
  • CamembertTokenizer
  • DebertaV2Tokenizer
  • M2M100Tokenizer
  • MarianTokenizer
  • MBart50Tokenizer
  • PegasusTokenizer
  • ReformerTokenizer
  • Speech2TextTokenizer
  • T5Tokenizer
  • XLMProphetNetTokenizer
  • XLMRobertaTokenizer
  • XLNetTokenizer

@PhilipMay (Contributor, Author)

SentencePieceProcessor.decode does "the same, but more" than SentencePieceProcessor.decode_pieces.
That is why this PR replaces SentencePieceProcessor.decode_pieces with SentencePieceProcessor.decode.

See here:

https://github.com/google/sentencepiece/blob/6256ef243844e5848499cf519eb2a7e2755e75a1/python/src/sentencepiece/__init__.py#L307
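The relationship can be sketched with a minimal stand-in. This is a hypothetical stub (a real sentencepiece.SentencePieceProcessor needs a trained .model file), mimicking only the piece-decoding behaviour relevant to this PR — decode is the broader entry point that also handles string pieces:

```python
# Hypothetical stub: a real SentencePieceProcessor requires a trained
# .model file, so this mimics only the piece-decoding behaviour.
SPIECE_UNDERLINE = "▁"

class FakeSentencePieceProcessor:
    def decode_pieces(self, pieces):
        # Detokenize string pieces: "▁" marks a word boundary.
        return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()

    def decode(self, input):
        # decode accepts string pieces (delegating to the pieces path)
        # as well as ids, so it subsumes decode_pieces.
        if input and isinstance(input[0], str):
            return self.decode_pieces(input)
        raise NotImplementedError("id decoding would need a real model")

sp = FakeSentencePieceProcessor()
pieces = ["▁Hello", "▁world", "."]
assert sp.decode(pieces) == sp.decode_pieces(pieces)
print(sp.decode(pieces))  # Hello world.
```

Since decode covers the pieces case, calling it instead of decode_pieces loses nothing.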

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from acf704f to b176ef5 on May 16, 2021, 18:32
@PhilipMay (Contributor, Author)

Rebased on upstream/master.

@PhilipMay PhilipMay changed the title [WIP] Delegate to sentencepiece.decode to implement convert_tokens_to_string. [WIP] Refactor slow sentencepiece tokenizers. May 16, 2021
@PhilipMay (Contributor, Author)

We need to rebase on master after PR #11737 has been merged.

@PhilipMay (Contributor, Author)

Rebased on master - CI is green again. :-)

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 8dc9db6 to 4000bed on June 1, 2021, 16:03
@PhilipMay (Contributor, Author)

Rebased on master to get integration tests - see #11737

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 4000bed to 9a06625 on June 1, 2021, 19:10
@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 9a06625 to 53c9e49 on June 16, 2021, 06:17
@PhilipMay (Contributor, Author)

Rebased on master

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch 3 times, most recently from 622bf36 to 5170463 on June 16, 2021, 19:10
"""
return self.sp_model.encode(text, out_type=str)

def _tokenize_special(self, text: str) -> List[str]:
@PhilipMay (Contributor, Author)

Hey @LysandreJik - I would like to hear your feedback about this function.
Is it cool to refactor it into the base class? Or is it overengineered?

Thanks
Philip

@LysandreJik (Member)

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

@LysandreJik (Member) left a comment

This is an interesting proposal! I'm not sure I understand everything, so I'm asking a few questions :)

"""
return self.sp_model.encode(text, out_type=str)

def _tokenize_special(self, text: str) -> List[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

Review comment on this diff hunk:

@@ -770,3 +772,172 @@ def _decode(
            return clean_text
        else:
            return text


class PreTrainedSentencepieceTokenizer(PreTrainedTokenizer):
@LysandreJik (Member)

I'm not too keen on having an additional abstraction layer PreTrainedSentencepieceTokenizer.

I thought the original idea was to replace instances of "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip() by self._tokenizer.decode(tokens) .

Could you explain why the changes proposed here are necessary?

@PhilipMay (Contributor, Author)

Could you explain why the changes proposed here are necessary?

Well, while doing this refactoring I noticed a lot of duplicate code in the tokenizer implementations.
Since "WET" (write everything twice) code is hard to maintain, I tried to refactor it.

But if you do not like my refactoring I can just do the single change and that's it.

@PhilipMay (Contributor, Author) commented Jun 23, 2021

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

@LysandreJik
Yes, I also prefer a low number of abstraction layers. At the same time I like DRY code. There is 100% duplicate code in the tokenizer implementations that has just been duplicated by copy & paste. IMO that should be removed by a refactoring. That is what I am trying to introduce here.

@LysandreJik (Member)

The general approach of the library is to keep the number of abstractions as low as possible, and to keep implementations as separate as possible from each other, hence the high amount of copy-pasted code.

We want users to be able to experiment with single models/tokenizers without their changes impacting other models or tokenizers - and we want them to be able to understand how a model or tokenizer behaves by simply checking a single file, rather than having to hop around multiple files.

We are failing at this with tokenizers as there are already two levels of abstraction, but adding a third one isn't really the direction we want to head to :)

Does that make sense?

@PhilipMay (Contributor, Author) commented Jun 28, 2021

Does that make sense?

Yes. Sure. Your project, your call.
I will revert my changes and keep it as simple as possible as discussed in the beginning.

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch 5 times, most recently from 9a4d6fa to 960a76f on July 19, 2021, 15:52
@PhilipMay PhilipMay changed the title [WIP] Refactor slow sentencepiece tokenizers. Refactor slow sentencepiece tokenizers. Jul 19, 2021
@PhilipMay (Contributor, Author)

@LysandreJik I have redone the PR. Everything is green and the changes are as simple as planned in the issue.
This is ready for review.

Everything is tested by setting test_sentencepiece = True in the tokenizer test classes and by the following
test function: TokenizerTesterMixin.test_sentencepiece_tokenize_and_convert_tokens_to_string
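The idea of such a roundtrip check can be sketched as follows. This is an illustrative sketch, not the actual TokenizerTesterMixin code: StubSpmTokenizer and the checker function are hypothetical names, and the stub stands in for a slow sentencepiece tokenizer since no trained .model file is available here.

```python
SPIECE_UNDERLINE = "▁"

class StubSpmTokenizer:
    # Hypothetical stand-in for a slow sentencepiece tokenizer: a real one
    # would delegate to a trained SentencePieceProcessor model.
    def tokenize(self, text):
        # Prefix each word with "▁", as sentencepiece marks word boundaries.
        return [SPIECE_UNDERLINE + word for word in text.split()]

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

def check_tokenize_and_convert_tokens_to_string(tokenizer, text):
    # Mirrors the idea of the mixin test: tokenize, detokenize,
    # and compare the result with the original text.
    tokens = tokenizer.tokenize(text)
    assert tokenizer.convert_tokens_to_string(tokens) == text
    return True

print(check_tokenize_and_convert_tokens_to_string(StubSpmTokenizer(), "Hello world"))  # True
```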

@PhilipMay PhilipMay requested a review from LysandreJik July 19, 2021 16:14
@LysandreJik (Member) left a comment

Thank you @PhilipMay, LGTM!
