
Refactor slow sentencepiece tokenizers. #11716

Conversation

@PhilipMay (Contributor) commented May 13, 2021

PR for #11646

ToDo

  • AlbertTokenizer
  • BarthezTokenizer
  • BertGenerationTokenizer
  • BigBirdTokenizer
  • CamembertTokenizer
  • DebertaV2Tokenizer
  • M2M100Tokenizer
  • MarianTokenizer
  • MBart50Tokenizer
  • PegasusTokenizer
  • ReformerTokenizer
  • Speech2TextTokenizer
  • T5Tokenizer
  • XLMProphetNetTokenizer
  • XLMRobertaTokenizer
  • XLNetTokenizer

@PhilipMay (Contributor, Author)

SentencePieceProcessor.decode does "the same, but more" than SentencePieceProcessor.decode_pieces.
That is why this PR replaces SentencePieceProcessor.decode_pieces with SentencePieceProcessor.decode.

See here:

https://github.com/google/sentencepiece/blob/6256ef243844e5848499cf519eb2a7e2755e75a1/python/src/sentencepiece/__init__.py#L307
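The relationship can be sketched with a minimal stand-in. This is a hypothetical stub (a real sentencepiece.SentencePieceProcessor needs a trained .model file), mimicking only the piece-decoding behaviour relevant to this PR — decode is the broader entry point that also handles string pieces:

```python
# Hypothetical stub: a real SentencePieceProcessor requires a trained
# .model file, so this mimics only the piece-decoding behaviour.
SPIECE_UNDERLINE = "▁"

class FakeSentencePieceProcessor:
    def decode_pieces(self, pieces):
        # Detokenize string pieces: "▁" marks a word boundary.
        return "".join(pieces).replace(SPIECE_UNDERLINE, " ").strip()

    def decode(self, input):
        # decode accepts string pieces (delegating to the pieces path)
        # as well as ids, so it subsumes decode_pieces.
        if input and isinstance(input[0], str):
            return self.decode_pieces(input)
        raise NotImplementedError("id decoding would need a real model")

sp = FakeSentencePieceProcessor()
pieces = ["▁Hello", "▁world", "."]
assert sp.decode(pieces) == sp.decode_pieces(pieces)
print(sp.decode(pieces))  # Hello world.
```

Since decode covers the pieces case, calling it instead of decode_pieces loses nothing.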

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from acf704f to b176ef5 on May 16, 2021, 18:32
@PhilipMay (Contributor, Author)

Rebased on upstream/master.

@PhilipMay PhilipMay changed the title [WIP] Delegate to sentencepiece.decode to implement convert_tokens_to_string. [WIP] Refactor slow sentencepiece tokenizers. May 16, 2021
@PhilipMay (Contributor, Author)

We need to rebase on master after PR #11737 has been merged.

@PhilipMay (Contributor, Author)

Rebased on master - CI is green again. :-)

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 8dc9db6 to 4000bed on June 1, 2021, 16:03
@PhilipMay (Contributor, Author)

Rebased on master to get integration tests - see #11737

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 4000bed to 9a06625 on June 1, 2021, 19:10
@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch from 9a06625 to 53c9e49 on June 16, 2021, 06:17
@PhilipMay (Contributor, Author)

Rebased on master

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch 3 times, most recently from 622bf36 to 5170463 on June 16, 2021, 19:10
"""
return self.sp_model.encode(text, out_type=str)

def _tokenize_special(self, text: str) -> List[str]:
@PhilipMay (Contributor, Author)

Hey @LysandreJik - I would like to hear your feedback about this function.
Is it cool to refactor it into the base class? Or is it overengineered?

Thanks
Philip

@LysandreJik (Member)

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

@LysandreJik (Member) left a comment

This is an interesting proposal! I'm not sure I understand everything, so I'm asking a few questions :)

"""
return self.sp_model.encode(text, out_type=str)

def _tokenize_special(self, text: str) -> List[str]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

Review comment on this diff hunk:

@@ -770,3 +772,172 @@ def _decode(
            return clean_text
        else:
            return text


class PreTrainedSentencepieceTokenizer(PreTrainedTokenizer):
@LysandreJik (Member)

I'm not too keen on having an additional abstraction layer PreTrainedSentencepieceTokenizer.

I thought the original idea was to replace instances of "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip() by self._tokenizer.decode(tokens) .

Could you explain why the changes proposed here are necessary?

@PhilipMay (Contributor, Author)

Could you explain why the changes proposed here are necessary?

Well, while doing this refactoring I noticed a lot of duplicate code in the tokenizer implementations.
Since "WET" (write everything twice) code is hard to maintain, I tried to refactor it.

But if you do not like my refactoring I can just do the single change and that's it.

@PhilipMay (Contributor, Author) commented Jun 23, 2021

I think generally speaking we'd like to have methods that are common to all tokenizers in the base class - but not methods that are common to some of them only. I'd also like to keep the number of abstraction layers to a minimum, tokenizers are already quite tough to understand.

@LysandreJik
Yes, I also prefer a low number of abstraction layers. At the same time I like DRY code. There is 100% duplicate code in the tokenizer implementations that has just been duplicated by copy & paste. IMO that should be removed by a refactoring. That is what I am trying to introduce here.

@LysandreJik (Member)

The general approach of the library is to keep the number of abstractions as low as possible, and to keep implementations as separate as possible from each other, hence the high amount of copy-pasted code.

We want users to be able to experiment with single models/tokenizers without their changes impacting other models or tokenizers - and we want them to be able to understand how a model or tokenizer behaves by simply checking a single file, rather than having to hop around multiple files.

We are failing at this with tokenizers as there are already two levels of abstraction, but adding a third one isn't really the direction we want to head to :)

Does that make sense?

@PhilipMay (Contributor, Author) commented Jun 28, 2021

Does that make sense?

Yes. Sure. Your project, your call.
I will revert my changes and keep it as simple as possible as discussed in the beginning.

@PhilipMay PhilipMay force-pushed the improve_sentencepiece_decode_delegate branch 5 times, most recently from 9a4d6fa to 960a76f on July 19, 2021, 15:52
@PhilipMay PhilipMay changed the title [WIP] Refactor slow sentencepiece tokenizers. Refactor slow sentencepiece tokenizers. Jul 19, 2021
@PhilipMay (Contributor, Author)

@LysandreJik I have redone the PR. Everything is green and the changes are as simple as planned in the issue.
This is ready for review.

Everything is tested by setting test_sentencepiece = True in the tokenizer test classes and by the following
test function: TokenizerTesterMixin.test_sentencepiece_tokenize_and_convert_tokens_to_string
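The idea of such a roundtrip check can be sketched as follows. This is an illustrative sketch, not the actual TokenizerTesterMixin code: StubSpmTokenizer and the checker function are hypothetical names, and the stub stands in for a slow sentencepiece tokenizer since no trained .model file is available here.

```python
SPIECE_UNDERLINE = "▁"

class StubSpmTokenizer:
    # Hypothetical stand-in for a slow sentencepiece tokenizer: a real one
    # would delegate to a trained SentencePieceProcessor model.
    def tokenize(self, text):
        # Prefix each word with "▁", as sentencepiece marks word boundaries.
        return [SPIECE_UNDERLINE + word for word in text.split()]

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens).replace(SPIECE_UNDERLINE, " ").strip()

def check_tokenize_and_convert_tokens_to_string(tokenizer, text):
    # Mirrors the idea of the mixin test: tokenize, detokenize,
    # and compare the result with the original text.
    tokens = tokenizer.tokenize(text)
    assert tokenizer.convert_tokens_to_string(tokens) == text
    return True

print(check_tokenize_and_convert_tokens_to_string(StubSpmTokenizer(), "Hello world"))  # True
```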

@PhilipMay PhilipMay requested a review from LysandreJik July 19, 2021 16:14
@LysandreJik (Member) left a comment

Thank you @PhilipMay, LGTM!
