Fix + Test #8049

LysandreJik · 2020-10-26T13:21:17Z

Fix an edge case of the blenderbot-90 tokenizer.

Context

If the blenderbot-90 tokenizer is used to tokenize the following sequence:

sequence = "Ok ."

It is first split into two tokens:

transformers/src/transformers/tokenization_blenderbot.py

Line 221 in 8bbe824

split_tokens.extend([t for t in self.bpe(token).split(" ")])

Those two tokens are ['Ok', '.'].

The issue is that, when passed the second token, the bpe method will convert it from '.' to ' .' here:

transformers/src/transformers/tokenization_blenderbot.py

Line 160 in 8bbe824

token = re.sub("([.,!?()])", r" \1", token)

This then gets split on spaces here:

transformers/src/transformers/tokenization_blenderbot.py

Line 166 in 8bbe824

tokens = token.split(" ")

This is where the issue lies, as it creates two strings: ["", "."], the first one being empty.
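The substitution and the split can be reproduced in isolation with the standard library. This is a minimal sketch that copies only the two quoted lines, not the full bpe method; the variable names `spaced` and `tokens` are illustrative.

```python
import re

token = "."  # the second token obtained from "Ok ."

# Same substitution as the quoted line 160: put a space before punctuation.
spaced = re.sub("([.,!?()])", r" \1", token)

# Same split as the quoted line 166.
tokens = spaced.split(" ")

print(repr(spaced))  # ' .'
print(tokens)        # ['', '.']
```

Because the substituted string starts with a space, splitting on " " yields an empty string before the punctuation mark.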

It then crashes a few lines later with an IndexError, because word[-1] is evaluated for the empty token:

transformers/src/transformers/tokenization_blenderbot.py

Line 171 in 8bbe824

word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
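The failure can be shown on its own with the quoted line. This sketch assumes the token is turned into a tuple of characters before that line runs (the conversion is not shown in the excerpt above); indexing fails the same way if `word` is an empty string.

```python
token = ""           # the empty piece produced by the split
word = tuple(token)  # assumption: characters are collected into a tuple first

try:
    # Same expression as the quoted line 171.
    word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
    error = None
except IndexError as exc:
    error = exc

print(type(error).__name__)  # IndexError
```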

Proposal

Check that each token is non-empty before processing it, and skip any empty token.

Added a test.
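A rough sketch of the guard, under stated assumptions: `mark_word_ends` is a hypothetical stand-in for the first part of bpe, not the library's method, and the `continue` is an assumption about how an empty token is ignored.

```python
import re


def mark_word_ends(token: str) -> list:
    """Hypothetical helper: one character tuple per non-empty piece."""
    token = re.sub("([.,!?()])", r" \1", token)
    words = []
    for piece in token.split(" "):
        if not len(piece):
            continue  # assumption: empty pieces are simply skipped
        word = tuple(piece)
        # Append the end-of-word marker to the last character.
        word = tuple(list(word[:-1]) + [word[-1] + "</w>"])
        words.append(word)
    return words


print(mark_word_ends("."))   # [('.</w>',)]
print(mark_word_ends("Ok"))  # [('O', 'k</w>')]
```

With the guard in place, a lone "." no longer raises and produces a single end-of-word piece.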

sshleifer

Great catch!

sshleifer · 2020-10-26T15:28:09Z

src/transformers/tokenization_blenderbot.py

@@ -166,6 +166,9 @@ def bpe(self, token: str) -> str:
        tokens = token.split(" ")
        words = []
        for token in tokens:
+            if not len(token):


"if not token" also works.

LysandreJik

You're right!
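The two checks agree for strings because an empty string is falsy in Python. A small sketch with illustrative names:

```python
pieces = ["", ".", "Ok"]

# Explicit length check, as in the diff above.
empty_by_len = [p for p in pieces if not len(p)]

# Truthiness check, as suggested in the review.
empty_by_truthiness = [p for p in pieces if not p]

print(empty_by_len == empty_by_truthiness)  # True
```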


Fix + Test

dfebed4

LysandreJik requested a review from sshleifer October 26, 2020 13:21

sshleifer approved these changes Oct 26, 2020


LysandreJik merged commit cbad90d into master Oct 26, 2020

LysandreJik deleted the fix-blenderbot-90-tokenizer branch October 26, 2020 16:32

fabiocapsouza pushed a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020

Fix + Test (huggingface#8049)

2ad8c38

fabiocapsouza added a commit to fabiocapsouza/transformers that referenced this pull request Nov 15, 2020

Revert "Fix + Test (huggingface#8049)"

d12a3e4

This reverts commit 2ad8c38.
