[WIP] Patch BigBird tokenization test #12653

LysandreJik · 2021-07-12T13:58:55Z

This patches the BigBird integration test.

The core of the issue is that the [MASK] token is an AddedToken with lstrip=True. It therefore gobbles up the spaces on its left without getting a SentencePiece underline (▁).

As a result, when decoding, the internal SentencePiece tokenizer is unaware that it should add a space in front of the [MASK] token.
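To make the mechanism concrete, here is a minimal pure-Python sketch of the behavior described above. It is a toy model, not the transformers or SentencePiece implementation: `split_on_added_token`, `toy_encode`, and `toy_decode` are hypothetical helpers, and the "▁" handling only imitates how SentencePiece marks spaces.

```python
import re

MASK = "[MASK]"


def split_on_added_token(text, lstrip):
    """Split text around MASK. With lstrip=True the whitespace to the
    left of MASK is absorbed into the match and therefore discarded."""
    pattern = (r"\s*" if lstrip else "") + re.escape(MASK)
    pieces, pos = [], 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            pieces.append(text[pos:match.start()])
        pieces.append(MASK)
        pos = match.end()
    if pos < len(text):
        pieces.append(text[pos:])
    return pieces


def toy_encode(text, lstrip):
    """Toy stand-in for the SentencePiece step (an assumption, not the
    real algorithm): each space becomes a '▁' marker glued to the
    following word; the added token is passed through untouched."""
    tokens = []
    for piece in split_on_added_token(text, lstrip):
        if piece == MASK:
            tokens.append(MASK)
        else:
            marked = piece.replace(" ", "▁")
            tokens.extend(re.findall(r"▁?[^▁]+|▁", marked))
    return tokens


def toy_decode(tokens):
    """Join the tokens and turn every '▁' marker back into a space."""
    return "".join(tokens).replace("▁", " ")


text = "Paris is the [MASK]."
print(toy_decode(toy_encode(text, lstrip=True)))   # Paris is the[MASK].
print(toy_decode(toy_encode(text, lstrip=False)))  # Paris is the [MASK].
```

With lstrip=True no token carries a marker for the space before [MASK], so the decoder has nothing to restore it from; with lstrip=False the trailing "▁" of the left segment survives and the round trip is lossless.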

However, the original tokenizer does correctly decode with the space, so I believe there's an issue with our implementation.

@vasudevgupta7 do you know of the difference between the two implementations? Also cc @n1t0 and @SaulLu

Do not merge this as this isn't the correct fix :)

thevasudevgupta · 2021-07-12T22:46:27Z

Hey @LysandreJik,

Even the original tokenizer does not introduce a space before [MASK], so I think our tokenizer is alright and the test is wrong instead.

# Shell step: download the original SentencePiece model first
# wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model
import sentencepiece as spm
s = spm.SentencePieceProcessor(model_file='spiece.model')
s.decode([7434, 9894, 67, 9894, 7434])

LysandreJik · 2021-07-13T06:52:58Z

Great, then merging this! Thanks @vasudevgupta7

Patch BigBird tokenization test

71d012f

LysandreJik changed the title ~~Patch BigBird tokenization test~~ [WIP] Patch BigBird tokenization test Jul 12, 2021

LysandreJik marked this pull request as draft July 12, 2021 14:00

LysandreJik marked this pull request as ready for review July 13, 2021 06:53

LysandreJik merged commit a6938c4 into master Jul 13, 2021

LysandreJik deleted the patch-tokenization-test branch July 13, 2021 06:53
