Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Patch BigBird tokenization test #12653

Merged
merged 1 commit into from
Jul 13, 2021
Merged

Conversation

LysandreJik
Copy link
Member

@LysandreJik LysandreJik commented Jul 12, 2021

This patches the BigBird integration test.

The core of the issue is that the [MASK] token is an AddedToken with lstrip=True. It, therefore, gobbles up the spaces on the left without getting a sentence piece underline.

Therefore, when decoding, the internal sentence piece tokenizer is unaware that it should add a space in front of the [MASK] token.

However, the original tokenizer does correctly decode with the space, so I believe there's an issue with our implementation.

@vasudevgupta7 do you know of the difference between the two implementations? Also cc @n1t0 and @SaulLu

Do not merge this as this isn't the correct fix :)

@LysandreJik LysandreJik changed the title Patch BigBird tokenization test [WIP] Patch BigBird tokenization test Jul 12, 2021
@LysandreJik LysandreJik marked this pull request as draft July 12, 2021 14:00
@thevasudevgupta
Copy link
Contributor

Hey @LysandreJik,

Even original tokenizer is not introducing space before [MASK], so I think tokenizer is alright & the test is wrong instead.

wget https://huggingface.co/google/bigbird-roberta-base/resolve/main/spiece.model
s = spm.SentencePieceProcessor(model_file='spiece.model')
s.decode([7434, 9894, 67, 9894, 7434])

@LysandreJik
Copy link
Member Author

Great, then merging this! Thanks @vasudevgupta7

@LysandreJik LysandreJik marked this pull request as ready for review July 13, 2021 06:53
@LysandreJik LysandreJik merged commit a6938c4 into master Jul 13, 2021
@LysandreJik LysandreJik deleted the patch-tokenization-test branch July 13, 2021 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants