
Byte Level BPE Tokenizer (GPT2/RoBERTa) #763

Closed
sinamoeini opened this issue Nov 9, 2021 · 5 comments

Comments

@sinamoeini

Hi TensorFlow team,

Is there going to be a byte-level BPE tokenizer in TensorFlow Text?

@thuang513
Member

BPE is already supported by SentencePiece. If you have a SentencePiece model, you can use it with the SentencePiece op.
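
For example, a minimal sketch of loading a pre-trained SentencePiece model with the TensorFlow Text op might look like this (the model filename is hypothetical and assumes a serialized SentencePiece model proto is available):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Read the serialized SentencePiece model proto (filename is hypothetical).
model = tf.io.gfile.GFile("spm.model", "rb").read()
tokenizer = tf_text.SentencepieceTokenizer(model=model)

# Tokenize a batch of strings into subword ids (returned as a RaggedTensor).
ids = tokenizer.tokenize(["hello world"])
print(ids)

# Round-trip the ids back to strings.
print(tokenizer.detokenize(ids))
```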

@sinamoeini
Author

Thank you @thuang513. So SentencePiece and byte-level BPE only differ in the training phase, right? If I have a trained byte-level BPE model, I should be able to use SentencePiece and just give it the vocab and merges.

@thuang513
Member

I believe so. If you trained the model using SentencePiece, you should be able to use it as is.

@sinamoeini
Author

I dug a bit more, and it seems there is a small difference in how they treat spaces. I will try to replicate the Hugging Face RoBERTa tokenizer using the TensorFlow SentencePiece tokenizer and update this thread.
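
For illustration, here is a small sketch of that whitespace difference using the Hugging Face `transformers` tokenizers (the model names are just examples of a byte-level BPE tokenizer and a SentencePiece-based one; exact token output may vary by version):

```python
from transformers import RobertaTokenizer, XLNetTokenizer

# Byte-level BPE (GPT-2/RoBERTa): a word-initial space is encoded into the
# token itself as the "Ġ" byte marker.
bbpe = RobertaTokenizer.from_pretrained("roberta-base")
print(bbpe.tokenize("Hello world"))   # e.g. ['Hello', 'Ġworld']

# SentencePiece-based tokenizer: words are instead prefixed with the
# "▁" (U+2581) meta symbol that replaces whitespace.
spm = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(spm.tokenize("Hello world"))    # e.g. ['▁Hello', '▁world']
```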

@MarkusSagen

@sinamoeini Any updates on this?
