Byte Level BPE Tokenizer (GPT2/RoBERTa) #763
Hi TensorFlow team,
Is there going to be a byte-level BPE tokenizer in TensorFlow Text?
Comments
BPE is already supported by SentencePiece. If you have a SentencePiece model, you can use it with the sentencepiece op.
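A minimal sketch of what that could look like with TensorFlow Text, assuming you already have a trained SentencePiece model on disk (the file name m.model below is a placeholder, not something from this thread):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Load a serialized SentencePiece model trained elsewhere ("m.model" is a placeholder path).
with open("m.model", "rb") as f:
    sp_model = f.read()

# The SentencepieceTokenizer op runs the model inside the TF graph.
tokenizer = tf_text.SentencepieceTokenizer(model=sp_model, out_type=tf.string)
print(tokenizer.tokenize(["Hello world"]))  # RaggedTensor of subword pieces
```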
Thank you @thuang513. So SentencePiece and byte-level BPE only differ in the training phase, right? If I have a trained byte-level BPE, I should be able to use SentencePiece and just give it the vocab and merges.
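For context, the vocab and merges referred to here are the vocab.json and merges.txt files of a GPT-2/RoBERTa-style byte-level BPE. A rough sketch of loading them with the Hugging Face tokenizers library (the file paths are placeholders, not from this thread):

```python
from tokenizers import ByteLevelBPETokenizer

# "vocab.json" and "merges.txt" are placeholder paths to a trained byte-level BPE.
tokenizer = ByteLevelBPETokenizer("vocab.json", "merges.txt")
encoding = tokenizer.encode("Hello world")
print(encoding.tokens)  # e.g. ['Hello', 'Ġworld'], with 'Ġ' marking a preceding space
```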
I believe so. If you trained the model using SentencePiece, you should be able to use it as is.
I dug a bit more, and it seems there is a small difference in how they treat spaces. I will try to replicate the Hugging Face RoBERTa tokenizer using the TensorFlow SentencePiece tokenizer and update this thread.
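The space-handling difference mentioned here, roughly: GPT-2/RoBERTa byte-level BPE folds a preceding space into the token itself (shown as 'Ġ'), while SentencePiece replaces spaces with the '▁' meta symbol before segmenting. A small illustration, assuming the transformers and sentencepiece packages and a placeholder SentencePiece model file:

```python
from transformers import RobertaTokenizer
import sentencepiece as spm

# Byte-level BPE: the space before "world" becomes part of the token, shown as 'Ġ'.
roberta = RobertaTokenizer.from_pretrained("roberta-base")
print(roberta.tokenize("Hello world"))         # e.g. ['Hello', 'Ġworld']

# SentencePiece: spaces are replaced by the '▁' meta symbol before segmentation.
sp = spm.SentencePieceProcessor(model_file="m.model")  # "m.model" is a placeholder
print(sp.encode("Hello world", out_type=str))  # e.g. ['▁Hello', '▁world']
```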
@sinamoeini Any updates on this?