
Byte Level BPE Tokenizer (GPT2/RoBERTa) #763

Closed
sinamoeini opened this issue Nov 9, 2021 · 5 comments

Comments

@sinamoeini

Hi TensorFlow team,

Is there going to be a byte-level BPE tokenizer in TensorFlow Text?

@thuang513
Member

BPE is already supported by SentencePiece. If you have a SentencePiece model, you can use it with the SentencePiece op.
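
For example, a minimal sketch of loading a pre-trained SentencePiece model with the TensorFlow Text op might look like this (the model filename is hypothetical and assumes a serialized SentencePiece model proto is available):

```python
import tensorflow as tf
import tensorflow_text as tf_text

# Read the serialized SentencePiece model proto (filename is hypothetical).
model = tf.io.gfile.GFile("spm.model", "rb").read()
tokenizer = tf_text.SentencepieceTokenizer(model=model)

# Tokenize a batch of strings into subword ids (returned as a RaggedTensor).
ids = tokenizer.tokenize(["hello world"])
print(ids)

# Round-trip the ids back to strings.
print(tokenizer.detokenize(ids))
```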

@sinamoeini
Author

Thank you @thuang513. So SentencePiece and byte-level BPE only differ in the training phase, right? If I have a trained byte-level BPE model, I should be able to use SentencePiece and just give it the vocab and merges.

@thuang513
Member

I believe so. If you trained the model using SentencePiece, you should be able to use it as is.

@sinamoeini
Author

I dug a bit more, and it seems there is a small difference in how they treat spaces. I will try to replicate the Hugging Face RoBERTa tokenizer using the TensorFlow SentencePiece tokenizer and update this thread.
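
For illustration, here is a small sketch of that whitespace difference using the Hugging Face `transformers` tokenizers (the model names are just examples of a byte-level BPE tokenizer and a SentencePiece-based one; exact token output may vary by version):

```python
from transformers import RobertaTokenizer, XLNetTokenizer

# Byte-level BPE (GPT-2/RoBERTa): a word-initial space is encoded into the
# token itself as the "Ġ" byte marker.
bbpe = RobertaTokenizer.from_pretrained("roberta-base")
print(bbpe.tokenize("Hello world"))   # e.g. ['Hello', 'Ġworld']

# SentencePiece-based tokenizer: words are instead prefixed with the
# "▁" (U+2581) meta symbol that replaces whitespace.
spm = XLNetTokenizer.from_pretrained("xlnet-base-cased")
print(spm.tokenize("Hello world"))    # e.g. ['▁Hello', '▁world']
```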

@MarkusSagen

@sinamoeini Any updates on this?
