Hugging Face tokenizers backed by SentencePiece encode and decode inconsistently: if you encode and then decode a string containing special characters, extra whitespace is inserted.
The expected behaviour is to get the exact same string back.
This occurs with the Llama2 tokenizer, the gpt-sw3 tokenizers, and others.
System Info
transformers version: 4.35.2
Who can help?
@ArthurZucker
Reproduction
https://colab.research.google.com/drive/1vujbKaRkIpk7qli7eUKAZQDRksHSRW51?usp=sharing
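For readers who cannot open the notebook, here is a minimal sketch of the kind of round-trip check described in this issue. It is not the notebook's code: the helper names `find_roundtrip_mismatches` and `check_hf_tokenizer` are made up for illustration, and the model name and test strings are left for the caller to supply. It assumes only the standard `AutoTokenizer.from_pretrained`, `encode`, and `decode` calls from `transformers`.

```python
from typing import Callable, Iterable, List, Tuple


def find_roundtrip_mismatches(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> List[Tuple[str, str]]:
    """Return (original, decoded) pairs where decode(encode(text)) != text."""
    mismatches = []
    for text in texts:
        decoded = decode(encode(text))
        if decoded != text:
            mismatches.append((text, decoded))
    return mismatches


def check_hf_tokenizer(model_name: str, texts: Iterable[str]) -> List[Tuple[str, str]]:
    """Run the round-trip check against a Hugging Face tokenizer.

    Hypothetical helper: downloads the tokenizer for `model_name`,
    so it needs network access and the `transformers` package.
    """
    from transformers import AutoTokenizer  # imported lazily on purpose

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    return find_roundtrip_mismatches(
        # Skip BOS/EOS so that only the text itself is compared.
        lambda s: tokenizer.encode(s, add_special_tokens=False),
        lambda ids: tokenizer.decode(ids),
        texts,
    )
```

Calling `check_hf_tokenizer` with the name of an affected tokenizer and a few strings containing special characters should return the pairs that do not survive the round trip, which makes it easy to see exactly where whitespace was inserted.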
Expected behavior
Encoding and then decoding a string should return exactly the same string, with no whitespace inserted around special characters.