
Models with sentencepiece tokenizers have problems with special tokens and encode/decode #28714

Closed
1 of 4 tasks
ekgren opened this issue Jan 25, 2024 · 2 comments

Comments

@ekgren
Contributor

ekgren commented Jan 25, 2024

System Info

  • transformers version: 4.35.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.1
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.0+cu121 (False)
  • Tensorflow version (GPU?): 2.15.0 (False)
  • Flax version (CPU?/GPU?/TPU?): 0.7.5 (cpu)
  • Jax version: 0.4.23
  • JaxLib version: 0.4.23
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

https://colab.research.google.com/drive/1vujbKaRkIpk7qli7eUKAZQDRksHSRW51?usp=sharing

Expected behavior

Huggingface tokenizers backed by sentencepiece have inconsistent encode/decode behaviour: if you encode and then decode a string containing special tokens, extra whitespace is inserted around them.

The expected behaviour would be to get the exact same string back.

This is present with the Llama 2 tokenizer, the gpt-sw3 tokenizers, and others.
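The effect can be sketched with a minimal pure-Python simulation of the legacy sentencepiece decoding path (the token strings and the `naive_decode` helper below are illustrative only, not the actual transformers implementation):

```python
# Sentencepiece marks word-initial pieces with "▁"; a legacy decoder that
# blindly turns every "▁" into a space inserts whitespace after special
# tokens that was never in the original string.
SPECIAL_TOKENS = {"<s>", "</s>"}

def naive_decode(tokens):
    """Simulated legacy decode: special tokens pass through unchanged,
    all other pieces have "▁" replaced by a space."""
    return "".join(
        t if t in SPECIAL_TOKENS else t.replace("▁", " ")
        for t in tokens
    )

# Encoding "<s>Hello" typically yields the pieces below; decoding them
# produces "<s> Hello" -- an extra space after the special token.
tokens = ["<s>", "▁Hello"]
print(naive_decode(tokens))  # → <s> Hello
```

With a real tokenizer the same round trip would be `tokenizer.decode(tokenizer.encode(text))`; the linked Colab notebook reproduces the behaviour on the actual models.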

@ArthurZucker
Collaborator

#26678 fixed this. I can't push everything to the Hub yet, but the Llama tokenizer will have a fix soon.
This is a duplicate of #26455

@ekgren
Contributor Author

ekgren commented Jan 29, 2024

Thank you for all the hard work @ArthurZucker, closing this issue then!

@ekgren ekgren closed this as completed Jan 29, 2024