T5 tokenizer adds whitespace after added token #26318

harshil-shah · 2023-09-21T10:56:28Z

System Info

transformers version: 4.33.2
Platform: Linux-6.2.0-33-generic-x86_64-with-glibc2.35
Python version: 3.11.5
Huggingface_hub version: 0.16.4
Safetensors version: 0.3.1
Accelerate version: 0.21.0
Accelerate config: not found
PyTorch version (GPU?): 2.0.1+cu117 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

Hi,

When adding a token to the T5 tokenizer and then tokenizing a string, it seems that the encoding step is inserting an unwanted space after the added token.

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-small")
tokenizer.add_tokens(["<"])

print(tokenizer.encode("<body>"))  # [32100, 643, 3155, 1]
print(tokenizer.decode(tokenizer.encode("<body>")))  # < body></s>
print(tokenizer.convert_ids_to_tokens(tokenizer.encode("<body>")))  # ['<', '▁body', '>', '</s>']

It's unclear why the tokenizer uses the token "▁body" when "body" is also in the vocabulary. Even if "body" weren't in the vocabulary, I'd still expect convert_ids_to_tokens to give back something like ["<", "b", "o", "d", "y", ">", "</s>"].

Expected behavior

The following script should print <body></s>.

from transformers import T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("google/flan-t5-small")
tokenizer.add_tokens(["<"])

print(tokenizer.decode(tokenizer.encode("<body>")))

I saw #24565, but it doesn't seem to have solved this case.


ArthurZucker · 2023-09-21T23:59:41Z

Hey, this is the same as #25881. The fix on the Rust side has not been done yet and is more involved. I'll try to get to it!

harshil-shah · 2023-09-22T14:56:45Z

Thank you!

ArthurZucker · 2023-10-23T09:58:15Z

Linked the fix PR 😉

ArthurZucker · 2023-11-20T14:50:43Z

The PR is in a good state and should be mergeable this week. It uncovers more "inconsistencies" between the slow and fast tokenizers, but I'll document all of this there! 😉 In the meantime, you can already do something like:

from tokenizers.pre_tokenizers import Metaspace

# ... tokenizer.from_pretrained etc.
tokenizer._tokenizer.pre_tokenizer = Metaspace(add_prefix_space=True, replacement="▁", prepend_scheme="first")

xenova · 2023-12-17T21:26:29Z

@ArthurZucker Even after following the step in your previous comment, it still seems to be producing incorrect output for certain inputs:

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('t5-base', use_fast=True)
print(tok.encode("</s>test</s>", add_special_tokens=False)) # Broken

from tokenizers.pre_tokenizers import Metaspace
tok._tokenizer.pre_tokenizer = Metaspace(add_prefix_space = True, replacement='▁', prepend_scheme = "first")
print(tok.encode("</s>test</s>", add_special_tokens=False)) # Should be fixed, but isn't

In both cases, [1, 794, 1] is printed which corresponds to ['</s>', '▁test', '</s>']... but it should be [1, 4377, 1] which corresponds to ['</s>', 'test', '</s>']. This can be achieved with the slow tokenizer with legacy set to false:

from transformers import AutoTokenizer
slow = AutoTokenizer.from_pretrained('t5-base', use_fast=False, legacy=False)
print(slow.encode("</s>test</s>", add_special_tokens=False)) # [1, 4377, 1]
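For anyone who wants to check a workaround against the slow tokenizer on more than one string, a small harness along these lines might help. It is only a sketch: `compare_encoders` is a hypothetical helper name made up here, and it takes plain callables so it does not depend on any particular tokenizer class.

```python
def compare_encoders(reference_encode, candidate_encode, texts):
    """Return (text, reference_ids, candidate_ids) for every text whose ids differ.

    Both arguments are plain callables mapping a string to a sequence of ids,
    so any tokenizer (slow, fast, patched) can be plugged in.
    """
    mismatches = []
    for text in texts:
        reference_ids = list(reference_encode(text))
        candidate_ids = list(candidate_encode(text))
        if reference_ids != candidate_ids:
            mismatches.append((text, reference_ids, candidate_ids))
    return mismatches


# Possible usage with the tokenizers from this thread (needs to download
# t5-base, so it is left as a comment here):
#
# from transformers import AutoTokenizer
# slow = AutoTokenizer.from_pretrained("t5-base", use_fast=False, legacy=False)
# fast = AutoTokenizer.from_pretrained("t5-base", use_fast=True)
# print(compare_encoders(
#     lambda t: slow.encode(t, add_special_tokens=False),
#     lambda t: fast.encode(t, add_special_tokens=False),
#     ["</s>test</s>", "Hey </s>. how are you"],
# ))
```

An empty list means the two encoders agree on every input; anything else lists the strings worth looking at.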

I've also tested saving and loading the tokenizer again (see here), but that has the same problem. I'm using tokenizers==0.15.0 and transformers==4.36.1 (latest).

It is worth noting that it does fix other problems, like "Hey </s>. how are you":

Old (incorrect): ['▁Hey', '▁', '</s>', '▁', '.', '▁how', '▁are', '▁you']
New (correct): ['▁Hey', '▁', '</s>', '.', '▁how', '▁are', '▁you']
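The Old/New difference above is consistent with a simple picture of what prepend_scheme changes. The sketch below is a toy model in pure Python, not the tokenizers implementation: `toy_metaspace` and `split_on_added_tokens` are made-up names, and the rule "with "first", only a section starting at offset 0 gets the prefix" is my assumption about the intended behaviour. It produces pre-token pieces only; the real model may split them further.

```python
import re


def split_on_added_tokens(text, added_tokens):
    """Split text into (offset, piece, is_added_token) triples."""
    if not added_tokens:
        return [(0, text, False)] if text else []
    # Try longer tokens first so overlapping added tokens match greedily.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "|".join(re.escape(tok) for tok in ordered)
    sections, pos = [], 0
    for match in re.finditer(pattern, text):
        if match.start() > pos:
            sections.append((pos, text[pos:match.start()], False))
        sections.append((match.start(), match.group(), True))
        pos = match.end()
    if pos < len(text):
        sections.append((pos, text[pos:], False))
    return sections


def toy_metaspace(text, added_tokens, prepend_scheme):
    """Toy Metaspace: prepend_scheme is "always", "first" or "never" (assumed semantics)."""
    pieces = []
    for offset, piece, is_added in split_on_added_tokens(text, added_tokens):
        if is_added:
            pieces.append(piece)  # added tokens pass through untouched
            continue
        piece = piece.replace(" ", "▁")
        prepend = prepend_scheme == "always" or (
            prepend_scheme == "first" and offset == 0
        )
        if prepend and not piece.startswith("▁"):
            piece = "▁" + piece
        # Start a new piece at every "▁".
        pieces.extend(re.findall(r"▁[^▁]*|[^▁]+", piece))
    return pieces


print(toy_metaspace("Hey </s>. how are you", ["</s>"], "always"))
print(toy_metaspace("Hey </s>. how are you", ["</s>"], "first"))
print(toy_metaspace("</s>test</s>", ["</s>"], "first"))
```

In this toy, "always" puts a "▁" in front of the "." after "</s>" and "first" does not, which lines up with the Old and New outputs. It also yields 'test' without a prefix for "</s>test</s>", i.e. the expected result, not what the fast tokenizer printed above.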

ArthurZucker · 2023-12-18T08:46:36Z

Indeed. That's a different issue, which also comes from the extract_and_normalize piece of code. I'll see if there is a quick fix. Thanks for reporting!

ArthurZucker · 2023-12-18T09:05:35Z

Also note that the template processors usually use this:

transformers/src/transformers/models/llama/tokenization_llama_fast.py

Line 160 in e6dcf8a

single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"

with a prefix space before the sequence.
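To make the quoted line easier to read, here is what that f-string evaluates to for each flag combination. This is only an illustration: the function name `single_template` and the token strings "<s>" and "</s>" are placeholders picked for this sketch, not values read from the linked file.

```python
def single_template(bos, eos, add_bos_token, add_eos_token):
    """Build the single-sequence template string with the same f-string as the quoted line."""
    return f"{(bos + ':0 ') if add_bos_token else ''}$A:0{(' ' + eos + ':0') if add_eos_token else ''}"


# Placeholder special tokens, assumed for the example only.
for add_bos in (True, False):
    for add_eos in (True, False):
        print(add_bos, add_eos, repr(single_template("<s>", "</s>", add_bos, add_eos)))
```

As far as I understand the template syntax, the spaces in the resulting string separate the template pieces (special tokens and the $A sequence placeholder).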

xenova · 2023-12-18T11:44:07Z

> Also note that the template processors usually use this:
>
> transformers/src/transformers/models/llama/tokenization_llama_fast.py
>
> Line 160 in e6dcf8a
>
> single = f"{(bos+':0 ') if self.add_bos_token else ''}$A:0{(' '+eos+':0') if self.add_eos_token else ''}"
>
> with a prefix space before the sequence.

Even with add_special_tokens=False? 👀

ArthurZucker self-assigned this Sep 21, 2023

ArthurZucker mentioned this issue Sep 28, 2023: Inconsistency between fast and slow codellama tokenizers #26455 (Closed)

ArthurZucker mentioned this issue Oct 23, 2023: [Core Tokenization] Support a fix for spm fast models #26678 (Merged)

ArthurZucker mentioned this issue Nov 6, 2023: Inconsistent results between LlamaTokenizer and LlamaTokenizerFast #27230 (Closed)

ArthurZucker mentioned this issue Jan 2, 2024: Tokenizer adds an additional space after the added token #28218 (Open)

ArthurZucker closed this as completed in #26678 Jan 18, 2024
