T5 tokenizer adds whitespace after added token #26318
Comments
Hey, this is the same as #25881, the fix to [...]
Thank you!
Linked the fix PR 😉
The PR is in a good state and should be mergeable this week. It uncovers more "inconsistencies" between slow and fast, but I'll document all of this there! 😉 You can already do something like:

```python
from tokenizers.pre_tokenizers import Metaspace

# ... tokenizer = AutoTokenizer.from_pretrained(...) etc.
tokenizer._tokenizer.pre_tokenizer = Metaspace(add_prefix_space=True, replacement="▁", prepend_scheme="first")
```
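The effect of `prepend_scheme` can be sketched in plain Python. This is a simplified, hypothetical model, not the actual `tokenizers` implementation; `metaspace` and the `segments` list are illustrative names:

```python
# Hypothetical model of Metaspace pre-tokenization: spaces become '▁', and
# prepend_scheme controls which segments get a leading '▁' re-inserted.
REPLACEMENT = "\u2581"  # '▁'

def metaspace(segments, prepend_scheme="first"):
    out = []
    for i, seg in enumerate(segments):
        seg = seg.replace(" ", REPLACEMENT)
        if prepend_scheme == "always" or (prepend_scheme == "first" and i == 0):
            if not seg.startswith(REPLACEMENT):
                seg = REPLACEMENT + seg
        out.append(seg)
    return out

# With "always", the segment after an added/special token also gets a phantom
# leading space; with "first", only the very first segment of the input does.
print(metaspace(["Hello", "friend"], "always"))  # ['▁Hello', '▁friend']
print(metaspace(["Hello", "friend"], "first"))   # ['▁Hello', 'friend']
```

Under this toy model, "always" is what produces the stray `▁` after an added token, and "first" is what the fix switches to.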
@ArthurZucker Even after following the step in your previous comment, it still seems to produce incorrect output for certain inputs:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('t5-base', use_fast=True)
print(tok.encode("</s>test</s>", add_special_tokens=False))  # Broken

from tokenizers.pre_tokenizers import Metaspace
tok._tokenizer.pre_tokenizer = Metaspace(add_prefix_space=True, replacement='▁', prepend_scheme="first")
print(tok.encode("</s>test</s>", add_special_tokens=False))  # Should be fixed, but isn't
```

In both cases, the output does not match the slow tokenizer:

```python
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained('t5-base', use_fast=False, legacy=False)
print(slow.encode("</s>test</s>", add_special_tokens=False))  # [1, 4377, 1]
```

I've also tested saving and loading the tokenizer again (see here), but that has the same problem. I'm using [...] It is worth noting that it does fix other problems, like [...]
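For context, fast tokenizers split out special tokens before pre-tokenization, so the `"test"` segment in `"</s>test</s>"` is not input-initial, which is exactly where `prepend_scheme` matters. A rough, hypothetical sketch of that split (the `SPECIALS` table and `split_on_specials` are illustrative; id 1 is `</s>` in the T5 vocab, matching the slow output above):

```python
import re

# Special tokens are matched verbatim and mapped straight to their ids, so
# only the plain-text segments ever reach the Metaspace pre-tokenizer.
SPECIALS = {"</s>": 1}

def split_on_specials(text):
    pattern = "(" + "|".join(re.escape(t) for t in SPECIALS) + ")"
    return [s for s in re.split(pattern, text) if s]

print(split_on_specials("</s>test</s>"))  # ['</s>', 'test', '</s>']
```

Only the middle segment is pre-tokenized; whether it receives a leading `▁` depends on the scheme, which is why fast and slow can disagree on this input.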
|
Indeed. That's a different issue which also comes from the [...]
Also note that the template processors usually use this: [...] with a prefix space before the sequence.
Even with [...]
System Info
- transformers version: 4.33.2

Who can help?

@ArthurZucker

Information

Tasks

- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Hi,

When adding a token to the T5 tokenizer and then tokenizing a string, it seems that the encoding step inserts an unwanted space after the added token.

It's unclear why the model uses the token `"▁body"` when `"body"` is also in the vocabulary. And even if `"body"` weren't in the vocabulary, I'd still expect `convert_ids_to_tokens` to give back something like `["<", "b", "o", "d", "y", ">", "</s>"]`.

Expected behavior

The following script should print `<body></s>`.

I saw #24565 but this doesn't seem to have solved it for this case?
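A plausible sketch of the mechanism behind the extra space described above (hypothetical, not the actual transformers code): the input is split on the added token, and each remaining chunk is then encoded as if it began a fresh sentence, so SentencePiece's prefix space gets glued on. The `legacy_tokenize` helper and `<body>` token are illustrative:

```python
import re

# Toy model of the legacy behavior: split on the added token, then give each
# remaining chunk the "add a leading space" treatment, producing '▁body'
# where 'body' was expected.
def legacy_tokenize(text, added_token):
    pieces = []
    for chunk in re.split(f"({re.escape(added_token)})", text):
        if chunk == added_token:
            pieces.append(chunk)
        elif chunk:
            pieces.append("\u2581" + chunk.replace(" ", "\u2581"))
    return pieces

print(legacy_tokenize("<body>body", "<body>"))  # ['<body>', '▁body']
```

Under this model, the chunk after `<body>` is wrongly treated as sentence-initial, which is consistent with seeing `"▁body"` instead of `"body"` in the decoded output.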