PreTrainedTokenizerFast with tokenizer object is acting on original tokenizer object #18406

Closed · 2 of 4 tasks

YBooks opened this issue Aug 1, 2022 · 4 comments

YBooks (Contributor) commented Aug 1, 2022

System Info

  • transformers version: 4.21.0
  • Platform: Windows-10-10.0.19041-SP0
  • Python version: 3.9.2
  • Huggingface_hub version: 0.8.1
  • PyTorch version (GPU?): 1.8.1+cpu (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: no
  • Using distributed or parallel set-up in script?: no

Who can help?

@LysandreJik

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  • To reproduce this error, we create a tokenizer (which we will wrap in PreTrainedTokenizerFast below), enable padding and truncation, and check its output:
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers
data = [
    "My first sentence",
    "My second sentence",
    "My third sentence is a bit longer",
    "My fourth sentence is longer than the third one"
]

tokenizer = Tokenizer(models.WordLevel(unk_token="<unk>"))
trainer = trainers.WordLevelTrainer(vocab_size=10, special_tokens=["<unk>", "<pad>"])
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.train_from_iterator(data, trainer=trainer)

tokenizer.enable_padding(pad_token="<pad>", pad_id=tokenizer.token_to_id("<pad>"))
tokenizer.enable_truncation(max_length=5)
print(tokenizer.encode(data[-1]).ids, tokenizer.padding)

This prints an encoding of length 5 and an explicit padding configuration.

  • On the other hand, if we load our tokenizer into the PreTrainedTokenizerFast class, call it, and print the same things as before:
from transformers import PreTrainedTokenizerFast

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer(data)

print(tokenizer.encode(data[-1]).ids, tokenizer.padding)

This now prints an encoding with length > 5 and None for padding: calling the wrapper has modified the original tokenizer object's padding and truncation settings.
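
Putting the two snippets together, a minimal before/after check makes the side effect explicit (a sketch that continues from the code above; the exact token ids depend on the trained vocabulary, so only lengths and settings are compared):

# Sketch: continues from the snippets above, where `tokenizer` is the
# tokenizers.Tokenizer with padding/truncation enabled and `data` is the
# list of sentences.
from transformers import PreTrainedTokenizerFast

ids_before = tokenizer.encode(data[-1]).ids    # length 5 (truncated)
padding_before = tokenizer.padding             # explicit padding dict

fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
fast_tokenizer(data)                           # calling the wrapper ...

ids_after = tokenizer.encode(data[-1]).ids     # ... now longer than 5
padding_after = tokenizer.padding              # ... and None

print(len(ids_before), padding_before)
print(len(ids_after), padding_after)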

Expected behavior

The tokenizer should behave the same as it did before being loaded into the PreTrainedTokenizerFast wrapper; wrapping it should not impact its padding and truncation settings.
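
As a possible workaround until this is fixed (not part of the original report, just a sketch), the backend tokenizer can be copied before wrapping so that the wrapper cannot mutate the original object. This assumes the installed tokenizers version supports copying/pickling the Tokenizer; if not, Tokenizer.from_str(tokenizer.to_str()) achieves the same round trip.

import copy
from transformers import PreTrainedTokenizerFast

# Hand the wrapper a deep copy so any padding/truncation changes it makes
# do not leak back into the original `tokenizer` defined above.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=copy.deepcopy(tokenizer))
fast_tokenizer(data)

# The original tokenizer keeps its padding and truncation settings.
print(tokenizer.encode(data[-1]).ids, tokenizer.padding)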

LysandreJik (Member) commented:

cc @SaulLu

SaulLu (Contributor) commented Aug 9, 2022

Hi @YBooks

Thank you very much for the detailed issue 🤗!

I see that you have already proposed a fix that has been merged and that solves the problem you are pointing out. If you are happy with it, is it ok if we close this issue?

YBooks (Contributor, Author) commented Aug 10, 2022

Hey @SaulLu,
Yes, sure. My pleasure!

YBooks closed this as completed Aug 10, 2022
maclandrol commented Sep 19, 2022

@YBooks, @SaulLu, @sgugger, can we reopen this issue, since #18408 creates another one?
