
Error when loading tokenizer from a file: data did not match any variant of untagged enum ModelWrapper #1297

Closed
delgermurun opened this issue Jul 18, 2023 · 3 comments


@delgermurun

Here is the reproducible script:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split

# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
t = """First Citizen:
Before we proceed any further, hear me speak.

..."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
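# Note: Split treats a plain str pattern as a literal string to split on;
# to split on a regular expression, wrap the pattern in tokenizers.Regex(...).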
tokenizer.pre_tokenizer = Split(r"\w+|[^\w\s]+", behavior="isolated")

tokenizer.train_from_iterator(
    iterator=[t],
    trainer=trainer,
)

tokenizer.save("tokenizer.json")

It works fine if I use the trained tokenizer directly (without loading it from the file):

print(tokenizer.encode("""especially       against Caius Marcius?

All:
Against""").tokens)

Output: ['es', 'p', 'ec', 'i', 'all', 'y ', ' ', ' ', ' ', ' ', ' ', ' a', 'gainst ', 'Caius Marc', 'i', 'us', '?\n\nAll:\n', 'A', 'gain', 'st']

But loading the tokenizer from the file fails:

tokenizer = Tokenizer.from_file("tokenizer.json")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[88], line 1
----> 1 tokenizer = Tokenizer.from_file("tokenizer.json")

Exception: data did not match any variant of untagged enum ModelWrapper at line 382 column 3

Version: tokenizers==0.13.3
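
For anyone debugging the same failure, here is a minimal sketch (assuming the tokenizer.json saved above sits in the working directory) that prints the JSON around the position named in the exception:

# Minimal sketch: show the lines of tokenizer.json around the position
# reported in the exception ("line 382 column 3"), to see which part of
# the serialized model serde rejects. Adjust err_line to match your error.
with open("tokenizer.json", encoding="utf-8") as f:
    lines = f.readlines()

err_line = 382  # 1-indexed line number taken from the exception text
for num, text in enumerate(lines[err_line - 3 : err_line + 2], start=err_line - 2):
    print(f"{num:5d}  {text.rstrip()}")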

@delgermurun (Author)

#909 works for me! I'll go with this PR for now. Thank you @Narsil.

@Narsil (Collaborator) commented Jul 18, 2023

Perfect, closing this for now.

Once the awesome model you're building gets merged into transformers, we'll merge #909 to get it included!

@Narsil closed this as completed Jul 18, 2023
@jiaohuix

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

It works for me.
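
A quick round-trip check, as a sketch: retrain with the Whitespace pre-tokenizer, save, and reload to confirm Tokenizer.from_file no longer raises (t and trainer are reused from the reproduction script above; tokenizer_ws.json is an arbitrary filename):

# Sketch: train with the Whitespace pre-tokenizer, save, and reload.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
tokenizer.train_from_iterator(iterator=[t], trainer=trainer)
tokenizer.save("tokenizer_ws.json")

reloaded = Tokenizer.from_file("tokenizer_ws.json")  # should not raise
print(reloaded.encode("Against Caius Marcius?").tokens)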
