
[Tokenizer Serialization] Fix the broken serialisation #27099

Merged: 8 commits into huggingface:main on Dec 13, 2023

Conversation

@ArthurZucker (Collaborator) commented Oct 27, 2023

What does this PR do?

This should fix some serialization issues: mostly save_pretrained saving all the init kwargs, and from_pretrained handling special tokens stored as dicts.
fixes #26732
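
For context, a save/load round trip of the kind this PR is about might look like the sketch below; the gpt2 checkpoint, the added <pad> token, and the assertions are illustrative and not taken from the PR's tests:

```python
# Hedged sketch of a save_pretrained/from_pretrained round trip for a tokenizer
# whose special token was created as an AddedToken (illustrative names only).
from tempfile import TemporaryDirectory

from transformers import AddedToken, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
tokenizer.add_special_tokens({"pad_token": AddedToken("<pad>", lstrip=True)})

with TemporaryDirectory() as tmp_dir:
    tokenizer.save_pretrained(tmp_dir)  # should serialize the AddedToken kwargs too
    reloaded = AutoTokenizer.from_pretrained(tmp_dir, use_fast=False)

# The token and its id should survive the round trip unchanged.
assert reloaded.pad_token == "<pad>"
assert reloaded.pad_token_id == tokenizer.pad_token_id
```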

With main:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("EleutherAI/Llemma_7b", use_fast=False)
File ~/Work/transformers/src/transformers/tokenization_utils_base.py:2253, in PreTrainedTokenizerBase._from_pretrained(cls, resolved_vocab_files, pretrained_model_name_or_path, init_configuration, token, cache_dir, local_files_only, _commit_hash, _is_local, *init_inputs, **kwargs)
   2251     if added_tokens_map != {} and init_kwargs[key] is not None:
   2252         if key != "additional_special_tokens":
-> 2253             init_kwargs[key] = added_tokens_map.get(init_kwargs[key], init_kwargs[key])
   2255 init_kwargs["added_tokens_decoder"] = added_tokens_decoder
   2256 # convert {'__type': 'AddedToken', 'content': '<ent>', 'lstrip': False, 'normalized': True, ...} to AddedTokens

TypeError: unhashable type: 'dict'

This is because the tokenizer had its special tokens saved as dicts, and the call to convert_added_tokens is only made after this lookup.
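
A stripped-down illustration of the error, with a made-up map rather than the real internals of _from_pretrained:

```python
# A special token loaded from an old tokenizer_config.json arrives as a plain
# dict; a dict is unhashable, so using it as a lookup key raises TypeError.
from transformers import AddedToken

added_tokens_map = {"</s>": AddedToken("</s>")}  # simplified stand-in

serialized_eos = {"__type": "AddedToken", "content": "</s>", "lstrip": False}
# added_tokens_map.get(serialized_eos)  # TypeError: unhashable type: 'dict'

# Converting the dict form to an AddedToken (or its content string) before the
# lookup, which is what reordering the conversion achieves, avoids the error:
eos = AddedToken(**{k: v for k, v in serialized_eos.items() if k != "__type"})
resolved = added_tokens_map.get(str(eos), eos)
```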


@ArthurZucker ArthurZucker marked this pull request as ready for review November 23, 2023 13:37
@LysandreJik (Member) left a comment

Thank you @ArthurZucker

@ArthurZucker (Collaborator, Author) commented

Pegasus is the only slow-test failure I saw, so I'm checking it now before merging!

@ArthurZucker (Collaborator, Author) commented

OK, the issue is that when we force the added tokens encoder in the slow tokenizer, the fast tokenizer of course can't do the same, so the eos token gets replaced at index 0 in the slow tokenizer but not in the fast one.
I will update the fix to force the default vocab to the default tokens.
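
A hedged sketch of the kind of slow/fast consistency check behind the Pegasus failure described above; the checkpoint name is illustrative:

```python
# Both the slow and the fast tokenizer should agree on the eos token and on
# where it lives in the vocab.
from transformers import AutoTokenizer

slow = AutoTokenizer.from_pretrained("google/pegasus-xsum", use_fast=False)
fast = AutoTokenizer.from_pretrained("google/pegasus-xsum", use_fast=True)

assert slow.eos_token == fast.eos_token
assert slow.convert_tokens_to_ids(slow.eos_token) == fast.convert_tokens_to_ids(fast.eos_token)
```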

@ArthurZucker ArthurZucker merged commit 230ac35 into huggingface:main Dec 13, 2023
21 checks passed
ArthurZucker added a commit that referenced this pull request Dec 14, 2023
* nits

* nits

* actual fix

* style

* ze fix

* fix fix fix style
iantbutler01 pushed a commit to BismuthCloud/transformers that referenced this pull request Dec 16, 2023
staghado pushed a commit to staghado/transformers that referenced this pull request Jan 15, 2024
Successfully merging this pull request may close: Error while saving checkpoint during training (#26732)