
AutoTokenizer.from_pretrained with add_special_tokens=True cannot be serialized back #34557

Closed
lfoppiano opened this issue Nov 1, 2024 · 4 comments
lfoppiano commented Nov 1, 2024

System Info

The problem is present from transformers==4.37.2 up to and including the latest release, 4.46.1.

I initially thought it was related to #31233, but that PR did not solve it; #33453 also seems related.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_special_tokens=True, max_length=512, add_prefix_space=True)
  2. t.save_pretrained("~/Downloads/")

Expected behavior

save_pretrained should not throw TypeError: Object of type method is not JSON serializable.

Full stack:

Traceback (most recent call last):
  File "/Users/lfoppiano/Applications/PyCharm Professional Edition.app/Contents/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2431, in save_pretrained
    idx = serialized_tokens.pop("id")
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type method is not JSON serializable
@lfoppiano lfoppiano added the bug label Nov 1, 2024
@Rocketknight1 (Member) commented:

Hi @lfoppiano, I believe the problem is that setting add_special_tokens at init time is not supported because it clashes with the add_special_tokens method. When I run this code on main, I get:

AttributeError: add_special_tokens conflicts with the method add_special_tokens in RobertaTokenizerFast

This check was introduced in #31233 as you mentioned, so I'm not sure why you didn't get that error. Can you try without add_special_tokens in the init?
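
For context, add_special_tokens is already the name of a method on the tokenizer object (the one used to register new special tokens), which is why an init kwarg with the same name collides with it. Roughly (the "<obs>" token is just an arbitrary example):

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

# `add_special_tokens` already exists as a method on the tokenizer;
# it registers extra special tokens, e.g.:
t.add_special_tokens({"additional_special_tokens": ["<obs>"]})

# Passing add_special_tokens=... to from_pretrained would shadow this
# method, which the check introduced in #31233 now rejects at init time.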

@lfoppiano (Author) commented:

Hi @Rocketknight1, thanks for your quick answer. If I don't specify the parameter it works; however, what should I do to make sure the tokenization is the same as before? In previous versions, add_special_tokens was passed as a flag at init and it worked fine.

@Rocketknight1 (Member) commented:

Hi @lfoppiano, you can pass add_special_tokens when calling the tokenizer instead. However, the default value is True, so you only need to do that when you want to set add_special_tokens=False!
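
For example, something along these lines (a rough sketch; the local save directory is just a placeholder):

from transformers import AutoTokenizer

# Initialize without add_special_tokens; it is no longer accepted at init time
t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

# Special tokens are added by default, so this matches the old behaviour
enc = t("Hello world", max_length=512, truncation=True)

# Only pass the flag at call time when you want to turn it off
enc_plain = t("Hello world", add_special_tokens=False)

# save_pretrained now serializes cleanly
t.save_pretrained("roberta-tokenizer-local")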

@lfoppiano (Author) commented:

Thanks @Rocketknight1, that was very quick and helpful. We can close this issue. 😄
