
AutoTokenizer.from_pretrained with add_special_tokens=True cannot be serialized back #34557

Closed
lfoppiano opened this issue Nov 1, 2024 · 4 comments
lfoppiano commented Nov 1, 2024

System Info

The problem is present from transformers==4.37.2 up to and including the latest release, 4.46.1.

I initially thought it was related to #31233, but that PR did not solve it; #33453 also seems related.

Who can help?

@ArthurZucker @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_special_tokens=True, max_length=512, add_prefix_space=True)
  2. t.save_pretrained("~/Downloads/")

Expected behavior

save_pretrained should not throw TypeError: Object of type method is not JSON serializable.

Full stack:

Traceback (most recent call last):
  File "/Users/lfoppiano/Applications/PyCharm Professional Edition.app/Contents/plugins/python-ce/helpers/pydev/pydevconsole.py", line 364, in runcode
    coro = func()
  File "<input>", line 1, in <module>
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2431, in save_pretrained
    idx = serialized_tokens.pop("id")
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 201, in encode
    chunks = list(chunks)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 431, in _iterencode
    yield from _iterencode_dict(o, _current_indent_level)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 405, in _iterencode_dict
    yield from chunks
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 438, in _iterencode
    o = _default(o)
  File "/Users/lfoppiano/anaconda3/envs/delft2/lib/python3.8/json/encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type method is not JSON serializable
@lfoppiano lfoppiano added the bug label Nov 1, 2024
@Rocketknight1 (Member) commented:

Hi @lfoppiano, I believe the problem is that setting add_special_tokens at init time is not supported because it clashes with the add_special_tokens method. When I run this code on main, I get:

AttributeError: add_special_tokens conflicts with the method add_special_tokens in RobertaTokenizerFast

This check was introduced in #31233 as you mentioned, so I'm not sure why you didn't get that error. Can you try without add_special_tokens in the init?
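
For context, add_special_tokens is already the name of a method on the tokenizer object (the one used to register new special tokens), which is why an init kwarg with the same name collides with it. Roughly (the "<obs>" token is just an arbitrary example):

from transformers import AutoTokenizer

t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")

# `add_special_tokens` already exists as a method on the tokenizer;
# it registers extra special tokens, e.g.:
t.add_special_tokens({"additional_special_tokens": ["<obs>"]})

# Passing add_special_tokens=... to from_pretrained would shadow this
# method, which the check introduced in #31233 now rejects at init time.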

@lfoppiano (Author) commented:

Hi @Rocketknight1, thanks for your quick answer. If I don't specify the parameter it works; however, what should I do to make sure the tokenization is the same as before? In previous versions, add_special_tokens was passed as a flag at init and it worked fine.

@Rocketknight1 (Member) commented:

Hi @lfoppiano, you can pass add_special_tokens when calling the tokenizer instead. However, the default value is True, so you only need to do that when you want to set add_special_tokens=False!
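
For example, something along these lines (a rough sketch; the local save directory is just a placeholder):

from transformers import AutoTokenizer

# Initialize without add_special_tokens; it is no longer accepted at init time
t = AutoTokenizer.from_pretrained("FacebookAI/roberta-base", add_prefix_space=True)

# Special tokens are added by default, so this matches the old behaviour
enc = t("Hello world", max_length=512, truncation=True)

# Only pass the flag at call time when you want to turn it off
enc_plain = t("Hello world", add_special_tokens=False)

# save_pretrained now serializes cleanly
t.save_pretrained("roberta-tokenizer-local")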

@lfoppiano (Author) commented:

Thanks @Rocketknight1, that was very quick and helpful. We can close this issue. 😄
