
Cannot load codeqwen Tokenizer #30324

Closed
1 of 4 tasks
zch-cc opened this issue Apr 18, 2024 · 7 comments
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@zch-cc

zch-cc commented Apr 18, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I cannot load the CodeQwen model with AutoTokenizer.
CodeQwen came out this week and I would like to use it: https://huggingface.co/Qwen/CodeQwen1.5-7B
But I run into this error when loading the tokenizer:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 12564 column 3

According to ggerganov/llama.cpp#6707, the difference between CodeQwen and Qwen1.5 is that they use different tokenizers; CodeQwen's is based on SentencePiece.
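
To see exactly which pre_tokenizer entry the Rust deserializer is rejecting, here is a minimal inspection sketch (it assumes huggingface_hub is installed and downloads only tokenizer.json):

import json
from huggingface_hub import hf_hub_download

# Fetch just the tokenizer config, bypassing transformers entirely
path = hf_hub_download("Qwen/CodeQwen1.5-7B-Chat", "tokenizer.json")
with open(path, encoding="utf-8") as f:
    config = json.load(f)

# The traceback above points at the pre_tokenizer section of this file
print(json.dumps(config["pre_tokenizer"], indent=2))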

Expected behavior

This issue may belong under new model support, but I think changing the tokenizer will work. The tokenizer should load successfully.

@amyeroberts
Collaborator

Hi @zch-cc, thanks for reporting!

I'm able to replicate. From bisecting, it seems to be coming from #30289 cc @Narsil @ArthurZucker. It happens even when upgrading tokenizers.
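
For anyone curious how a regression like this gets tracked down, a rough bisect sketch (it assumes a transformers source checkout and a hypothetical repro.py containing the two-line loading snippet from the issue):

$ git bisect start v4.40.0 v4.39.3
$ git bisect run sh -c 'pip install -q -e . && python repro.py'
# any nonzero exit (i.e. the tokenizer exception) marks the commit as bad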

@sdadas

sdadas commented Apr 19, 2024

This problem also affects other models, for example https://huggingface.co/sdadas/mmlw-retrieval-roberta-large
One thing the models have in common is the use of Metaspace pre-tokenizer, so this could be the source of the error.
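
As a quick illustration of what Metaspace does (a sketch; the prepend_scheme argument assumes a recent tokenizers release, older versions take add_prefix_space instead):

from tokenizers.pre_tokenizers import Metaspace

# Metaspace swaps spaces for the SentencePiece ▁ (U+2581) marker before tokenization
pre = Metaspace(replacement="▁", prepend_scheme="always")
print(pre.pre_tokenize_str("Hello world"))
# pieces come back prefixed with ▁, e.g. [('▁Hello', ...), ('▁world', ...)]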

@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Apr 19, 2024
@wertyac

wertyac commented Apr 20, 2024

Same issue for me with CodeQwen 1.5 7B.

@yungchentang

same issue!

@fdeh75

fdeh75 commented Apr 22, 2024

Quick and dirty solution: downgrade tokenizers

requirements.txt:

tokenizers==0.15.2
transformers<4.40.0
$ pip install -r requirements.txt

It worked for me
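
To confirm the pins took effect, a quick check along these lines should now load without the enum error:

$ python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('Qwen/CodeQwen1.5-7B-Chat')"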

@ArthurZucker
Collaborator

Mmm, there seems to be something wrong with the serialization / deserialization.
Let me check what I can do here.
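
To isolate the tokenizers side, a round-trip sketch (assuming a local copy of the model's tokenizer.json); if deserialization is at fault, the first call should raise the same untagged-enum error as above:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # local copy of the model's file
tok.save("roundtrip.json")  # re-serialize with the installed tokenizers version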

@ArthurZucker
Collaborator

I think the authors fixed the format 🤗 It was probably serialized with neither 0.15.2 nor 0.19.1.

hiyouga referenced this issue in hiyouga/LLaMA-Factory Apr 29, 2024