
Cannot load codeqwen Tokenizer #30324

Closed
1 of 4 tasks
zch-cc opened this issue Apr 18, 2024 · 7 comments
Labels
Core: Tokenization Internals of the library; Tokenization.

Comments

@zch-cc

zch-cc commented Apr 18, 2024

System Info

  • transformers version: 4.40.0
  • Platform: Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I cannot load the CodeQwen model with AutoTokenizer.
CodeQwen came out this week and I would like to use it: https://huggingface.co/Qwen/CodeQwen1.5-7B
But I run into this error when loading the tokenizer:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/CodeQwen1.5-7B-Chat")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 862, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
    return cls._from_pretrained(
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/chonghao/.local/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 111, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 12564 column 3

According to ggerganov/llama.cpp#6707, the difference between CodeQwen and Qwen1.5 is that they use different tokenizers; CodeQwen's is based on SentencePiece.
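
To see exactly which pre_tokenizer entry the Rust deserializer is rejecting, here is a minimal inspection sketch (it assumes huggingface_hub is installed and downloads only tokenizer.json):

import json
from huggingface_hub import hf_hub_download

# Fetch just the tokenizer config, bypassing transformers entirely
path = hf_hub_download("Qwen/CodeQwen1.5-7B-Chat", "tokenizer.json")
with open(path, encoding="utf-8") as f:
    config = json.load(f)

# The traceback above points at the pre_tokenizer section of this file
print(json.dumps(config["pre_tokenizer"], indent=2))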

Expected behavior

This issue may belong under new model support, but I think changing the tokenizer will work. The tokenizer should load successfully.

@amyeroberts
Collaborator

Hi @zch-cc, thanks for reporting!

I'm able to replicate. From bisecting, it seems to be coming from #30289 cc @Narsil @ArthurZucker. It happens even when upgrading tokenizers.
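
For anyone curious how a regression like this gets tracked down, a rough bisect sketch (it assumes a transformers source checkout and a hypothetical repro.py containing the two-line loading snippet from the issue):

$ git bisect start v4.40.0 v4.39.3
$ git bisect run sh -c 'pip install -q -e . && python repro.py'
# any nonzero exit (i.e. the tokenizer exception) marks the commit as bad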

@sdadas

sdadas commented Apr 19, 2024

This problem also affects other models, for example https://huggingface.co/sdadas/mmlw-retrieval-roberta-large
One thing the models have in common is the use of Metaspace pre-tokenizer, so this could be the source of the error.
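
As a quick illustration of what Metaspace does (a sketch; the prepend_scheme argument assumes a recent tokenizers release, older versions take add_prefix_space instead):

from tokenizers.pre_tokenizers import Metaspace

# Metaspace swaps spaces for the SentencePiece ▁ (U+2581) marker before tokenization
pre = Metaspace(replacement="▁", prepend_scheme="always")
print(pre.pre_tokenize_str("Hello world"))
# pieces come back prefixed with ▁, e.g. [('▁Hello', ...), ('▁world', ...)]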

@amyeroberts amyeroberts added the Core: Tokenization Internals of the library; Tokenization. label Apr 19, 2024
@wertyac

wertyac commented Apr 20, 2024

Same issue for me with CodeQwen 1.5 7B.

@yungchentang

same issue!

@fdeh75

fdeh75 commented Apr 22, 2024

Quick and dirty solution: downgrade tokenizers

requirements.txt:

tokenizers==0.15.2
transformers<4.40.0
$ pip install -r requirements.txt

It worked for me
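
To confirm the pins took effect, a quick check along these lines should now load without the enum error:

$ python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('Qwen/CodeQwen1.5-7B-Chat')"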

@ArthurZucker
Collaborator

Mmm, there seems to be something wrong with the serialization / deserialization.
Let me check what I can do here.
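
To isolate the tokenizers side, a round-trip sketch (assuming a local copy of the model's tokenizer.json); if deserialization is at fault, the first call should raise the same untagged-enum error as above:

from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer.json")  # local copy of the model's file
tok.save("roundtrip.json")  # re-serialize with the installed tokenizers version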

@ArthurZucker
Collaborator

I think the authors fixed the format 🤗 It was probably serialized with neither 0.15.2 nor 0.19.1.

hiyouga referenced this issue in hiyouga/LLaMA-Factory Apr 29, 2024