
Bug: cannot find tokenizer merges in model file #9692

Closed
nd791899 opened this issue Sep 30, 2024 · 11 comments
Labels
bug · high priority · high severity

Comments

@nd791899

What happened?

When I use transformers==4.45.1 and convert a model with llama.cpp to the GGUF file used by ollama, there is no error, but when I load the model with ollama, the error "cannot find tokenizer merges in model file" appears.

Name and Version

All versions

What operating system are you seeing the problem on?

No response

Relevant log output

No response

@nd791899 nd791899 added bug-unconfirmed and high severity labels Sep 30, 2024
@patrick60507

Same problem.

@nd791899
Author

In gguf-py/gguf/vocab.py, add_to_gguf(self, gw: GGUFWriter, quiet: bool = False) reports:

Adding merges requested but no merges found, output may be non-functional.

And in the _try_load_from_tokenizer_json function:

def _try_load_from_tokenizer_json(self, path: Path) -> bool:
    tokenizer_file = path / 'tokenizer.json'
    if tokenizer_file.is_file():
        with open(tokenizer_file, encoding = 'utf-8') as f:
            tokenizer = json.load(f)
        if self.load_merges:
            merges = tokenizer.get('model', {}).get('merges')
            if isinstance(merges, list) and merges and isinstance(merges[0], str):
                self.merges = merges
        added_tokens = tokenizer.get('added_tokens', {})

The isinstance(merges[0], str) check only matches the old format, but the tokenizer.json generated by transformers==4.45.1 stores each merge as a list rather than a string.

Can this be made compatible?
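For reference, a quick way to check which format a given tokenizer.json uses (a minimal sketch; the model directory path is a placeholder):

import json
from pathlib import Path

# Placeholder path -- point this at the directory holding your converted model.
tokenizer_file = Path('path/to/model') / 'tokenizer.json'

with open(tokenizer_file, encoding='utf-8') as f:
    tokenizer = json.load(f)

merges = tokenizer.get('model', {}).get('merges', [])
if merges:
    # Old format: each entry is a single string such as 'h e'.
    # New format (transformers >= 4.45): each entry is a pair such as ['h', 'e'].
    print(type(merges[0]).__name__, merges[0])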

@Vaibhavs10
Collaborator

Vaibhavs10 commented Sep 30, 2024

Hey hey, I'm VB from the open source team at Hugging Face. I can confirm that this is due to an update we've made to tokenizers: we now persist merges as lists instead of strings.

Everything should work on transformers 4.44.0; from 4.45.0 onward it won't, and we'd need to add support for the new format.

For reference this is the tokenizers PR that introduced it: huggingface/tokenizers#909

@Vaibhavs10 Vaibhavs10 added bug and removed bug-unconfirmed labels Sep 30, 2024
@danielhanchen
Contributor

There are some temporary fixes for Unsloth users which downgrade transformers to 4.44.2: unslothai/unsloth#1065 and unslothai/unsloth#1062

@ggerganov
Owner

Tagging @compilade for any insights on how best to resolve this.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

A couple of repos for testing:

The difference is the way merges are serialized in the tokenizer.json file. Each merge pair used to be a single string with a space separating the two tokens, but now each pair is saved as an array.
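Concretely, with made-up merges for illustration, the two shapes look like this:

# Old serialization: each merge is one space-separated string.
old_merges = ['h e', 'he llo']

# New serialization: each merge is a two-element list.
new_merges = [['h', 'e'], ['he', 'llo']]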

@ggerganov
Owner

ggerganov commented Sep 30, 2024

@pcuenca Thanks, I confirm that if I update to transformers 4.45 the conversion of these models succeeds using convert_hf_to_gguf.py. Without upgrading, it fails.

I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers? The latter seems the obvious solution to me, but I could be missing something.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

In my opinion, upgrading transformers is easier.

@Vaibhavs10
Collaborator

Vaibhavs10 commented Sep 30, 2024

Opened a PR to update the transformers version in the short term: #9694
(the CI errors look like warnings; not sure what to do about them)

We tested it with the new format and the old format:

  1. (new tokenisers format) https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization-Q8_0-GGUF
  2. (old tokenisers format) https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-Q8_0-GGUF

@compilade
Collaborator

Upgrading to transformers 4.45 likely isn't enough; gguf.SpecialVocab(dir_model, load_merges=True) only works with the old format while silently ignoring everything else:

if self.load_merges:
    merges = tokenizer.get('model', {}).get('merges')
    if isinstance(merges, list) and merges and isinstance(merges[0], str):
        self.merges = merges

I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers?

To support the new format with older versions of transformers, we would need to avoid using AutoTokenizer.from_pretrained and/or fall back to full manual parsing of tokenizer.json. But that would not work with the current pre-tokenizer autodetection, which relies on tokenizing strings.

So transformers has to be updated to 4.45 and gguf-py/gguf/vocab.py needs to be adapted to the new serialization, as in #9696
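For illustration, a compatibility shim in vocab.py could accept both shapes by joining the new pair form back into the space-separated string the rest of the code expects (a minimal sketch, not the actual patch in #9696; tokens that themselves contain spaces would need extra care):

if self.load_merges:
    merges = tokenizer.get('model', {}).get('merges')
    if isinstance(merges, list) and merges:
        if isinstance(merges[0], str):
            # Old format: already space-separated strings.
            self.merges = merges
        elif isinstance(merges[0], list) and len(merges[0]) == 2:
            # New format: join each [left, right] pair into 'left right'.
            self.merges = [' '.join(pair) for pair in merges]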

@ggerganov ggerganov unpinned this issue Oct 3, 2024
@ggerganov
Owner

This should be resolved now. @nd791899 please close if it works for you.
