
Bug: cannot find tokenizer merges in model file #9692

Closed
nd791899 opened this issue Sep 30, 2024 · 11 comments
Labels
bug · high priority · high severity

Comments

@nd791899

What happened?

When I use transformers==4.45.1 and convert a model with llama.cpp to the GGUF file used by ollama, there is no error, but when I load the model with ollama, the error "cannot find tokenizer merges in model file" appears.

Name and Version

All versions

What operating system are you seeing the problem on?

No response

Relevant log output

No response

@nd791899 nd791899 added bug-unconfirmed and high severity labels Sep 30, 2024
@patrick60507

Same problem.

@nd791899
Author

In gguf-py/gguf/vocab.py, add_to_gguf(self, gw: GGUFWriter, quiet: bool = False) reports:

Adding merges requested but no merges found, output may be non-functional.

And in the _try_load_from_tokenizer_json function:

def _try_load_from_tokenizer_json(self, path: Path) -> bool:
    tokenizer_file = path / 'tokenizer.json'
    if tokenizer_file.is_file():
        with open(tokenizer_file, encoding = 'utf-8') as f:
            tokenizer = json.load(f)
        if self.load_merges:
            merges = tokenizer.get('model', {}).get('merges')
            if isinstance(merges, list) and merges and isinstance(merges[0], str):
                self.merges = merges
        added_tokens = tokenizer.get('added_tokens', {})

The isinstance(merges[0], str) check only matches the old format, but the tokenizer.json generated by transformers==4.45.1 stores each merge as a list rather than a string.

Can this be made compatible?
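For reference, a quick way to check which format a given tokenizer.json uses (a minimal sketch; the model directory path is a placeholder):

import json
from pathlib import Path

# Placeholder path -- point this at the directory holding your converted model.
tokenizer_file = Path('path/to/model') / 'tokenizer.json'

with open(tokenizer_file, encoding='utf-8') as f:
    tokenizer = json.load(f)

merges = tokenizer.get('model', {}).get('merges', [])
if merges:
    # Old format: each entry is a single string such as 'h e'.
    # New format (transformers >= 4.45): each entry is a pair such as ['h', 'e'].
    print(type(merges[0]).__name__, merges[0])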

@Vaibhavs10
Collaborator

Vaibhavs10 commented Sep 30, 2024

Hey hey, I'm VB from the open source team at Hugging Face. I can confirm that this is due to an update we've made to tokenizers: we now persist merges as lists instead of strings.

Everything should work on transformers 4.44.0; from 4.45.0 onward it won't, and we'd need to add support for the new format.

For reference this is the tokenizers PR that introduced it: huggingface/tokenizers#909

@Vaibhavs10 Vaibhavs10 added bug and removed bug-unconfirmed labels Sep 30, 2024
@danielhanchen
Contributor

There are some temporary fixes for Unsloth users which downgrade transformers to 4.44.2: unslothai/unsloth#1065 and unslothai/unsloth#1062

@ggerganov
Owner

Tagging @compilade for any insights on how best to resolve this.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

A couple of repos for testing:

The difference is the way merges are serialized in the tokenizer.json file. Each merge pair used to be a single string with a space separating the two tokens, but now each pair is saved as an array.
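Concretely, with made-up merges for illustration, the two shapes look like this:

# Old serialization: each merge is one space-separated string.
old_merges = ['h e', 'he llo']

# New serialization: each merge is a two-element list.
new_merges = [['h', 'e'], ['he', 'llo']]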

@ggerganov
Owner

ggerganov commented Sep 30, 2024

@pcuenca Thanks, I confirm that if I update to transformers 4.45 the conversion of these models succeeds using convert_hf_to_gguf.py. Without upgrading, it fails.

I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers? The latter seems the obvious solution to me, but I could be missing something.

@pcuenca
Contributor

pcuenca commented Sep 30, 2024

In my opinion, upgrading transformers is easier.

@Vaibhavs10
Collaborator

Vaibhavs10 commented Sep 30, 2024

Opened a PR to update the transformers version in the short term: #9694
(the CI errors look like warnings; not sure what to do about them)

We tested it with the new format and the old format:

  1. (new tokenisers format) https://huggingface.co/pcuenq/Qwen2.5-0.5B-Instruct-with-new-merges-serialization-Q8_0-GGUF
  2. (old tokenisers format) https://huggingface.co/pcuenq/Llama-3.2-1B-Instruct-Q8_0-GGUF

@compilade
Collaborator

Upgrading to transformers 4.45 likely isn't enough; gguf.SpecialVocab(dir_model, load_merges=True) only works with the old format while silently ignoring everything else:

if self.load_merges:
    merges = tokenizer.get('model', {}).get('merges')
    if isinstance(merges, list) and merges and isinstance(merges[0], str):
        self.merges = merges

I wonder, should we try to find a way to make convert_hf_to_gguf.py work with pre-4.45 or should we just prompt the user to upgrade their transformers?

To support the new format with older versions of transformers, we would need to avoid using AutoTokenizer.from_pretrained and/or fall back to full manual parsing of tokenizer.json. But that would not work with the current pre-tokenizer autodetection, which relies on tokenizing strings.

So transformers has to be updated to 4.45 and gguf-py/gguf/vocab.py needs to be adapted to the new serialization, as in #9696
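For illustration, a compatibility shim in vocab.py could accept both shapes by joining the new pair form back into the space-separated string the rest of the code expects (a minimal sketch, not the actual patch in #9696; tokens that themselves contain spaces would need extra care):

if self.load_merges:
    merges = tokenizer.get('model', {}).get('merges')
    if isinstance(merges, list) and merges:
        if isinstance(merges[0], str):
            # Old format: already space-separated strings.
            self.merges = merges
        elif isinstance(merges[0], list) and len(merges[0]) == 2:
            # New format: join each [left, right] pair into 'left right'.
            self.merges = [' '.join(pair) for pair in merges]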

@ggerganov ggerganov unpinned this issue Oct 3, 2024
@ggerganov
Owner

This should be resolved now. @nd791899 please close if it works for you.
