
Fallback to tokenizer.json if vocab.json does not exist #6245

Closed · wants to merge 1 commit

Conversation

@CISC (Contributor) commented Mar 22, 2024

Fixes #6238
Fixes #6216
Fixes #5973

@froggeric left a comment

Confirmed working. Tested with the previously mentioned WhiteRabbitNeo model: conversion works fine with --pad-vocab --vocab-type bpe. Without --vocab-type, it incorrectly identifies it as hfft and still produces a non-working file, which I think matches the original behaviour.

@froggeric left a comment

Reviewed the code changes. The change from a single item to a list is fine, and it adds the missing fallback.
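
For reference, a minimal sketch of the fallback being reviewed (the helper name and structure below are illustrative, not the actual diff): instead of loading a single hard-coded vocab.json, the loader walks a list of candidate files and falls back to tokenizer.json when vocab.json is absent.

```python
from pathlib import Path

def find_bpe_vocab_file(model_dir: Path) -> Path:
    # Hypothetical illustration of the fallback this PR introduces:
    # prefer vocab.json, but fall back to tokenizer.json when it is missing.
    for name in ("vocab.json", "tokenizer.json"):
        candidate = model_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"found neither vocab.json nor tokenizer.json in {model_dir}")
```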

@cebtenzzre (Collaborator) commented Mar 24, 2024

deepseek-coder-33b-instruct and WhiteRabbitNeo-33B-v1.5 both specify LlamaTokenizer (SPM) as their tokenizer, but they obviously use a tokenizer.json compatible with BPE. This inconsistency should be reported upstream, as it is the root cause of this issue - if they specified GPT2Tokenizer it would most likely work without any need for --vocab-type.
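
To illustrate the mismatch (a hedged sketch; the file layout is the standard HF repo layout, not quoted from either model), one can compare the tokenizer_class declared in tokenizer_config.json with the model type actually stored in tokenizer.json:

```python
import json
from pathlib import Path

def report_declared_vs_actual(model_dir: Path) -> None:
    # Illustration only: print what tokenizer_config.json claims versus what
    # tokenizer.json actually contains for a downloaded HF model directory.
    config = json.loads((model_dir / "tokenizer_config.json").read_text())
    tok = json.loads((model_dir / "tokenizer.json").read_text())
    declared = config.get("tokenizer_class")    # e.g. "LlamaTokenizer"
    actual = tok.get("model", {}).get("type")   # e.g. "BPE"
    print(f"declared tokenizer_class: {declared}, tokenizer.json model type: {actual}")
```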

@CISC (Contributor, Author) commented Mar 24, 2024

@cebtenzzre It should probably be reported upstream, but it's a little late now as there are plenty of DeepSeek derivatives. However, I don't see how it would have any impact on convert.py: it would still not work without --vocab-type (BPE is not considered by default), even with an existing vocab.json. This PR restores the previous behaviour if vocab.json is missing.

@froggeric Thank you for reviewing. :)

@cebtenzzre (Collaborator) commented

> I don't see how it would have any impact on convert.py

The default vocab type as of my PR is spm,hfft. If it weren't for a) the bytes_to_unicode mapping being moved out of the conversion scripts in #3252, b) deepseek models claiming to use the Llama tokenizer, and c) the tokenizer_model of HfVocab being fixed to llama instead of being based on tokenizer_class, the hfft step would just work.
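
A rough sketch of point c), deriving the GGUF tokenizer model from tokenizer_class rather than hard-coding llama (the mapping below is an assumption for illustration, not the code in convert.py):

```python
def tokenizer_model_from_class(tokenizer_class: str) -> str:
    # Hypothetical mapping for illustration; convert.py did not do this at the time.
    if tokenizer_class in ("GPT2Tokenizer", "GPT2TokenizerFast"):
        return "gpt2"   # GPT-2-style byte-level BPE
    if tokenizer_class in (
        "LlamaTokenizer", "LlamaTokenizerFast",
        "CodeLlamaTokenizer", "CodeLlamaTokenizerFast",
        "GemmaTokenizer", "GemmaTokenizerFast",
    ):
        return "llama"  # SentencePiece-style vocab
    raise ValueError(f"don't know the GGUF tokenizer model for {tokenizer_class}")
```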

@cebtenzzre (Collaborator) left a comment

Actually, could we also throw an exception in HfVocab if type(self.tokenizer) isn't one of the following?

  • LlamaTokenizer
  • LlamaTokenizerFast
  • CodeLlamaTokenizer
  • CodeLlamaTokenizerFast
  • GemmaTokenizer
  • GemmaTokenizerFast
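
Roughly what such a guard could look like (attribute and function names are assumed for illustration; this is a sketch of the suggestion, not code from convert.py):

```python
KNOWN_SPM_TOKENIZER_CLASSES = {
    "LlamaTokenizer", "LlamaTokenizerFast",
    "CodeLlamaTokenizer", "CodeLlamaTokenizerFast",
    "GemmaTokenizer", "GemmaTokenizerFast",
}

def check_hf_tokenizer(tokenizer) -> None:
    # Raise at conversion time, with a clear message, instead of emitting a
    # GGUF file that only fails later at runtime.
    name = type(tokenizer).__name__
    if name not in KNOWN_SPM_TOKENIZER_CLASSES:
        raise TypeError(f"HfVocab (hfft) does not support tokenizer class {name}")
```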

@CISC (Contributor, Author) commented Mar 25, 2024

> Actually, could we also throw an exception in HfVocab if type(self.tokenizer) isn't one of:
> ...

Well, two issues with that:

  • Why add an artificial limitation (that would also need to be maintained for no apparent reason)?
  • It's not relevant to this PR; it should be a separate one, IMO.

@cebtenzzre (Collaborator) commented Mar 25, 2024

> Why add an artificial limitation (that would also need to be maintained for no apparent reason)?

This issue manifests as a crash at runtime instead of an error at conversion time. This is because of two reasons:

  • The model actually claims to have an spm vocab, even though it does not;
  • HfVocab uses AutoTokenizer, yet it assumes that the model has an spm vocab.

If that check is added, one will not waste time trying to convert e.g. Aquila-7B with the default --vocab-type and then wondering why it crashes with an unclear error, which is a problem that didn't happen when hfft was opt-in. It would also help once the tokenizer_class in the deepseek models is fixed.

@CISC (Contributor, Author) commented Mar 25, 2024

> This issue manifests as a crash at runtime instead of an error at conversion time. This is because of two reasons:
>
>   • The model actually claims to have an spm vocab, even though it does not;

Does it? The config sets it to LlamaTokenizerFast and that works fine for conversion (and running on transformers); it only fails when loading as SPM instead of BPE. I'm guessing transformers is more lenient than llama.cpp?

Either way, I fear blocking conversion on unspecified tokenizer classes will become unmaintainable.

>   • HfVocab uses AutoTokenizer, yet it assumes that the model has an spm vocab.

I think perhaps this is a larger part of the issue. Looking at #6252, it would need to set a special tokenizer_model for DeepSeek, and I doubt we'll be able to do that automatically, so either way we need special (probably manual) handling.

> If that check is added, one will not waste time trying to convert e.g. Aquila-7B with the default --vocab-type and then wondering why it crashes with an unclear error, which is a problem that didn't happen when hfft was opt-in. It would also help once the tokenizer_class in the deepseek models is fixed.

Only because spm,bpe was default then. :)

@cebtenzzre (Collaborator) commented

> Does it? The config sets it to LlamaTokenizerFast and that works fine for conversion (and running on transformers); it only fails when loading as SPM instead of BPE. I'm guessing transformers is more lenient than llama.cpp?

True, it does work in practice, even though it's the wrong class - you can tell from tokenizer.json that this is a GPT2-style BPE tokenizer, not a Llama-style BPE tokenizer. The better way is to inspect tokenizer.json to identify which kind of tokenizer it is - I'm working on a PR that does that, as well as some other refactoring.
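
For example, one heuristic (an assumption about typical tokenizer.json layouts, not the refactor described above) is to look at the pre_tokenizer section: GPT-2-style byte-level BPE tokenizers declare a ByteLevel pre-tokenizer, while SentencePiece-derived ones do not.

```python
import json
from pathlib import Path

def guess_bpe_style(tokenizer_json: Path) -> str:
    # Heuristic sketch: inspect the pre_tokenizer section of tokenizer.json.
    data = json.loads(tokenizer_json.read_text())
    pre = data.get("pre_tokenizer") or {}
    types = {pre.get("type")}
    types |= {p.get("type") for p in pre.get("pretokenizers", [])}
    if "ByteLevel" in types:
        return "gpt2-style byte-level BPE"
    return "llama/sentencepiece-style"
```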

> Only because spm,bpe was default then. :)

Was spm,bpe ever the default? Before my PR, spm was the default, so it would fail unless tokenizer.model was found:

```python
parser.add_argument("--vocab-type", choices=vocab_types, help="The vocabulary format used to define the tokenizer model (default: spm)", default="spm")
```

llama.cpp/convert.py, lines 1335 to 1338 at c7a0ad8:

```python
elif vocabtype == "spm":
    vocab = SentencePieceVocab(
        path, added_tokens_path if added_tokens_path.exists() else None
    )
```

@CISC (Contributor, Author) commented Mar 27, 2024

> True, it does work in practice, even though it's the wrong class - you can tell from tokenizer.json that this is a GPT2-style BPE tokenizer, not a Llama-style BPE tokenizer. The better way is to inspect tokenizer.json to identify which kind of tokenizer it is - I'm working on a PR that does that, as well as some other refactoring.

That sounds like a much nicer solution, hope that works out.

> Was spm,bpe ever the default? Before my PR, spm was the default, so it would fail unless tokenizer.model was found:

No, sorry, I got mixed up.

@CISC deleted the convert-bpe-regression branch April 2, 2024 09:49