Fallback to tokenizer.json if vocab.json does not exist #6245
Conversation
Confirmed working. Tested with the previously mentioned WhiteRabbitNeo model. Conversion works fine with --pad-vocab --vocab-type bpe. Without --vocab-type, it incorrectly identifies it as hfft and still produces a non-working file, which I think is back to the original behaviour.
Reviewed the code changes. The change from a single item to a list is fine, and it adds the missing fallback.
deepseek-coder-33b-instruct and WhiteRabbitNeo-33B-v1.5 both specify LlamaTokenizer (SPM) as their tokenizer, but they obviously use a tokenizer.json compatible with BPE. This inconsistency should be reported upstream, as it is the root cause of this issue - if they specified GPT2Tokenizer it would most likely work without any need for --vocab-type.
@cebtenzzre It should probably be reported upstream, but it's a little late now as there are plenty of DeepSeek derivatives; however, I don't see how it would have any impact on convert.py, as it would still not work without

@froggeric Thank you for reviewing. :)
The default vocab type as of my PR is
Actually, could we also throw an exception in HfVocab if type(self.tokenizer) isn't one of the following?
LlamaTokenizer
LlamaTokenizerFast
CodeLlamaTokenizer
CodeLlamaTokenizerFast
GemmaTokenizer
GemmaTokenizerFast
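A minimal sketch of what such a check could look like, assuming HfVocab keeps the loaded Hugging Face tokenizer in self.tokenizer as the comment above implies; the helper name is hypothetical and the real convert.py internals may differ.

```python
# Hypothetical sketch of the proposed allow-list check, not actual convert.py code.
KNOWN_SPM_TOKENIZER_CLASSES = {
    "LlamaTokenizer",
    "LlamaTokenizerFast",
    "CodeLlamaTokenizer",
    "CodeLlamaTokenizerFast",
    "GemmaTokenizer",
    "GemmaTokenizerFast",
}

def check_hf_tokenizer_class(tokenizer) -> None:
    """Raise at conversion time instead of crashing later when the model is loaded."""
    name = type(tokenizer).__name__
    if name not in KNOWN_SPM_TOKENIZER_CLASSES:
        raise TypeError(
            f"HfVocab does not know how to handle tokenizer_class {name!r}; "
            f"try an explicit --vocab-type (e.g. bpe)"
        )
```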
Well, two issues with that:
This issue manifests as a crash at runtime instead of an error at conversion time, for two reasons:

If that check is added, one will not waste time trying to convert e.g. Aquila-7B with the default --vocab-type and then wonder why it crashes with an unclear error, which is a problem that didn't happen when hfft was opt-in. It would also help once the tokenizer_class in the DeepSeek models is fixed.
Does it? The config sets it to LlamaTokenizerFast, and that works fine for conversion (and for running on transformers); it only fails when loading as SPM instead of BPE. I'm guessing transformers is more lenient than llama.cpp? Either way, I fear blocking conversion on unspecified tokenizer classes will become unmaintainable.
I think this is perhaps a larger part of the issue: looking at #6252, it would need to set a special tokenizer_model for DeepSeek, and I doubt we'll be able to do that automatically, so either way we need special (probably manual) handling.
Only because spm,bpe was the default then. :)
True, it does work in practice, even though it's the wrong class - you can tell from tokenizer.json that this is a GPT2-style BPE tokenizer, not a Llama-style BPE tokenizer. The better way is to inspect tokenizer.json to identify which kind of tokenizer it is - I'm working on a PR that does that, as well as some other refactoring.
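For illustration only, here is one way such an inspection might look; this is not the refactor mentioned above. The field names follow the Hugging Face tokenizers JSON format, guess_vocab_type is a hypothetical helper, and the real heuristic may well differ.

```python
# Illustrative sketch: guess the vocab type by inspecting tokenizer.json.
# A ByteLevel pre_tokenizer or decoder is taken as a sign of a GPT-2-style
# byte-level BPE tokenizer; anything else is assumed to be SentencePiece-style.
import json
from pathlib import Path

def guess_vocab_type(model_dir: str) -> str:
    data = json.loads((Path(model_dir) / "tokenizer.json").read_text(encoding="utf-8"))

    def uses_byte_level(section) -> bool:
        # A pre_tokenizer/decoder may be a single component or a Sequence of them.
        if not isinstance(section, dict):
            return False
        if section.get("type") == "ByteLevel":
            return True
        children = section.get("pretokenizers", []) + section.get("decoders", [])
        return any(uses_byte_level(child) for child in children)

    if uses_byte_level(data.get("pre_tokenizer")) or uses_byte_level(data.get("decoder")):
        return "bpe"  # GPT-2-style byte-level BPE
    return "spm"      # assume SentencePiece-style otherwise
```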
Was spm,bpe ever the default? Before my PR, spm was the default, so it would fail unless tokenizer.model was found (see line 1390 and lines 1335 to 1338 at commit c7a0ad8).
That sounds like a much nicer solution; I hope that works out.
No, sorry, I got mixed up.
Fixes #6238
Fixes #6216
Fixes #5973