
Fallback to tokenizer.json if vocab.json does not exist #6245

Closed · wants to merge 1 commit

Conversation

@CISC (Contributor) commented Mar 22, 2024

Fixes #6238
Fixes #6216
Fixes #5973

@froggeric left a comment

Confirmed working. Tested with the previously mentioned WhiteRabbitNeo model: conversion works fine with --pad-vocab --vocab-type bpe. Without --vocab-type, it incorrectly identifies it as hfft and still produces a non-working file, which I think matches the original behaviour.

@froggeric left a comment

Reviewed the code changes. The change from a single item to a list is fine, and it adds the missing fallback.
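
For reference, a minimal sketch of the fallback being reviewed (the helper name and structure below are illustrative, not the actual diff): instead of loading a single hard-coded vocab.json, the loader walks a list of candidate files and falls back to tokenizer.json when vocab.json is absent.

```python
from pathlib import Path

def find_bpe_vocab_file(model_dir: Path) -> Path:
    # Hypothetical illustration of the fallback this PR introduces:
    # prefer vocab.json, but fall back to tokenizer.json when it is missing.
    for name in ("vocab.json", "tokenizer.json"):
        candidate = model_dir / name
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"found neither vocab.json nor tokenizer.json in {model_dir}")
```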

@cebtenzzre (Collaborator) commented Mar 24, 2024

deepseek-coder-33b-instruct and WhiteRabbitNeo-33B-v1.5 both specify LlamaTokenizer (SPM) as their tokenizer, but they obviously use a tokenizer.json compatible with BPE. This inconsistency should be reported upstream, as it is the root cause of this issue - if they specified GPT2Tokenizer it would most likely work without any need for --vocab-type.
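
To illustrate the mismatch (a hedged sketch; the file layout is the standard HF repo layout, not quoted from either model), one can compare the tokenizer_class declared in tokenizer_config.json with the model type actually stored in tokenizer.json:

```python
import json
from pathlib import Path

def report_declared_vs_actual(model_dir: Path) -> None:
    # Illustration only: print what tokenizer_config.json claims versus what
    # tokenizer.json actually contains for a downloaded HF model directory.
    config = json.loads((model_dir / "tokenizer_config.json").read_text())
    tok = json.loads((model_dir / "tokenizer.json").read_text())
    declared = config.get("tokenizer_class")    # e.g. "LlamaTokenizer"
    actual = tok.get("model", {}).get("type")   # e.g. "BPE"
    print(f"declared tokenizer_class: {declared}, tokenizer.json model type: {actual}")
```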

@CISC (Contributor, Author) commented Mar 24, 2024

@cebtenzzre It should probably be reported upstream, but it's a little late now as there are plenty of DeepSeek derivatives. However, I don't see how it would have any impact on convert.py: it would still not work without --vocab-type (BPE is not considered by default), even with an existing vocab.json. This PR restores the previous behaviour if vocab.json is missing.

@froggeric Thank you for reviewing. :)

@cebtenzzre (Collaborator) commented

> I don't see how it would have any impact on convert.py

The default vocab type as of my PR is spm,hfft. If it weren't for a) the bytes_to_unicode mapping being moved out of the conversion scripts in #3252, b) deepseek models claiming to use the Llama tokenizer, and c) the tokenizer_model of HfVocab being fixed to llama instead of being based on tokenizer_class, the hfft step would just work.
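
A rough sketch of point c), deriving the GGUF tokenizer model from tokenizer_class rather than hard-coding llama (the mapping below is an assumption for illustration, not the code in convert.py):

```python
def tokenizer_model_from_class(tokenizer_class: str) -> str:
    # Hypothetical mapping for illustration; convert.py did not do this at the time.
    if tokenizer_class in ("GPT2Tokenizer", "GPT2TokenizerFast"):
        return "gpt2"   # GPT-2-style byte-level BPE
    if tokenizer_class in (
        "LlamaTokenizer", "LlamaTokenizerFast",
        "CodeLlamaTokenizer", "CodeLlamaTokenizerFast",
        "GemmaTokenizer", "GemmaTokenizerFast",
    ):
        return "llama"  # SentencePiece-style vocab
    raise ValueError(f"don't know the GGUF tokenizer model for {tokenizer_class}")
```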

@cebtenzzre (Collaborator) left a comment

Actually, could we also throw an exception in HfVocab if type(self.tokenizer) isn't one of the following?

  • LlamaTokenizer
  • LlamaTokenizerFast
  • CodeLlamaTokenizer
  • CodeLlamaTokenizerFast
  • GemmaTokenizer
  • GemmaTokenizerFast
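
Roughly what such a guard could look like (attribute and function names are assumed for illustration; this is a sketch of the suggestion, not code from convert.py):

```python
KNOWN_SPM_TOKENIZER_CLASSES = {
    "LlamaTokenizer", "LlamaTokenizerFast",
    "CodeLlamaTokenizer", "CodeLlamaTokenizerFast",
    "GemmaTokenizer", "GemmaTokenizerFast",
}

def check_hf_tokenizer(tokenizer) -> None:
    # Raise at conversion time, with a clear message, instead of emitting a
    # GGUF file that only fails later at runtime.
    name = type(tokenizer).__name__
    if name not in KNOWN_SPM_TOKENIZER_CLASSES:
        raise TypeError(f"HfVocab (hfft) does not support tokenizer class {name}")
```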

@CISC (Contributor, Author) commented Mar 25, 2024

> Actually, could we also throw an exception in HfVocab if type(self.tokenizer) isn't one of:
> ...

Well, two issues with that:

  • Why add an artificial limitation (that would also need to be maintained for no apparent reason)?
  • It's not relevant to this PR; it should be a separate one, IMO.

@cebtenzzre (Collaborator) commented Mar 25, 2024

> Why add an artificial limitation (that would also need to be maintained for no apparent reason)?

This issue manifests as a crash at runtime instead of an error at conversion time. This is because of two reasons:

  • The model actually claims to have an spm vocab, even though it does not;
  • HfVocab uses AutoTokenizer, yet it assumes that the model has an spm vocab.

If that check is added, one will not waste time trying to convert e.g. Aquila-7B with the default --vocab-type and then wondering why it crashes with an unclear error, which is a problem that didn't happen when hfft was opt-in. It would also help once the tokenizer_class in the deepseek models is fixed.

@CISC (Contributor, Author) commented Mar 25, 2024

> This issue manifests as a crash at runtime instead of an error at conversion time. This is because of two reasons:
>
>   • The model actually claims to have an spm vocab, even though it does not;

Does it? The config sets it to LlamaTokenizerFast and that works fine for conversion (and running on transformers); it only fails when loading as SPM instead of BPE. I'm guessing transformers is more lenient than llama.cpp?

Either way, I fear blocking conversion on unspecified tokenizer classes will become unmaintainable.

>   • HfVocab uses AutoTokenizer, yet it assumes that the model has an spm vocab.

I think perhaps this is a larger part of the issue. Looking at #6252, it would need to set a special tokenizer_model for DeepSeek, and I doubt we'll be able to do that automatically, so either way we need special (probably manual) handling.

> If that check is added, one will not waste time trying to convert e.g. Aquila-7B with the default --vocab-type and then wondering why it crashes with an unclear error, which is a problem that didn't happen when hfft was opt-in. It would also help once the tokenizer_class in the deepseek models is fixed.

Only because spm,bpe was default then. :)

@cebtenzzre (Collaborator) commented

> Does it? The config sets it to LlamaTokenizerFast and that works fine for conversion (and running on transformers); it only fails when loading as SPM instead of BPE. I'm guessing transformers is more lenient than llama.cpp?

True, it does work in practice, even though it's the wrong class - you can tell from tokenizer.json that this is a GPT2-style BPE tokenizer, not a Llama-style BPE tokenizer. The better way is to inspect tokenizer.json to identify which kind of tokenizer it is - I'm working on a PR that does that, as well as some other refactoring.
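
For example, one heuristic (an assumption about typical tokenizer.json layouts, not the refactor described above) is to look at the pre_tokenizer section: GPT-2-style byte-level BPE tokenizers declare a ByteLevel pre-tokenizer, while SentencePiece-derived ones do not.

```python
import json
from pathlib import Path

def guess_bpe_style(tokenizer_json: Path) -> str:
    # Heuristic sketch: inspect the pre_tokenizer section of tokenizer.json.
    data = json.loads(tokenizer_json.read_text())
    pre = data.get("pre_tokenizer") or {}
    types = {pre.get("type")}
    types |= {p.get("type") for p in pre.get("pretokenizers", [])}
    if "ByteLevel" in types:
        return "gpt2-style byte-level BPE"
    return "llama/sentencepiece-style"
```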

> Only because spm,bpe was default then. :)

Was spm,bpe ever the default? Before my PR, spm was the default, so it would fail unless tokenizer.model was found:

```python
parser.add_argument("--vocab-type", choices=vocab_types, help="The vocabulary format used to define the tokenizer model (default: spm)", default="spm")
```

llama.cpp/convert.py, lines 1335 to 1338 at c7a0ad8:

```python
elif vocabtype == "spm":
    vocab = SentencePieceVocab(
        path, added_tokens_path if added_tokens_path.exists() else None
    )
```

@CISC (Contributor, Author) commented Mar 27, 2024

> True, it does work in practice, even though it's the wrong class - you can tell from tokenizer.json that this is a GPT2-style BPE tokenizer, not a Llama-style BPE tokenizer. The better way is to inspect tokenizer.json to identify which kind of tokenizer it is - I'm working on a PR that does that, as well as some other refactoring.

That sounds like a much nicer solution, hope that works out.

> Was spm,bpe ever the default? Before my PR, spm was the default, so it would fail unless tokenizer.model was found:

No, sorry, I got mixed up.

@CISC deleted the convert-bpe-regression branch April 2, 2024 09:49