
Use tokenizer.vocab_size() instead of hardcoding 32000 when converting #142

Merged: 1 commit merged into ggml-org:master on Mar 15, 2023

Conversation

@Ronsor (Contributor) commented on Mar 14, 2023

When converting the model + tokenizer, use the vocabulary size returned by the tokenizer rather than assuming 32000.

There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
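As a rough sketch of the idea (not the actual diff; the tokenizer.model path and the hparams dict are illustrative placeholders), the conversion script can ask the SentencePiece tokenizer for its size instead of hardcoding it:

```python
# Rough sketch only, not the actual patch; the file path and the hparams
# dict below are assumed placeholders.
from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor()
tokenizer.Load("tokenizer.model")  # LLaMA's SentencePiece tokenizer model

hparams = {}
# Before: the vocabulary size was hardcoded.
# hparams["vocab_size"] = 32000
# After: use whatever the tokenizer actually reports, so added special or
# other new tokens are counted.
hparams["vocab_size"] = tokenizer.vocab_size()
print("vocab_size:", hparams["vocab_size"])
```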

Commit: …th-to-ggml.py

There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
@ggerganov merged commit 956dfda into ggml-org:master on Mar 15, 2023
blackhole89 pushed a commit referencing this pull request on Mar 15, 2023: …th-to-ggml.py (#142)
@Ronsor deleted the patch-2 branch on March 17, 2023 at 00:57
bitRAKE pushed a commit to bitRAKE/llama.cpp referencing this pull request on Mar 17, 2023: …th-to-ggml.py (ggml-org#142)