
Use tokenizer.vocab_size() instead of hardcoding 32000 when converting #142

Merged: 1 commit merged into ggml-org:master on Mar 15, 2023

Conversation

@Ronsor (Contributor) commented on Mar 14, 2023

When converting the model + tokenizer, use the vocabulary size returned by the tokenizer rather than assuming 32000.

There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
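As a rough sketch of the idea (not the actual diff; the tokenizer.model path and the hparams dict are illustrative placeholders), the conversion script can ask the SentencePiece tokenizer for its size instead of hardcoding it:

```python
# Rough sketch only, not the actual patch; the file path and the hparams
# dict below are assumed placeholders.
from sentencepiece import SentencePieceProcessor

tokenizer = SentencePieceProcessor()
tokenizer.Load("tokenizer.model")  # LLaMA's SentencePiece tokenizer model

hparams = {}
# Before: the vocabulary size was hardcoded.
# hparams["vocab_size"] = 32000
# After: use whatever the tokenizer actually reports, so added special or
# other new tokens are counted.
hparams["vocab_size"] = tokenizer.vocab_size()
print("vocab_size:", hparams["vocab_size"])
```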

Commit: …th-to-ggml.py

There are ways that special tokens or other new tokens could be added to the tokenizer; therefore it's probably best not to assume the vocabulary is only 32000 tokens.
@ggerganov merged commit 956dfda into ggml-org:master on Mar 15, 2023
blackhole89 pushed a commit referencing this pull request on Mar 15, 2023: …th-to-ggml.py (#142)
@Ronsor deleted the patch-2 branch on March 17, 2023 at 00:57
bitRAKE pushed a commit to bitRAKE/llama.cpp referencing this pull request on Mar 17, 2023: …th-to-ggml.py (ggml-org#142)