chore: Add model vocab support #7117
Conversation
Is this work in progress? It seems to be missing the implementation in llama.cpp. Those tokenizers won't be recognized, leading to a crash with a runtime error.
@Galunid Yes, it's still a work in progress. I was passing out while still implementing it because I wanted to get it out of the way, so I decided to pause for a bit.
Possible regression with llama-spm?

```
      Start 11: test-tokenizer-1-llama-spm
11/24 Test #11: test-tokenizer-1-llama-spm .......Subprocess aborted***Exception: 0.86 sec
```

Looking into it.
@teleprint-me Should not
I had already committed it and was waiting to be able to merge the PRs into this branch. I can remove it after some testing if it really isn't needed, but the pattern is useful for seeing the potential variety of implementations we'll be dealing with. The original Qwen repos use BPE with tiktoken and a build script that is hooked into the transformers tokenizer. While the vocab might be the same, the process for getting it is completely different. It's good to know about, even if the tokenizer itself isn't useful in this context.
This can be overridden with the -m command line option ref: ggml-org#7293
Revert "…7284)" (ggml-org#7334): this reverts commit 583fd6b.
Hm. I messed up the merge. It usually auto-rebases downstream merges. That's no good.
@teleprint-me Your PR is not working for StableLM models. Why is the pre-tokenizer stablelm? If I remember correctly, it should be gpt-2, right?
@aahouzi Yes, you're correct. I think I know why this is happening. I'm wondering if it's because of

```python
def _set_vocab_gpt2(self) -> None:
    tokens, toktypes, tokpre = self.get_vocab_base()
    self.gguf_writer.add_tokenizer_model("gpt2")
    self.gguf_writer.add_tokenizer_pre(tokpre)
    self.gguf_writer.add_token_list(tokens)
    self.gguf_writer.add_token_types(toktypes)
```

I fixed phi-2 by directly adding

Then overriding on a model-by-model basis as needed might be a better approach? Need feedback on how this might affect other models. @compilade
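As a rough sketch of the per-model override approach described above (this is not code from this PR): it assumes the `Model` base class and `gguf_writer` methods from convert-hf-to-gguf.py are in scope, and both the class name and the pinned pre-tokenizer value are placeholders.

```python
# Rough sketch (not code from this PR): pin a known pre-tokenizer for a single
# architecture instead of relying on detection. Assumes the Model base class
# and its gguf_writer from convert-hf-to-gguf.py are already in scope.
class MyStableLMModel(Model):  # hypothetical class name, for illustration only
    def set_vocab(self):
        tokens, toktypes, _ = self.get_vocab_base()
        self.gguf_writer.add_tokenizer_model("gpt2")
        self.gguf_writer.add_tokenizer_pre("gpt-2")  # pinned explicitly; assumes "gpt-2" is a registered pre name
        self.gguf_writer.add_token_list(tokens)
        self.gguf_writer.add_token_types(toktypes)
```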
Okay, I get it now. This verifies my initial intuition about how the mapping was set up in the update script. This needs to be refactored somehow. We can't rely on a name as an ID for the model's pre-tokenizer.

```python
# used for GPT-2 BPE and WordPiece vocabs
def get_vocab_base(self) -> tuple[list[str], list[int], str]:
    tokens: list[str] = []
    toktypes: list[int] = []

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(self.dir_model)
    vocab_size = self.hparams.get("vocab_size", len(tokenizer.vocab))
    assert max(tokenizer.vocab.values()) < vocab_size

    tokpre = self.get_vocab_base_pre(tokenizer)

    # omitting for brevity

    return tokens, toktypes, tokpre
```

The model's vocab may be modified downstream, and conversions will fail as a result even if the architecture is clearly supported. This is creating name conflicts. It turns out the phi-2 bug in #7219 and #7300 is a symptom of a more deeply rooted issue. @ggerganov
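For readers following along, here is a minimal sketch of the hash-based lookup being discussed, assuming roughly the shape of `get_vocab_base_pre()`; the check string and hashes below are illustrative placeholders, and the real table is generated by the update script.

```python
# Minimal sketch of hash-based pre-tokenizer detection (assumed shape of
# get_vocab_base_pre(); check string and hashes are illustrative placeholders).
import hashlib

def guess_tokenizer_pre(tokenizer) -> str:
    chktxt = "Hello, y'all! How are you 😁 ?"  # the real script uses a much longer check string
    chktok = tokenizer.encode(chktxt)
    chkhsh = hashlib.sha256(str(chktok).encode()).hexdigest()

    # hash -> pre-tokenizer name, normally filled in by the update script
    known_hashes: dict[str, str] = {
        # "<sha256 of llama-bpe tokenization>": "llama-bpe",
        # "<sha256 of gpt-2 tokenization>": "gpt-2",
    }
    # a vocab modified downstream tokenizes the check string differently,
    # so the hash no longer maps to the name recorded for that architecture
    return known_hashes.get(chkhsh, "unknown")
```

In that scheme the name is effectively an alias for one exact tokenization result, which is why a downstream vocab change breaks the lookup even when the architecture itself is supported.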
I think I can fully automate this entire process and reduce the complexity. Not sure yet. Need to experiment.
Phi-2 has this pre-tokenizer (from its tokenizer.json):

```json
"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}
```

The
Hmm, no, I don't think this will work for architectures which are used with different pre-tokenizers. For example, the
Yes, this is definitely possible. A starting point would be to compare the

The question probably is then: can pre-tokenizers entirely be identified by the

(EDIT: maybe this would be problematic with some models which use a custom tokenizer, like in
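A sketch of that comparison, under the assumption that it is done over the normalizer and pre_tokenizer sections of each model's tokenizer.json (file paths below are placeholders):

```python
# Sketch: compare the normalizer / pre_tokenizer sections of two tokenizer.json
# files to decide whether two models share a pre-tokenizer configuration.
import json

def pretokenizer_signature(tokenizer_json_path: str) -> str:
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)
    relevant = {
        "normalizer": data.get("normalizer"),
        "pre_tokenizer": data.get("pre_tokenizer"),
    }
    return json.dumps(relevant, sort_keys=True)  # deterministic serialization

# usage with placeholder paths:
# pretokenizer_signature("model_a/tokenizer.json") == pretokenizer_signature("model_b/tokenizer.json")
```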
@compilade For StableLM 3B, the picked hash is from OLMo's pre-tokenizer.
@compilade Yeah, I tried this already, and it proved to be incredibly complicated. Many of us have already come to the conclusion that there is no reliable way to do this, so I'm thinking maybe we lean into that instead of veering away from it. A weakness can be utilized as a strength just as a perceived strength can be a weakness.
@aahouzi I haven't implemented it yet. I've been observing PRs, attempting to identify a useful pattern.
@aahouzi Which seems appropriate, considering OLMo also uses GPT-2's regex for its pre-tokenizer. From the tokenizer.json:

```json
"pre_tokenizer": {
  "type": "ByteLevel",
  "add_prefix_space": false,
  "trim_offsets": true,
  "use_regex": true
}
```

And OLMo uses GPT-2's regex in
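For reference, this is the GPT-2 pre-tokenization regex being referred to, as published in OpenAI's original encoder.py; shown here as a standalone snippet that needs the third-party regex module for the \p{...} classes.

```python
# GPT-2's pre-tokenization pattern (from OpenAI's gpt-2 encoder.py).
import regex as re

gpt2_pat = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)

print(gpt2_pat.findall("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'s", ' 2024', '.']
```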
I think I figured out how to automate the tokenizer, model, checksum, and conversions all in one go. Will close this PR and open a new one when I'm ready to post for Hugging Face-related tasks.
There is also the
The issue is determining what the

The metadata for the

For example, the llama bpe normalizer is defined as

Same for other types of related metadata, such as added tokens, special tokens, model type, etc. It seems that the
Hm.

```python
>>> tokenizer.backend_tokenizer.normalizer
>>> type(tokenizer.backend_tokenizer.normalizer)
<class 'NoneType'>
>>> from tokenizers import normalizers
>>> from tokenizers.normalizers import NFD, StripAccents
>>> normalizer = normalizers.Sequence([NFD(), StripAccents()])
>>> tokenizer.backend_tokenizer.normalizer = normalizer
>>> tokenizer.backend_tokenizer.normalizer
<tokenizers.normalizers.Sequence object at 0x76840738e1f0>
>>> tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?")
'Hello how are u?'
```

Not sure how reliable this would be? Plus, this is only really needed for the conversion? The vocab is already pre-existing, extracted, and then written to the model file. So, what's the plan here? I guess it depends on what we're looking to do.
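An alternative that avoids mutating the tokenizer would be to read the serialized backend tokenizer instead; a small sketch, assuming a fast tokenizer and a placeholder model path:

```python
# Sketch: inspect the normalizer / pre_tokenizer definitions without modifying
# anything, by parsing the backend tokenizer's JSON serialization (fast tokenizers only).
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/model")  # placeholder path
config = json.loads(tokenizer.backend_tokenizer.to_str())

print(config.get("normalizer"))     # None, or e.g. {'type': 'Sequence', 'normalizers': [...]}
print(config.get("pre_tokenizer"))  # e.g. {'type': 'ByteLevel', 'add_prefix_space': False, ...}
```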
These are just some off-the-cuff check boxes. TBH, it's confusing because the

e.g., what kind of tokenizer does the model depend on? I'm just rolling with it right now. The remaining metadata really just depends on the intent and purpose, most of which is already utilized.
Superseded by PR #7379
Perhaps this should be implemented anyway, while the automation project is still ongoing?
ref #6920 comment
supersedes #7018
Adds the following models:
- qwen (qwen2 supersedes qwen)

Adds the following extras: