Server: possibility of customizable chat template? #5922
Comments
> Users that want to support a certain template should open a PR and implement it in the framework that we already have
Yeah, I thought that would be ideal, but sometimes that's not even enough: maybe a user wants to fine-tune and try out one single template? Another problem is that there are currently models that don't have a Jinja template, for example the old Alpaca. The current proposed solution is to use

My proposal in this PR is not complete, so I'll leave it here to see if anyone comes up with another use case (or another idea) that we've never thought about, for example.
Yeah, this is why I said templates should be the responsibility of the user. It's why I always use the completions endpoint and avoid any chat template enforcement.

The problem is simple to understand once you understand how a tokenizer is created and trained. ANY token can be used. ANY TOKEN. This ambiguity is problematic for requiring even a loose set of rules. This problem will exist even if the industry agrees upon a standard. ChatML is nice, but it's only a single use case and doesn't solve the wider issue at hand.

This also makes completions more valuable, because they're the most flexible. Completions also set the foundation, or stage, for chat fine-tuning. It sucks that OpenAI took it down, but it's what I always use when I use llama.cpp. Completions are literally the only reason I use llama.cpp. There's so much more flexibility that way.

Just put the responsibility on the user, end of discussion. This isn't a conclusion I came to lightly. It took time, research, and experimentation to figure this out. This is why I backed and proposed this idea. It is the most viable solution for the interim, and even then, this solution fails miserably with FIM and other templates.

It's not that I think this is an impossible problem, but it is the type of problem that will create an entirely different set of problems that compound one another and eventually become a bottleneck in the best-case scenario. I really do understand the temptation here, but it's best avoided.
Thanks for your input. For clarification, I'm not saying that my proposal solves all the issues we have with chat templates in general. If I were that confident, I would have just made a PR instead. I'm also not assuming that supporting custom chat templates is a good idea or not; I'm still learning here.

I understand your point. It's a valid reason not to have this feature, and I appreciate that. However, I will still keep this issue open for a while to collect some feedback; it may be helpful if we change the decision in the future.
I think this is the middle-of-the-road solution, which is good. I just keep reiterating it because the tokens are dictated by the tokenizer and the settings used to train the tokenizer. Then the chat template is fine-tuned with any added tokens. All of the tokens for the model are (usually, but unfortunately not always) in there, e.g.

So, to clarify, I support your proposal. Once the tokenizer is understood, managing the templates for the model becomes more intuitive. A really great way to get a feel for it is with completions.
I came across ollama/ollama#1977 and I feel like we're in the middle of a "war of templates". You're right @teleprint-me, there's temptation, but it's better to avoid it, at least at this stage.

Edit: Still, I'm feeling quite lucky because in llama.cpp we have
I'd love to have it automated; it would be great.

I forgot where I stated it, but I remember reiterating that this is similar to "Hilbert's paradox of the Grand Hotel", which "is a thought experiment which illustrates a counterintuitive property of infinite sets". This issue arises because of the desire to support many models with a variety of templates. Model developers can choose to set up the template however they'd like, and so can fine-tuners. The moment you begin baking in special tokens, chat templates, and more is the moment you've bound yourself to an uncanny solution that becomes exponentially more difficult to manage over time. You'll always need to accommodate another "guest".

The simplest solution is to create an API or framework that developers can plug into. @ggerganov actually suggested this same solution a while ago. I recommended this solution multiple times. I've been advocating to place the chat template as the responsibility of the user. My rationale is to keep the code and API simple and digestible. I'm confident that there is a way to find a middle ground, but we'll need to work towards that middle ground.

I think your idea is actually sound, and the reason is that it's simple and flexible. The motto around here seems to be to not over-engineer, but supporting chat templates will require much more than over-engineering, and this doesn't include the maintenance that will ensue as a result. It has technical debt written all over it.

I think using the prefix and postfix for prompts is probably the best we can do until templates become solidified. It's still early and we're just getting started. It's better to observe and learn as we progress. Once a pattern emerges, we can use that as an anchor.
As an aside, I'd love to build a custom tokenizer for llama.cpp. I think it would be great. We could use it for training and fine-tuning. I haven't looked at the backend lately, but back-propagation would obviously help for updating the weights. What would be really neat is training and fine-tuning quants. If I remember correctly, the softmax gives the output probabilities from the logits and we update the weights on the backward pass with cross-entropy (feel free to correct me, I'm still learning). Now that would be really cool :)
Re: #6726 (comment)
My hunch is that code logic in templates can still be avoided if the configuration provides enough flexibility. For example, providing an alternate template based on message index:

```json
{
    "system": {
        "prefix": "<s>[INST] <<SYS>>\n",
        "postfix": "\n<<SYS>>\n\n"
    },
    "user_1": {
        "prefix": "",
        "postfix": " [/INST] "
    },
    "user": {
        "prefix": "<s>[INST] ",
        "postfix": " [/INST] "
    },
    "assistant": {
        "prefix": "",
        "postfix": " </s>"
    }
}
```

Or more generally:

```json
{
    ...
    "user": {
        "prefix": "<s>[INST] ",
        "postfix": " [/INST] ",
        "override": {
            "1": {
                "prefix": ""
            }
        }
    },
    ...
}
```

Though I wonder if a one-off workaround for the first user message might even be enough.
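To make the override idea concrete, here is a minimal C++ sketch of how a formatter could resolve a per-index override; the struct and function names are hypothetical, not anything that exists in llama.cpp. Note that an empty string is a meaningful override (as in the "user" example above), which is why explicit presence flags are used rather than checking for emptiness.

```cpp
#include <map>
#include <string>

// Hypothetical structures for illustration only.
struct role_override {
    bool        has_prefix  = false;
    bool        has_postfix = false;
    std::string prefix;
    std::string postfix;
};

struct role_format {
    std::string prefix;
    std::string postfix;
    std::map<int, role_override> overrides; // keyed by 1-based message index
};

// Return the prefix/postfix to use for the i-th message of a role,
// falling back to the defaults when no override is defined for that index.
static role_format resolve_format(const role_format & base, int index) {
    role_format out = base;
    auto it = base.overrides.find(index);
    if (it != base.overrides.end()) {
        if (it->second.has_prefix)  out.prefix  = it->second.prefix;
        if (it->second.has_postfix) out.postfix = it->second.postfix;
    }
    return out;
}
```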
Re: #6726 (comment)
Might just be me, but I slightly prefer the aesthetic / concise legibility of seeing the entire message in context:

```json
{
    "system": "<s>[INST] <<SYS>>\n{{content}}<<SYS>>\n\n",
    "user_1": "{{content}} [/INST] ",
    "user": "<s>[INST] {{content}} [/INST] ",
    "assistant": "{{content}} </s>"
}
```

Not a big deal, because both are more legible than a string of escaped Jinja. As for injection risk, this wouldn't need to execute code, just a string replacement. Maybe I'm overlooking something here?
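And the "just a string replacement" part really can be a handful of lines; a minimal sketch (the function name is illustrative):

```cpp
#include <string>

// Substitute every occurrence of {{content}} in a per-role template with the
// message text. No expression evaluation, no code execution.
static std::string apply_template(std::string tmpl, const std::string & content) {
    const std::string placeholder = "{{content}}";
    size_t pos = 0;
    while ((pos = tmpl.find(placeholder, pos)) != std::string::npos) {
        tmpl.replace(pos, placeholder.size(), content);
        pos += content.size(); // skip past the inserted text to avoid re-scanning it
    }
    return tmpl;
}

// e.g. apply_template("<s>[INST] {{content}} [/INST] ", "Hello")
//      returns "<s>[INST] Hello [/INST] "
```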
@teleprint-me Seems like chat templates brush up against a macro question around the ideal scope of llama.cpp and this server "example" in general. But whether the
@kaizau I personally prefer having postfix/prefix explicitly, since it makes the cpp code more readable. I think the format you proposed is more suitable for higher-level programming languages, where the parser can be just one or two lines of code.
@kaizau I agree with your assessment.
The injection risk I was talking about is more about user input containing special tokens. When templated into a string, they can end up being tokenized as control tokens rather than as plain text.
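One way to limit that risk, sketched below under the assumption of a tokenizer interface that lets the caller enable or disable special-token parsing (the `tokenize_fn` type here is hypothetical), is to tokenize the template pieces and the user content separately and only allow special tokens in the template pieces:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

using token = int32_t;

// Hypothetical tokenizer interface: (text, parse_special) -> token ids.
using tokenize_fn = std::function<std::vector<token>(const std::string &, bool)>;

// Build one chat turn so that special tokens are only honored in the
// template prefix/postfix, never inside the user-supplied content.
static std::vector<token> build_turn(const tokenize_fn & tokenize,
                                     const std::string & prefix,
                                     const std::string & user_content,
                                     const std::string & postfix) {
    std::vector<token> out;
    auto append = [&](const std::vector<token> & part) {
        out.insert(out.end(), part.begin(), part.end());
    };
    append(tokenize(prefix,       /*parse_special=*/true));
    append(tokenize(user_content, /*parse_special=*/false)); // treated as plain text
    append(tokenize(postfix,      /*parse_special=*/true));
    return out;
}
```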
Please do have a look at the code in the below PR. Around the time when Llama 3 came out, I had a need to look at llama.cpp, and in turn I worked on the below, to try and see if one can have a generic flow, driven by a config file, that accommodates different modes / chat-handshake template standards in a generic and flexible way.

The idea is that if a new template standard is added during fine-tuning of a model, or if a new model or standard comes out, but it follows a sane convention matching the commonality that I have noticed across many models/standards, then the generic code flow itself can be used by just updating the config file, without having to add a custom template block. This in turn can be used by examples/main, examples/server, as well as other users of the llama.cpp library.

Currently main has been patched to use this config-file-based flow, in turn piggybacking on its existing interactive mode and its in-prefix, in-suffix, and antiprompt to a great extent. Based on some minimal testing at my end, I seem to be able to handle the nitty-gritties of around 8(+1) models using this generic code + config file based flow.

Currently JSON is used for the config file, but if needed it can be switched to a simpler text-based config file, to avoid users of the llama.cpp library needing to depend on a JSON library. The generic code flow uses a concept similar to what this PR is also thinking, i.e. a generic code flow driven by a config file. Additionally, the generic flow takes care of
You can look at examples/chaton_meta.json, which has entries for the 8(+1) models/standards that I have tested with my patch.
Agreed, which is what I asked about here: issues/6982. As ngxson pointed out, "the code is so simple" that we can write it ourselves in whatever frontend we use.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Motivation
While we already have support for known chat templates, it is sometimes not enough for users who want to:
The problem is that other implementations of chat templates out there are also quite messy, for example: many assume the only roles are `system` - `user` - `assistant` (but technically it's possible to have custom roles like `database`, `function`, `search-engine`, ...).

Possible implementation
My idea is to have a simple JSON format that takes into account all roles:
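Purely as an illustration (the keys and token strings below are assumptions, following the prefix/postfix layout discussed in the comments above), such a template file might look like:

```json
{
    "system":    { "prefix": "<|system|>\n",    "postfix": "\n" },
    "user":      { "prefix": "<|user|>\n",      "postfix": "\n" },
    "assistant": { "prefix": "<|assistant|>\n", "postfix": "\n" }
}
```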
The user can specify the custom template via `--chat-template-file ./my_template.json`.
The cpp code will be as simple as:
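As a rough sketch of what that function could look like, assuming a role → prefix/postfix map loaded from the JSON file (all names below are placeholders, not the actual implementation):

```cpp
#include <map>
#include <string>
#include <vector>

struct chat_msg {
    std::string role;    // "system", "user", "assistant", or a custom role
    std::string content;
};

struct role_affixes {
    std::string prefix;
    std::string postfix;
};

// Concatenate prefix + content + postfix for each message, looking the
// affixes up by role name in the map loaded from the JSON template file.
static std::string apply_chat_template(const std::map<std::string, role_affixes> & tmpl,
                                       const std::vector<chat_msg> & messages) {
    std::string prompt;
    for (const auto & msg : messages) {
        auto it = tmpl.find(msg.role);
        if (it == tmpl.end()) {
            prompt += msg.content; // unknown role: pass content through unchanged
            continue;
        }
        prompt += it->second.prefix + msg.content + it->second.postfix;
    }
    return prompt;
}
```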
NOTE: This function does not take into account models that do not support a system prompt for now, but that can be added in the future, maybe toggled via an attribute inside the JSON such as `"system_inside_user_message": true`.
Ref: