-
Notifications
You must be signed in to change notification settings - Fork 10.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement (properly) different chat templates in main.cpp #6391
Comments
Superseded #6378 |
I am still learning basics of llama.cpp, so maybe I am missing something, but shouldn't there also be separate parameter for main to inject system message in proper place, while using given chat template? Basically something like SYSTEM in Ollama's Modelfile. |
There are 2 models. The tokenizer and then the language model. The tokenizer manages how the input is managed and split into "tokens", I say "tokens" because they can be anything and represent anything. The language model utilizes the tokenizer and is "conditioned" on sequencing of the input. So you could simply "inject" a "system prompt", but it wouldn't make a difference unless the language model was "conditioned" with this in mind. You can do some really interesting things with the "initial prompt" which is typically more evident in a "completion" than a "chat completion". This is really pronounced with the Mistral models because none of them technically have a "system prompt". |
It seems to me that the chatml formatting code in main.cpp is mostly good, sans the anti-prompt issue. Instead of trying to reuse I imagine the change is relatively straightforward and I probably can make a patch, but if it's not the direction the project wants to go, I'll probably just hack up something myself so that I can chat with llama3 in peace :D |
@hnfong The reason why I propose to reuse Btw, the main idea here is to rethink about this subject more seriously before actually implement it - this is to prevent introducing patches, which makes it harder for future contributors and maintainers to work on the project. |
We might be converging to the same idea, but anyway if you want the code to be reuseable I suppose you can refactor the FWIW, I'm happy for whatever solution that gets implemented, the existing templating code in main.cpp is really unnecessarily rigid and cumbersome and needs a makeover. (I got my quick patch working for llama3 though, so hopefully that will last [i.e. no conflicts] until this issue is resolved :) |
My idea is simply call |
Have a look at #6795, it allows one to chat using the existing chat template mechanism from main. In that PR, if you look at the comments, I have attached a simple patch for adding llama3 template support and with that you can talk to llama3 using main. |
Not sure why you want to do that, look at that patch I submitted, it uses the existing interactive/chatml flow of main and modifies it so that the existing chat template logic gets called only once per message, as and when a new message is entered by the user. And I have tested that it works properly with llama2 as well as llama3 (with the template patch I have attached as a comment there). There is one issue, with my patch, which I will look at fixing later today or tomorrow, ie ensuring that the user message doesnt get tokenized wrt special tokens. Given how llama.cpp's chat apply template logic is implemented, both the special tokens (to help identify role to chat-model) and the user message get merged into a single string, which is returned, so that is one thing, I require to handle by extracting the surrounding introduced by chat-apply-template and handling its tokenization seperate from the actual user message |
System prompt is a one time message to handle at the begining and for any template standard which doesnt have system role handled in chat-apply-template, it is better to update chat-apply-template to add system role wrapping, rather than trying to look through messages multiple times. Also even in your case, you still will have to handle the missing system prompt wrapping issue. I feel it is better to add the missing system role wrapping in the chat-apply-template directly, so that it can be used by anyone be it server or main or any other logic that may use. |
Take this case for example: We have 2 messages:
If you apply chat template one-by-one:
That produces:
Which is incorrect. The correct formatted chat is:
Another case that it will fails is llama2 with BOS in-history. The first message in the prompt never has |
Sorry I don't really get this idea, can you please take an example? Anyway, I still don't think having to format all messages every time is that bad. If we think in the perspective of |
I am more of a casual recent enquirer (not a regular user) of LLMs, so I havent really looked deep into them or for that matter even llama.cpp. But my understanding with minimal experimenation is In chat mode, one would use the prompt options ie -p and or -f (ie wrt main program) to send a system prompt (which will concatanate -p + -f contents) once and later keep handshaking the individual query/chat messages to the model and the responses got. And in this case the chat-apply-template would get called once for the system-prompt and once for each of the user query/message. In the above case, I am assuming that all contents of the system-prompt (-p + -f related contents) I mentioned above should be encoded as a single system role message to the model. And let the model differentiate between any system role and sample user/assistant role messages that may have been added towards the end in the system-prompt-file (-f). IE the chat program (main in this case) doesnt bother differentiating between these system and user + assistant role messages in the system prompt. However if the convention used universally is that any sample user+assistant role messages included in the system prompt need to be identified and tagged seperately from system role message in the system prompt, by the chat program before sending to the model, then Possiblity-1 one way to handle that will be to include a flag which gets passed to chat-apply-template, which tells it that it is handling a bunch of messages which belongs to system-prompt (ie which will include 1 system role message and optionally user and assistant role messages in it), and in this case the chat-apply-template depending on which handshake-template-standard it is handling decide whether to introduce or not the INST/equivalents in the middle. Again I am assuming that different handshake standards may be handling this differently, so it is for each block within the chat-apply-template to decide how it will handle its case. But either way be it the system prompt or any subsequent user messages, it needs to go through chat-apply-template only once. Possibility-2 However if all models today have standardized on the handshake high level contours, then the better thing may be to have a bunch of helpers like chat-handshake-system-prefix, chat-hs-system-suffix which provide the tags/tokens for identifying and demarcating the corresponding role messages, as well as a group of messages (which could even be only a single message). and inturn have the chat-apply-template built on top of these individual helpers, which inturn depending on whether it is handling
chain the above helpers in a standard way. In possibility-1, each chat-template-standard block needs to handle things as needed in its block, while in possiblity-2 the individual helpers dont worry about the high level nittygritty, instead chat-apply-template handles high level framing as needed, decoupled from individual-template-standard helpers. NOTE: However with the limitted testing I have done with llama2 and llama3, and simple system prompts with sample user+assistant role messages included to end of system prompt (the prompts/chat-with-bob.txt or so), the dumb flow which my PR does ie treat the full system-prompt as a single system-role message and inturn the subsequent individual user queries as individual user-role messages, seems to be working as needed. NOT sure whether I fully understood what you are trying to tell especially given the very limited exploring of chat models that I have done more as a end user till now. But one of the two possibilities which I mentioned is what one needs to take potentially, depending on whether there is a universal standard followed by most models out there or not. Rather either way, the multitude of helpers based possibility-2 strategy, may be the most flexible and easily reusable for different scenarios one. It can even be adapted for the scenario described for possibility-1. Sorry for the long response. |
Now if you are talking about maintaining the context of all the interactions starting from system prompt to all the subsequent user + assistant dialogs/handshakes, wont each of them still be like independent handshake messages, with their own begin-end markers or so inturn concatenated together, or am I missing something fundamental here. IE are the models handshake interface / input-layer so brain dead that they require one to remove inbetween begin-end markers, in this case, as you seem to be indicating or ... Again as I mentioned I havent really looked much into llms (given that they are still more of a fuzzy+drunk (limited attention inspite of multi-head-attention, position encoding, etal) lossy compressed data pattern logics) and their handshakes, I was trying to use ollama to chat with this new llama3 and given my 8GB machine, the ollama was triggering metal-compute failures, because of some additional memory overhead it seems to be introducing (which if I run llama.cpp directly doesnt seem to create), as well as either way as llama.cpp didnt have llama3 chat template handling the system prompt (which I think ollama uses llama.cpp code to tag wrt handshake template) it was giving a warning telling it doesnt understand the chat template for handing system-prompt message, that is how I end up looking at llama.cpp and inturn added a simple llama3 template in chat-apply-template and inturn modified main to have a simple chat flow using chat-apply-template. |
Also with my PR, the changes which I have done to augument -i, I dont notice it triggering chat-apply-template more than once per user message, and things seem to be fine atleast wrt llama2 and llama3 (with my simple patch for same). However I havent looked at the server program's flow much. Where If I get correctly, you seem to indicate, that you call the chat-apply-template for all the messages again and again, each time you get a request from the web-browser side client logic (I assume because it sends the raw messages always, as it doesnt get the formatted messages or tokens back (I am assuming))? And here inturn is where I assume you are pondering about the inbetween begin-end markers or some aspect of it. However as I didnt notice any problem with my simple main-chat flow (unless some existing code in main already takes care of filtering out inbetween begin-end markers or so, I havent really looked too much into main beyond my immidiate need), doesnt, having the begin-end marker wrt each role message seperately, seem to create any problem in a true sense. Or put in a different way, if things are kept simple and if you where to call chat-apply-template seperately for each of the message in the bunch sent from client side, inturn leading to seperate begin-end markers or so wrt each of the role's message immidietly around it, does it create any issue? |
Also advantage of possibility-2 based kind of flow instead of a all in chat-apply-template, would be that, if there is text in the user's message which mimicks the underlying model's tags for any role or so, ie if it corresponds to a special token; the flow can be structured such that it wont be treated as a special token, when calling tokenizer. Because one can call chat-hs-user-prefix() and tokenize it with parse_special, call tokenize directly on user message without parse_special flag set, and call chat-hs-user-suffix() and tokenize it with parse_special set. And inturn concatenate the 3 tokenized vectors. Which is what the default interactive mode does, while my PR (chaton) doesnt currently. ie the way chat-apply-template is setup currently, where tokenization doesnt occur internally to it but instead where one passes the message, gets the tagged message back and then tokenizes it, there could be this issue of user message bringing in special tokens to bear fruit from their text content. Again I dont know the convention wrt this aspect and current chat programs out there, so mentioning here just in case. Rather for something like the system-prompt initial file (ie -f arg), treating special token related tags in them has special keeps things simple and also potentially adds flexibility to end user to trigger things the way they want. But for individual user messages later, when chatting, is where, whether special token related tags should be treated has special tokens or not may come into picture |
Small correction wrt possiblity-2 should be chat-hs-role-prefix and chat-hs-role-suffix, so that new roles can be added in future, using the same calls and infrastructure. Not sure why I suggested seperate chat-hs-user|system|assitant-prefix|suffix before, I was half asleep rather. Also if chat-apply-template additionaly returns a list of [[startpos1,endpos1], [startpos2,endpos2].....[startposN,endposN]], correpsonding to the user entered messages part, in the returned formatted/tagged string, then chat-apply-template's returned string can be used to ensure that when tokenising, these part corresponding to user messages are tokenized without parse_special |
As I understand, your Possibility-2 uses prefix/postfix system which is discussed in #5922 . The idea lead to nowhere so I though that using postfix/prefix was a bad idea. After a second thought (and also which what you suggested), my idea would be to firstly refactor the chat template system to support the postfix/prefix. This will save a lot of efforts in the future, since chat templates nowadays is just postfix/prefix based (with the exception of llama2 - we can keep using code logic for it)
The
The jinja template is the main reason why I never consider each message to be independent, since sometimes there're conditions inside the template (i.e. only add something if last message has a specific value,...). While each message is not independent one another, each pair of message should (the handshake that you mentioned), so a more efficient approach would be to evaluate a combination of |
See comments below |
You left a typo on the end of the url, think you meant to share master...ngxson:llama.cpp:xsn/chat_template_prefix_postfix instead of the |
It's actually not that bad. I'm looking at this one because it's easier to read than looking at the diff. I think it's a good start. Why not use a switch? It wouldn't clean it up? Makes me think of that old joke, "AI is nothing but a bunch of 'if' statements" 😅. |
You cannot use switch if inside the Also, |
Why is this an issue? Strings are pointers to allocated memory addresses. You would be jumping from address to the other. Or does C++ suffer from the same issue as C in a switch where it would need to be a integer or a character? If that's the case, could use a mapping or enum? Implement the logic within each case? Just some rough ideas and initial thoughts. |
Yes, see: https://stackoverflow.com/questions/650162/why-cant-the-switch-statement-be-applied-to-strings
That's why the function that you're looking on returns an enum. The functions (get postfix/prefix) using that enum have |
I would've reversed the implementation. I'd need to actually dig into this a bit more. I've been studying the code base in a small snippets once a week. I guess to clarify what I originally meant by the mapping is, "Why not map the enums to strings in an array, kind of like what we did for the models?". e.g. enum llama_chat_template {
LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED = 0,
LLAMA_CHAT_TEMPLATE_CHATML = 1, // Example: teknium/OpenHermes-2.5-Mistral-7B
LLAMA_CHAT_TEMPLATE_LLAMA2 = 2, // Original llama2 template (no <<SYS>> support)
// etc...
}
static const std::map<llama_chat_template, const char *> llama_chat_template_id { // I literally came up with id on the fly... grain of salt.
{ LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED, "unsupported" },
{ LLAMA_CHAT_TEMPLATE_CHATML, "chatml" },
{ LLAMA_CHAT_TEMPLATE_LLAMA2, "llama2" },
// etc...
} This could potentially allow you to do something similar to what you did in I'm not sure why you're using // detect template type
llama_chat_template ttmpl = llama_chat_get_template_type(curr_tmpl.c_str());
if (ttmpl == LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED) {
return -1;
} I see you're passing the template string as an argument here to set Use the enum to get the string which is the id of template. Currently, you have something akin to a reverse/inverse mapping. I'm exhausted right now, but yeah, that's my general line of thought at the moment. What do you think? |
Yeah, I would invert it. enum llama_chat_template {
LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED = 0,
LLAMA_CHAT_TEMPLATE_CHATML = 1, // Example: teknium/OpenHermes-2.5-Mistral-7B
LLAMA_CHAT_TEMPLATE_LLAMA2 = 2, // Original llama2 template (no <<SYS>> support)
// etc...
}
static const std::map<const char *, llama_chat_template> llama_chat_template_id { // I literally came up with id on the fly... grain of salt.
{ "unsupported", LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED },
{ "chatml", LLAMA_CHAT_TEMPLATE_CHATML },
{ "llama2", LLAMA_CHAT_TEMPLATE_LLAMA2 },
// etc...
} Path of least resistance. |
But anw it is code style and not the real implementation, we will see in later stage. For now, what's more interesting is to make it work. |
Hm. Could make it an array of 2 element arrays. In other words, an array of pointers to their respective strings. Each index is mapped accordingly. Same idea. Not as clean or readable as using a map directly, though. Might be better than the current branch implementation. // Define your enum
enum llama_chat_template {
LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED = 0,
LLAMA_CHAT_TEMPLATE_CHATML = 1,
LLAMA_CHAT_TEMPLATE_LLAMA2 = 2,
// etc...
};
// Define an array of character pointers for the strings
static const char* llama_chat_template_id[NUM_TEMPLATES] = {
"unsupported",
"chatml",
"llama2",
// etc...
}; Instead of using a data structure to store the mapping between enums and strings, you can directly use an array of character pointers for the strings and map the index of the enum to the corresponding string in the array. This approach is more memory-efficient as it avoids the overhead of storing extra metadata for each mapping. Here's a brief example: // Use the enum to map the index to the string
const char* template_name = llama_chat_template_id[LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED]; // "unsupported" This approach is simpler and more memory-efficient since it directly maps the enum to its corresponding string using the array index. However, it may be less flexible and less readable compared to using a data structure, as you cannot easily modify or extend the mapping without changing the code. struct llama_chat_template_map {
char* id;
enum llama_chat_template template_enum;
}
#include <cstring>
// Define your struct
struct llama_chat_template_map {
char id[32];
enum llama_chat_template template_enum;
};
// Initialize an array of structures
const size_t num_templates = 3;
static const llama_chat_template_map templates = {
{"unsupported", LLAMA_CHAT_TEMPLATE_NOT_SUPPORTED},
{"chatml", LLAMA_CHAT_TEMPLATE_CHATML},
{"llama2", LLAMA_CHAT_TEMPLATE_LLAMA2},
// etc...
};
// Use the struct in your code
const llama_chat_template_map& curr_tmpl = templates[0];
const char* template_name = curr_tmpl.id;
llama_chat_template ttmpl = curr_tmpl.template_enum; This approach makes the code more readable as the enum and its corresponding string are directly related within the same structure. Credit to Mistral-7B-v0.2 for assisting me with this. That's the best I got at the moment. I'm out of gas. |
I think we should see if we can have a generic chat-apply-template logic, independent of the different handshake-template standards, which inturn is driven by a json file, which provides the required role-prefxes and role-suffixes as well as flags for when system+user message combination is there and whether bos+user-role-tag-prefix should be included inbetween the system and 1st user message or not (a possible flow implemented in the PR) and any others like that... Below is the PR for a initial go at the same thought Idea is to reuse the generic flow as much as possible, as well as allow some amount of controlling of the handshake template insertion using a json without needing to recompile the code, which will also be helpful if a new model's handshake template standard matches the existing generic flow provided. |
I'm a bit confused by your use of deprecate here, would this mean potentially removing support for antiprompt and prefix/postfix in main at some point in the future? |
@Sebby37 I wanted to deprecate the usage of antiprompt, prefix, postfix for chat template in particular, but not for general usages, because:
@hanishkvc Sorry I was on a flight during last 24h, I will (progressively) continue my work when I recover from jetlag. |
I think a variation what you originally proposed in #5922 could handle readability quite well. Core idea is that if prefix/postfix are constrained to only static strings (no conditional logic, fully context-free), then this doesn't need to be a function so much as just a data structure — mapping template roles to a small handful of strings. Dynamic cases like Ex: Ultimately, simple string pre/postfixes would also be easier to expose as configuration. So you could potentially even justify NOT handling the edge cases, as long as users can account for them via config flags / server options. |
Yes, eventually we will implement this idea. Because modern chat templates (chatml, gemma, llama3,...) all work that way (i.e. different prefix for different roles), it's already possible to give the user the possibility to control prefix/postfix - just matter of time before we figure out the best way (and future-proof way) to do so. |
@kaizau FYI, I updated the description of this issue. The issue with prefix/postfix that I mentioned earlier is no longer correct (after some iterations). I'm now more confident that the idea of having dynamic prefix/postfix will work out. |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Motivation
From the day I added
llama_chat_apply_template
#5538 , I already started thinking about adding it intomain.cpp
for replacing the current-cml
option. However, it is not as easy as it seems. The main reason is becausemain.cpp
still rely on antiprompt and static prefix/postfix/infix to work with "chat".The whole reason why antiprompt exist in the first place was because in the early era of LLMs:
However, a lot of things changed since then: we're now having the notion of "chat template", newer models have special tokens like
<|user|>
to replaceHuman:
, most models are fine-tuned to stop generation by outputting EOS token...For that reason, using antiprompt and static prefix/postfix/infix is no longer a viable option to add chat template into
main.cpp
. That force us to be a bit more creative.Possible Implementation
Chatml uses
"<|im_start|>" + role + "\n"
as prefix (role
is dynamic based on current message);<|im_end|>\n
is the postfix.This idea is being implemented in Refactor chat template API #6822
llama_token_is_eog()
to replace antiprompt. Additionally, for compatibility reason, we can translate EOG token to antiprompt, because some models output the antiprompt as a sequence of multiple tokens (newer models never do this).Old proposal (outdated)
Possible Implementation
My idea is trying to use
llama_chat_apply_template
inmain.cpp
. This will effectively deprecate antiprompt, prompt prefix/postfix and cml options.Format the chat on-the-go
For now,
llama_chat_apply_template
produce very "additive" result when a new message is added to the list.An additive means for example if I have
[msg1, msg2]
, then I get formatted chatmsg1_msg2
. When I addmsg3
to the list, it must add the formattedmsg3
to the end of the formatted chat, without touching the existing content, it results inmsg1_msg2_msg3
in this example. A wrong result maybemsg1+++msg2_msg3
This is very important. Unlike
server.cpp
where we clear the KV cache and re-format a new prompt each time,main.cpp
add new tokens on top of existing ones, then continues the generation until a condition is met (maybe EOS token or a stop sequence).So, to use
llama_chat_apply_template
inmain.cpp
, a test case must be added to test all chat templates to make sure they are all "additive".main.cpp
can then keep track of a list of messages, re-apply chat template each time and take only the "added" part.Example:
[msg1, msg2]
==> Formatted:<user>msg1<assistant>msg2
[msg1, msg2, msg3]
==> Formatted:<user>msg1<assistant>msg2<user>msg3
==> Part to evaluate:<user>msg3
Manage stop sequences
While it is ideal to use stop token (for example, EOS or
<|im_end|>
) to stop generation, not all models support this (some models still breaks<|im_end|>
into<|
,im
,_end
,|>
), so using stop token is not an option.llama_chat_apply_template
should returns the stop sequence along side with the formatted chat template ==> this is what we need to add.The text was updated successfully, but these errors were encountered: