
Server: use llama_chat_apply_template #5593

Merged (5 commits) on Feb 20, 2024

Conversation

@ngxson (Collaborator) commented on Feb 19, 2024

Closes #5575

This PR changes the usage of --chat-template introduced in #5425. The parameter now accepts a Jinja template instead of a type name.

If --chat-template is not specified, the default template (taken from the model metadata) is used instead.

This PR also fixes an issue where llama_chat_apply_template did not read the metadata correctly.

CC @ggerganov and @cebtenzzre for review. Thank you!
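For context, this is roughly how the new code path drives llama_chat_apply_template. The sketch below is illustrative rather than the exact server code; it assumes the llama.h API as of this PR, where passing a null template pointer makes the function fall back to the template stored in the model metadata, and where the return value is the number of bytes the formatted output needs:

```cpp
#include <string>
#include <vector>

#include "llama.h"

// Sketch: format a fixed two-message chat using the model's default template.
// Passing nullptr as the template makes llama_chat_apply_template read
// tokenizer.chat_template from the model metadata.
static std::string apply_default_template(const llama_model * model) {
    llama_chat_message chat[] = {
        {"system", "You are a helpful assistant."},
        {"user",   "hi, how are you"},
    };

    std::vector<char> buf(1024);
    int32_t res = llama_chat_apply_template(model, nullptr, chat, 2, /* add_ass */ true,
                                            buf.data(), (int32_t) buf.size());
    if (res < 0) {
        return ""; // the template is not supported
    }
    if ((size_t) res > buf.size()) {
        // the return value is the required length, so retry with a larger buffer
        buf.resize(res);
        res = llama_chat_apply_template(model, nullptr, chat, 2, true,
                                        buf.data(), (int32_t) buf.size());
    }
    // construct the string with the exact length to avoid trailing '\0' bytes
    return std::string(buf.data(), res);
}
```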

@@ -2390,12 +2391,13 @@ static void server_params_parse(int argc, char **argv, server_params &sparams,
break;
}
std::string value(argv[i]);
Owner:

value seems unused now?

Collaborator (Author):

Yeah, it's unused; I forgot to remove it. It's now removed.

std::ostringstream output;
bool is_inside_turn = false;
// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
inline bool verify_custom_template(std::string tmpl) {
Owner:

Suggested change:
-inline bool verify_custom_template(std::string tmpl) {
+inline bool verify_custom_template(const std::string & tmpl) {
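As an aside, the check this function performs can be done by attempting to apply the template to a dummy message and inspecting the return code. A minimal sketch, assuming llama_chat_apply_template returns a negative value for templates it does not recognize and that the model pointer is not consulted when an explicit template string is supplied (the actual body in this PR may differ slightly):

```cpp
// Check whether the template supplied via "--chat-template" is supported.
// Sketch: apply the template to a single dummy message; a negative result
// is assumed to mean the template is not recognized.
inline bool verify_custom_template(const std::string & tmpl) {
    llama_chat_message chat[] = {{"user", "test"}};
    std::vector<char> buf(64);
    int32_t res = llama_chat_apply_template(nullptr, tmpl.c_str(), chat, 1, true,
                                            buf.data(), (int32_t) buf.size());
    return res >= 0;
}
```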

Comment on lines 186 to 192
for (size_t i = 0; i < messages.size(); ++i) {
auto &curr_msg = messages[i];
str[i] = json_value(curr_msg, "role", std::string(""));
str[i + 1] = json_value(curr_msg, "content", std::string(""));
alloc_size += str[i + 1].length();
chat[i].role = str[i].c_str();
chat[i].content = str[i + 1].c_str();
Owner:

There seems to be a bug here. Maybe change to str[2*i + 0] = ... and str[2*i + 1] = ...

Collaborator (Author):

Thanks for noticing that. That explains why the bot's response was quite weird when I tested this PR yesterday.

Fixed in c53b34d

Looking at the debug log, I can confirm that the formatted chat is correct:

{"timestamp":1708423580,"level":"VERBOSE","function":"format_chat","line":208,"message":"formatted_chat","text":"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhi, how are you<|im_end|>\n<|im_start|>assistant\n"}

examples/server/server.cpp: review comment (outdated, resolved)
@ngxson ngxson marked this pull request as draft February 20, 2024 14:13
@ngxson (Collaborator, Author) commented on Feb 20, 2024

I still see a weird bug where the chat is formatted correctly, but then \u0000 characters are appended when it's tokenized. I've converted this to a draft while I investigate:

{"timestamp":1708438302,"level":"VERBOSE","function":"format_chat","line":208,"message":"formatted_chat","text":"[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nhi, how are you [/INST]"}
{"timestamp":1708438302,"level":"VERBOSE","function":"start_loop","line":293,"message":"have new task"}
{"timestamp":1708438302,"level":"VERBOSE","function":"start_loop","line":305,"message":"callback_new_task"}
slot 0 is processing [task id: 0]
{"timestamp":1708438302,"level":"VERBOSE","function":"start_loop","line":308,"message":"callback_all_task_finished"}
slot 0 : kv cache rm - [0, end)
{"timestamp":1708438302,"level":"VERBOSE","function":"update_slots","line":1685,"message":"prompt ingested","n_past":0,"cached":"","to_eval":"[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\nhi, how are you [/INST]\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000\u0000"}

Edit: found it! I forgot to call buf.resize() after receiving the result from llama_chat_apply_template.

Works fine now (tested with both the chatml and llama2 templates).
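A sketch of the fix, with names (tmpl, model, chat, alloc_size) reused from the format_chat code above for illustration. The assumption is that llama_chat_apply_template returns the number of bytes actually needed, so the output string must be built from that length rather than the full buffer, otherwise the unused tail of the buffer leaks into the prompt as \u0000 padding:

```cpp
const char * ptr_tmpl = tmpl.empty() ? nullptr : tmpl.c_str();
std::vector<char> buf(alloc_size * 2); // heuristic initial size

// first pass: returns the required output length, which may exceed buf.size()
int32_t res = llama_chat_apply_template(model, ptr_tmpl, chat.data(), chat.size(), true,
                                        buf.data(), (int32_t) buf.size());
if ((size_t) res > buf.size()) {
    // the missing step: resize the buffer to the real length and run again
    buf.resize(res);
    res = llama_chat_apply_template(model, ptr_tmpl, chat.data(), chat.size(), true,
                                    buf.data(), (int32_t) buf.size());
}

// build the string with the exact length so no '\0' padding reaches tokenization
std::string formatted_chat(buf.data(), res);
```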

@ngxson ngxson marked this pull request as ready for review February 20, 2024 14:21
@ngxson ngxson merged commit 9c405c9 into ggerganov:master Feb 20, 2024
44 checks passed
@ibehnam (Contributor) commented on Feb 20, 2024

@ngxson

Since this is a breaking change, it'd be good to update the server README to mention that the chat-template arg is now different. An example would be nice too.

Also, I found the following message vague. What are the "common" templates?

--chat-template JINJA_TEMPLATE
                            set custom jinja chat template (default: template taken from model's metadata)
                            Note: only commonly used templates are accepted, since we don't have jinja parser

@ngxson (Collaborator, Author) commented on Feb 21, 2024

@ibehnam Yeah, I forgot about the docs. You're right; in fact, I was thinking about how to make it clear which templates we support when showing this help, but the problem is that it depends on llama_chat_apply_template. That function is the one that must be documented.

My idea is to add a section to the server's docs that shows how to use --chat-template, then include a link to llama_chat_apply_template where users can see a list of supported templates.

jordankanter pushed a commit to jordankanter/llama.cpp that referenced this pull request Mar 13, 2024
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024