
implement prompt template for chat completion #717

Open
ehartford opened this issue Sep 14, 2023 · 37 comments

Labels
enhancement New feature or request

Comments

@ehartford

ehartford commented Sep 14, 2023

Is your feature request related to a problem? Please describe.
When generating a chat completion, the prompt is hard-coded to a non-standard template that looks something like:

### User: <blabla>
### Assistant: <blabla>

The system message is currently ignored.

f'### {"Human" if message["role"] == "user" else "Assistant"}:{message["content"]}'

This mostly works for most models, but it isn't correct.

Describe the solution you'd like

  1. Add a set of built-in prompt templates the user can specify at inference time: ["vicuna", "alpaca", "chatml", "llama2-chat", "oasst"] at minimum.
  2. I recommend copying the design from ooba's instruction templates or fastchat's conversation templates.
  3. Add the ability to pass a template string for other nonstandard formats (such as the one currently implemented in llama-cpp-python); a rough sketch of the idea follows below.
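
For illustration only, a minimal sketch of what a named-template registry plus a user-supplied template string could look like (every name here, e.g. PROMPT_TEMPLATES and format_prompt, is hypothetical and not an existing llama-cpp-python API):

    from typing import Dict, List, Optional

    # Hypothetical registry of named, built-in per-turn prompt templates.
    PROMPT_TEMPLATES: Dict[str, str] = {
        "chatml": "<|im_start|>{role}\n{content}<|im_end|>\n",
        "vicuna": "{role_upper}: {content}\n",
    }

    def format_prompt(
        messages: List[dict],
        template: str = "chatml",
        custom_template: Optional[str] = None,
    ) -> str:
        """Render OpenAI-style messages with a named or user-supplied template."""
        turn = custom_template or PROMPT_TEMPLATES[template]
        return "".join(
            turn.format(role=m["role"], role_upper=m["role"].upper(), content=m["content"])
            for m in messages
        )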

Describe alternatives you've considered
Modifying llama-cpp-python to hard-code the llama2-chat format; not a great solution.

Additional context

@ehartford
Author

This is a hefty task, with architecture/design elements needed to do it in a clean way. I am too busy to take it on myself right now, but in a couple of weeks I can try if nobody else has done it.

@abetlen
Owner

abetlen commented Sep 14, 2023

Hey @ehartford this is actually something I've had in the backlog and just started last night in #711

My plan is to have those format identifiers and also provide a generic class (?) that users can extend to provide a custom chat template. The challenge is that it's not just the prompt that has to be modified but also the stop sequences, the grammar (in the case of OpenAI-style function-calling chats), and a few more things I probably haven't thought of, but I think this is doable.
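
A very rough sketch of what such an extensible formatter might look like (the class and field names below are made up for illustration and are not the eventual API):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class FormattedChat:
        prompt: str
        stop: List[str] = field(default_factory=list)
        grammar: Optional[str] = None  # e.g. for OpenAI-style function calling

    class ChatFormatter:
        def format(self, messages: List[dict]) -> FormattedChat:
            raise NotImplementedError

    class Llama2ChatFormatter(ChatFormatter):
        def format(self, messages: List[dict]) -> FormattedChat:
            prompt = ""
            for m in messages:
                if m["role"] == "system":
                    prompt += f"<<SYS>>\n{m['content']}\n<</SYS>>\n\n"
                elif m["role"] == "user":
                    prompt += f"[INST] {m['content']} [/INST]"
                else:
                    prompt += f" {m['content']} "
            return FormattedChat(prompt=prompt, stop=["</s>"])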

Thank you for the resources btw!

@ehartford
Author

Awesome, thanks! I think it can be done without requiring the user to write any code, using a clever template system like the ones implemented by ooba and fastchat.

@lexin4ever
Contributor

lexin4ever commented Sep 15, 2023

As a workaround, I use the /v1/completions method (the create_completion function in llama.py instead of create_chat_completion), which allows me to set up any prompt. I can format the messages however I want and pass them as the string prompt param.
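
Roughly, that workaround looks like this (the model path and the ChatML template here are placeholders; any format can be injected this way):

    from llama_cpp import Llama

    llm = Llama(model_path="./models/model.gguf")

    # Build whatever prompt format the model expects, e.g. ChatML.
    prompt = (
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\nHello!<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

    out = llm.create_completion(prompt=prompt, stop=["<|im_end|>"], max_tokens=256)
    print(out["choices"][0]["text"])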

@ehartford
Author

True, but that would require rewriting chatbot-ui to use /completions instead of /chat/completions (or using a reverse proxy to do that).

@ehartford
Author

Then my solution is to make a proxy that receives calls to /chat/completions and rewrites them into calls to llama-cpp-python's /completions endpoint, in order to inject the proper prompt format.
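
Sketched out, such a proxy could look roughly like the following (Flask, the upstream URL, and the ChatML rendering are all assumptions; a complete proxy would also need to rewrite the /completions response back into chat-completion shape):

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    UPSTREAM = "http://localhost:8000/v1/completions"  # llama-cpp-python server

    def to_prompt(messages):
        # Render the chat messages in the format the model actually expects.
        parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
        return "\n".join(parts) + "\n<|im_start|>assistant\n"

    @app.route("/v1/chat/completions", methods=["POST"])
    def chat_completions():
        body = request.get_json()
        payload = {
            "prompt": to_prompt(body["messages"]),
            "stop": ["<|im_end|>"],
            "max_tokens": body.get("max_tokens", 256),
        }
        resp = requests.post(UPSTREAM, json=payload, timeout=600)
        return jsonify(resp.json())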

@teleprint-me
Contributor

teleprint-me commented Sep 15, 2023

I just ran through some rough drafts with GPT.


Proposal for Advanced Customizable Prompts in Chat Completions

Problem Statement

The existing implementation for chat completions uses hard-coded prompts, constraining customization and flexibility. This limitation becomes evident when adapting the code for specific projects or applications that require unique prompt styles or formats.

    PROMPT = chat_history + "### Assistant:"
    PROMPT_STOP = ["### Assistant:", "### Human:"]

Proposed Solution

I propose adding two new optional parameters, prompt and prompt_stop, to the create_chat_completion method. These will allow users to specify custom prompt and stop token sequences.

    def create_chat_completion(
        # ...existing parameters...
        prompt: Optional[str] = None,
        prompt_stop: Optional[List[str]] = None,
    ):
        # ...
        PROMPT = chat_history + (prompt if prompt else "### Assistant:")
        PROMPT_STOP = prompt_stop if prompt_stop else ["### Assistant:", "### Human:"]
        # ...

Benefits

  1. Enhanced Flexibility: These changes offer users a high level of customization for prompt structures.
  2. Wider Applicability: Extending prompt customization increases the adaptability of the code to different use-cases.
  3. Ease of Use: The optional parameters maintain the cleanliness of the API while offering more freedom to users.

Backward Compatibility

The proposal maintains backward compatibility since both new parameters are optional and will use existing hard-coded values as defaults.

Suggested Defaults

In the absence of custom prompts, the system could default to prompts styled after Llama-2's structure:

B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
DEFAULT_SYSTEM_PROMPT = """You are a helpful assistant."""

Practical Examples

    my_custom_prompt = ">>> Custom Assistant:"
    my_custom_stop = [">>> Custom Assistant:", ">>> Custom User:"]
    create_chat_completion(
        messages=...,
        prompt=my_custom_prompt,
        prompt_stop=my_custom_stop,
        # ...other parameters...
    )

Related Works

This proposal aims to integrate well with ongoing work in the configurable-chat-templates branch and issues like #711, focusing on handling bos and eos tokens.


I'm not sure if this fits well with what you guys had in mind. Let me know either way. I had the same idea though.

@teleprint-me
Contributor

teleprint-me commented Sep 16, 2023

I was looking at Open Interpreter and its source code was using litellm. So I figured I'd take a peek at it and vllm.

I checked out the docs for litellm templates and they have a fairly nice structure for prefixing and postfixing.

# Create your own custom prompt template
import litellm
from litellm import completion

messages = [{"role": "user", "content": "Hello, how are you?"}]  # example messages

litellm.register_prompt_template(
    model="togethercomputer/LLaMA-2-7B-32K",
    roles={
        "system": {
            "pre_message": "[INST] <<SYS>>\n",
            "post_message": "\n<</SYS>>\n [/INST]\n"
        },
        "user": {
            "pre_message": "[INST] ",
            "post_message": " [/INST]\n"
        },
        "assistant": {
            "post_message": "\n"
        }
    }
)

def test_huggingface_custom_model():
    model = "huggingface/togethercomputer/LLaMA-2-7B-32K"
    response = completion(model=model, messages=messages, api_base="https://ecd4sb5n09bo4ei2.us-east-1.aws.endpoints.huggingface.cloud")
    print(response['choices'][0]['message']['content'])
    return response

test_huggingface_custom_model()

Found it pretty interesting because you can feed in the structure as a dict and then grab the values by the keys.

I ran through it with GPT again and this is what it came up with as a proof-of-concept.


Revised Proposal for Role-Based Customizable Prompts in Chat Completions

Problem Statement

The current chat completions implementation relies on hard-coded prompts, limiting customization and flexibility. This is a bottleneck when adapting the code to specialized projects requiring unique role-based prompt styles or formats.

Proposed Solution

Replace the previously proposed prompt and prompt_stop parameters with a single role_templates parameter on the create_chat_completion method. This will offer users the capability to specify custom role-based formatting for different parts of the conversation.

def create_chat_completion(
    # ...existing parameters...
    role_templates: Optional[Dict[str, Dict[str, str]]] = None
):
    # ...existing code...

Benefits

  1. Enhanced Flexibility: Users gain high levels of customization for role-based prompt structures.
  2. Wider Applicability: The new architecture is adaptable to various chat roles and use-cases.
  3. Ease of Use: Replacing multiple parameters with a single role_templates parameter streamlines the API.

Backward Compatibility

This change maintains backward compatibility since the role_templates parameter is optional and defaults to existing hard-coded values if not provided.

Suggested Defaults

A reasonable default could mirror Llama-2's prompt structure:

DEFAULT_ROLE_TEMPLATES = {
    "system": {
        "pre_message": "[INST] <<SYS>>\n",
        "post_message": "\n<</SYS>>\n [/INST]\n"
    },
    "user": {
        "pre_message": "[INST]",
        "post_message": " [/INST]\n"
    },
    "assistant": {
        "post_message": "\n"
    }
}
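
A hypothetical helper showing how role_templates could be applied to the messages (apply_role_templates is illustrative, not an existing function; it reuses the DEFAULT_ROLE_TEMPLATES above):

    from typing import Dict, List

    def apply_role_templates(
        messages: List[dict], role_templates: Dict[str, Dict[str, str]]
    ) -> str:
        prompt = ""
        for m in messages:
            tpl = role_templates.get(m["role"], {})
            prompt += tpl.get("pre_message", "") + m["content"] + tpl.get("post_message", "")
        return prompt

    print(apply_role_templates(
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ],
        DEFAULT_ROLE_TEMPLATES,
    ))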

Related Works

  • Inspired by role-based prompt templates in litellm.
  • Could also integrate with custom roles specific to llama-cpp-python.

I know it won't be that simple after reviewing the code.

Just wanted to share. Maybe it would inspire something.

@NightMachinery

What models actually use the current chat prompt template?

It seems most models use Alpaca's format:

<System prompt/Character Card>

### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.

### Response:

@teleprint-me
Contributor

@NightMachinery

It depends on the dataset and how it's trained and/or finetuned.

The format varies from model to model, but the two most popular formats are usually

"### Instruction:" and "### Assistant:"

and

"### Human:" and "Assistant:"

Sometimes it's "### Human:" and "### Bot:"

Open Assistant uses a mixture depending on the version and dataset: "prompter:", "human:", and "assistant:".

Some models are more complex than others, e.g. a system prompt, input, instruction, and then a response.

There's no fixed, or commonly accepted, format yet as far as I can tell.

Most chat models follow system, user, assistant, or some variation. Whether there are tokens that are used to denote which is which depends.

@ehartford
Author

The closest thing to a standard is ChatML, and even that isn't widely accepted.

I've adopted it, and open assistant has adopted it. Vicuna and wizardLM haven't.

Hopefully a consensus emerges in the next year.

@teleprint-me
Contributor

@ehartford

I'm for ChatML.

The high-level interface is intuitive and easy to reason about and follow.

The low-level interface is similar to what Meta did with Llama-2's chat interface.

The tokens could probably be simplified though. Maybe the use of something more like markup would be an improvement?

<system>System prompt goes here</system>
<user>User prompt goes here</user>
<assistant>

And just "teach" the model that </tag> is always the stop token for that token sequence.

Then the output could be parsed similarly to XML/HTML.
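
For example, the assistant's reply could then be pulled out with a simple pattern match (a toy sketch, assuming the model reliably emits the closing tag):

    import re

    raw = "<assistant>Hello User, my name is Llama. Nice to meet you!</assistant>"
    match = re.search(r"<assistant>(.*?)</assistant>", raw, re.DOTALL)
    if match:
        print(match.group(1))  # -> Hello User, my name is Llama. Nice to meet you!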

I'm still learning, so just take what I'm saying with a friendly grain of salt.

This is something I plan on experimenting with if I get the opportunity to do it in the future.

I agree though, a consensus would be nice.

@ehartford
Author

ehartford commented Sep 20, 2023

What models actually use the current chat prompt template?

It seems most models use Alpaca's format:

<System prompt/Character Card>

### Instruction:
Your instruction or question here.
For roleplay purposes, I suggest the following - Write <CHAR NAME>'s next reply in a chat between <YOUR NAME> and <CHAR NAME>. Write a single reply only.

### Response:

The current scheme implemented in llama-cpp-python doesn't follow a convention I know of.

Please see the links in my original issue for a comprehensive and detailed list of the currently popular prompt templates.

90%+ of use cases will be covered if the following formats are supported:

  • Llama-2-chat
  • ChatML
  • Vicuna
  • WizardCoder
  • Alpaca
  • OpenAssistant

The best source of documentation on these prompt formats is probably the model cards in TheBloke's distributions, which are very well researched.

@abetlen
Owner

abetlen commented Sep 30, 2023

Hey @ehartford, I just merged the #711 PR, which adds a mechanism to specify common chat formats through a chat_format parameter on the Llama class and the server settings.

Currently supports:

  • llama-2
  • alpaca
  • vicuna
  • oasst_llama
  • openbuddy
  • redpajama-incite
  • snoozy
  • phind
  • open-orca

Let me know if that works for you!
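
For reference, usage would look roughly like this (the model path is a placeholder):

    from llama_cpp import Llama

    llm = Llama(model_path="./models/llama-2-7b-chat.gguf", chat_format="llama-2")

    response = llm.create_chat_completion(
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"},
        ]
    )
    print(response["choices"][0]["message"]["content"])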

@ehartford
Author

Nice!

@ehartford
Author

ChatML would be lovely; it's garnering more support.

@r7l

r7l commented Sep 30, 2023

I've noted this as well, and it's great to see support being added just now. But looking at the code, it seems as if there is room to make it a bit more flexible and customizable.

For example, LocalAI allows people to add YAML files with a configuration preset for each model. I really like their idea in general.

Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file and pass the contents of the file into a format_from_yaml-like function.

@NightMachinery

I've noted this as well, and it's great to see support being added just now. But looking at the code, it seems as if there is room to make it a bit more flexible and customizable.

For example, LocalAI allows people to add YAML files with a configuration preset for each model. I really like their idea in general.

Maybe it would be an option for the future to have something similar. Instead of having everything fixed in the code, allow people to add a YAML file and pass the contents of the file into a format_from_yaml-like function.

Or better yet, accept a general lambda as an argument and implement the YAML idea as a specific lambda that can take a YAML file and template the response. E.g.,

chat_template_fn=partial(yaml_template_reader, yaml_path="...")
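
A minimal sketch of what that could look like (yaml_template_reader and the YAML layout are hypothetical, loosely following the litellm-style role templates discussed earlier):

    from functools import partial
    from typing import List

    import yaml  # pyyaml

    def yaml_template_reader(messages: List[dict], yaml_path: str) -> str:
        with open(yaml_path) as f:
            spec = yaml.safe_load(f)
        prompt = ""
        for m in messages:
            role = spec["roles"][m["role"]]
            prompt += role.get("pre_message", "") + m["content"] + role.get("post_message", "")
        return prompt

    chat_template_fn = partial(yaml_template_reader, yaml_path="llama-2-chat.yaml")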

@teleprint-me
Contributor

teleprint-me commented Sep 30, 2023

Anything besides YAML, please 🙏. Simple is always better.

@abetlen
Owner

abetlen commented Sep 30, 2023

@ehartford I'll add that and a few others I missed (mistral as well).

@r7l I'll consider this but likely as a utility for the server that converts a config file / template into a chat formatting function.

@bioshazard
Contributor

bioshazard commented Oct 1, 2023

Thank you for your work on chat templates and llama-cpp-python generally!!

Curious if you could just universally piggyback on the HuggingFace template hub, or let users specify a tokenizer_config.json, to completely outsource it to this developing standard of rendering arbitrary Jinja provided with the model? I would be surprised if new model releases don't all start coming with their own tokenizer config definition.

EDIT: Delivered the above in the linked PR.

@LynxPDA

LynxPDA commented Oct 16, 2023

@abetlen

Generation does not stop when using the ChatML prompt template. I think we need to add a stop token: stop_str = "<|im_end|>" and return ChatFormatterResponse(prompt=_prompt, stop=stop_str).

This worked for me:

@register_chat_format("chatml")
def format_chatml(
    messages: List[llama_types.ChatCompletionRequestMessage],
    **kwargs: Any,
) -> ChatFormatterResponse:
    system_template = """<|im_start|>system
{system_message}"""
    system_message = _get_system_message(messages)
    system_message = system_template.format(system_message=system_message)
    _roles = dict(user="<|im_start|>user", assistant="<|im_start|>assistant")
    _sep = "<|im_end|>"
    stop_str = "<|im_end|>"
    _messages = _map_roles(messages, _roles)
    _messages.append((_roles["assistant"], None))
    _prompt = _format_chatml(system_message, _messages, _sep)
    return ChatFormatterResponse(prompt=_prompt, stop=stop_str)

@bioshazard
Contributor

A default stop token would be huge for realizing a transparent model provider.

@teleprint-me
Contributor

teleprint-me commented Oct 16, 2023

There's a problem with using the stop tokens.

I'm not sure what the difference is yet, but I noticed that using the special tokens in the user-facing templates causes a lot of issues.

I would advise not using special tokens at all with llama.cpp. In almost every test I conducted, the models started repeating themselves, derailing, and more.

Using the base template seems to work beautifully though. Not a single issue once I do that.

@bioshazard
Contributor

So you exclude them from the template, but you would still set it as a default stop sequence item, right? That saves you from having to specify it in the payload, and it will be needed to realize total model-backing transparency, fully decoupling the model from the chat consumer (other than maybe max tokens in the payload).

@teleprint-me
Contributor

teleprint-me commented Oct 16, 2023

I noted it in my PR on L14.


IMPORTANT NOTES:

  • Omitted for brevity [...]

  • Special tokens are crucial for the model's underlying operations, impacting pre-training, fine-tuning, and low-level inference processes. Users should avoid modifying special tokens to prevent issues in the model's output during inference. These issues may manifest as token fixation, repetitive language patterns, contextual derailment, and hallucinations. Improper use of separators and templates can exacerbate these problems.

Example using the llama-2 model and its templating schema:

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  [INST] Hello Llama, my name is User. What's your name? [/INST]$
  3  Hello User, my name is Llama. Nice to meet you!$
  4  [INST] What can you do? [/INST]$
  5  I can assist you with various tasks, including providing structured output for certain queries.$
  6  [INST] How can you assist me in my programming projects? [/INST]$
  7  $

This initial example is a proper template format that the model understands. It results in proper output and does not confuse the model.

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
  3  Hello User, my name is Llama. Nice to meet you!</s>$
  4  <s>[INST] What can you do? [/INST]$
  5  I can assist you with various tasks, including providing structured output for certain queries.</s>$
  6  <s>[INST] How can you assist me in my programming projects? [/INST]$
  7  $

This example includes the use of special tokens, and the model may or may not use these tokens as a result. The model is not expecting them during inference, which causes unexpected behavior.

  1  <<SYS>>My name is Llama and I am a helpful assistant.<</SYS>>$
  2  $
  3  <s>[INST] Hello Llama, my name is User. What's your name? [/INST]$
  4  Hello User, my name is Llama. Nice to meet you!</s>$
  5  $
  6  <s>[INST] What can you do? [/INST]$
  7  I can assist you with various tasks, including providing structured output for certain queries.</s>$
  8  $
  9  <s>[INST] How can you assist me in my programming projects? [/INST]$
 10  $

This example is improperly formatted and causes the model to become confused. The model begins to fixate on tokens, uses language repetition, and eventually derails.


Note that the $ symbols are substitutes for newline characters, i.e. \n; they're part of the output of cat -A.

@Mwni

Mwni commented Nov 2, 2023

I propose we use a dedicated library for this: chatformat.
The functionality to format chat prompts is not specific to this project. Creating a shared library will help other developers, but also help us by attracting contributions from outside. It's a win-win.

Additionally, what I'm missing with the current implementation is the possibility to "preface" the model's output.
Prefacing means "putting words into the model's mouth".

The issue is that the current implementation seals off the last incomplete message:

USER: How do you feel?
ASSISTANT: I feel </s>

The </s> should not be there.
Chatformat leaves the last assistant message open by default.

@earonesty
Contributor

earonesty commented Nov 3, 2023

We should probably use Jinja templates, so the user can specify them at runtime if needed. Engines won't know which template to use; users can have models they just finished fine-tuning, with custom grammars, etc.

This will get everyone what they want:

  • use a well-known "named" template, or
  • make one up on the fly

@Mwni

Mwni commented Nov 3, 2023

@earonesty Good point about having custom templates. But I think using a templating engine is overcomplicating the matter. These chat formats generally consist of "rounds" that are stacked together.

A round is defined as

  1. A system message (optional)
  2. A user prompt
  3. The model response

We can cover 99% of all possible formats by

  • defining the template string for the first round with system prompt
  • defining the template string for consecutive rounds without system prompt
  • defining how to join rounds

So for example for Alpaca, the format can be defined as:

alpaca:
  with_system: |-
    {system}

    ### Instruction:
    {user}

    ### Response:
    {assistant}</s>

  without_system: |-
    ### Instruction:
    {user}

    ### Response:
    {assistant}</s>

  round_separator: "\n\n"

If you know of a format that is not covered by this convention, please comment.
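
For what it's worth, rendering with this convention could look something like the sketch below (render_rounds is illustrative; the last assistant slot is left open, which also covers the "prefacing" point raised above):

    ALPACA = {
        "with_system": "{system}\n\n### Instruction:\n{user}\n\n### Response:\n{assistant}</s>",
        "without_system": "### Instruction:\n{user}\n\n### Response:\n{assistant}</s>",
        "round_separator": "\n\n",
    }

    def render_rounds(rounds, fmt):
        """rounds: a list of dicts with optional 'system', plus 'user' and 'assistant' keys."""
        rendered = []
        for i, r in enumerate(rounds):
            tpl = fmt["with_system"] if i == 0 and r.get("system") else fmt["without_system"]
            text = tpl.format(
                system=r.get("system", ""), user=r["user"], assistant=r.get("assistant", "")
            )
            if i == len(rounds) - 1:
                # Keep the last assistant message open so the model continues it.
                text = text.removesuffix("</s>")
            rendered.append(text)
        return fmt["round_separator"].join(rendered)

    prompt = render_rounds(
        [
            {"system": "You are a very clever LLM.", "user": "Hello?", "assistant": "Hello."},
            {"user": "What are you thinking?", "assistant": "I think that"},
        ],
        ALPACA,
    )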

abetlen added a commit that referenced this issue Nov 5, 2023
@teleprint-me
Contributor

teleprint-me commented Nov 6, 2023

@Mwni

If you know of a format that is not covered by this convention, please comment.

Llama-1, Llama-2, RedPajama, Mistral, Refact, etc...

@earonesty
Contributor

earonesty commented Nov 6, 2023 via email

@Mwni

Mwni commented Nov 6, 2023

@teleprint-me The following prompts were generated using the proposed scheme

Llama-2

<s>[INST] <<SYS>>
You are a very clever LLM.
<</SYS>>

Hello? [/INST] Hello.</s><s>[INST] What are you thinking? [/INST] I think that

Vicuna (and Mistral)

You are a very clever LLM.

USER: Hello?
ASSISTANT: Hello.</s>
USER: What are you thinking?
ASSISTANT: I think that

ChatML

<|im_start|>system
You are a very clever LLM.<|im_end|>
<|im_start|>user
Hello?<|im_end|>
<|im_start|>assistant
Hello.<|im_end|>
<|im_start|>user
What are you thinking?<|im_end|>
<|im_start|>assistant
I think that

Where's the problem?

@earonesty
Contributor

The problem is that Jinja2 is what is sitting in HF config files, so it's forward compatible with stuff you haven't heard of, and it can be pulled into GGUF file metadata so that the user isn't on the hook to specify a template when working with GGUF files. It has a forward-compatibility path that matters.
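
A rough sketch of that flow, rendering an HF-style chat_template with jinja2 (the variable names follow HF's chat-template convention; the file path and token values are placeholders):

    import json

    from jinja2 import Template

    with open("tokenizer_config.json") as f:
        chat_template = json.load(f)["chat_template"]

    messages = [
        {"role": "system", "content": "You are a very clever LLM."},
        {"role": "user", "content": "Hello?"},
    ]

    prompt = Template(chat_template).render(
        messages=messages,
        bos_token="<s>",
        eos_token="</s>",
        add_generation_prompt=True,
    )
    print(prompt)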

@bioshazard
Contributor

Having it load right from the metadata would be killer.

@teleprint-me
Contributor

@earonesty

Where in the specification is that? Also, ggerganov already stated he plans on using oblique templates and it will be a minimal, separate implementation.

@earonesty
Contributor

earonesty commented Nov 6, 2023

GGUF allows you to store any metadata you want. Models on HF have Jinja2 templates in their tokenizer configs, so really, the specification doesn't matter that much. We can just add it to the convert script.

@madprops

So can I define a custom format like:

"<|start_header_id|>{name}<|end_header_id|>\n\n"

How? So far I have a dropdown box that selects the pre-defined formats.
