
Add Chat Template Support to vLLM #1493

Closed · wants to merge 7 commits from the chat_templates branch

Conversation

@Tostino (Contributor) commented Oct 28, 2023

This pull request introduces chat template support to vLLM, using the template stored in the tokenizer to improve compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates
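
For context, a minimal sketch (not part of the PR) of the underlying Hugging Face API this feature builds on; the model name is one mentioned later in this thread, and the exact rendering depends on the template that model ships with:

from transformers import AutoTokenizer

# The tokenizer carries the model's Jinja chat template; apply_chat_template
# renders an OpenAI-style message list into the prompt format the model expects.
tokenizer = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Got any creative ideas for a 10 year old's birthday?"},
]

# add_generation_prompt=True appends the tokens that cue a new assistant turn.
prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                       add_generation_prompt=True)
print(prompt)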

Key Changes:

  1. Chat Template Integration: The vLLM server can now use a predefined Jinja chat template stored in the tokenizer for supported models.
  2. Template Override: Users can provide their own custom chat template via the --chat-template argument.
  3. Documentation Update: The Quickstart guide and related documentation have been updated to instruct users on how to use this new feature.

Benefits:

  1. Replaces FastChat with a more standardized way of handling templates.
  2. New models or templates don't need special support added to another library; they just need to implement the template in the tokenizer.
  3. Fixes issues with formatting being applied inconsistently.

Would appreciate reviews and feedback on this new feature integration.

@Tostino (Contributor Author) commented Oct 28, 2023

I haven't touched any of the code that is failing the pylint check.

Edit: I see what happened; fixed.

@flexchar

This would be so good. I used Mistral with vLLM's Chat Completion endpoint, and it was using a horribly wrong template from FastChat. It should have been ChatML, but it was something with hashtags and roles.

@AguirreNicolas (Contributor)

This functionality is really necessary for everyone. Please consider supporting it.

So far I have tested it with:

  • TheBloke/Llama-2-13B-chat-AWQ
  • TheBloke/Mistral-7B-OpenOrca-AWQ
  • TheBloke/zephyr-7B-beta-AWQ

TheBloke/Llama-2-13B-chat-AWQ

The chat_template wasn't in the tokenizer_config.json, so I used the --chat-template flag pointing to the official chat_template from Meta's tokenizer_config.json (sketched below). It worked properly.

TheBloke/Mistral-7B-OpenOrca-AWQ

The chat_template is already in the tokenizer_config.json, so the calls were smooth and OK.

TheBloke/zephyr-7B-beta-AWQ

The chat_template is already in the tokenizer_config.json, BUT when calling http://localhost:8000/v1/chat/completions, there is a <|assistant|>\n added at the beginning of the content of the response.
So I will test it further. Maybe the default chat_template is not properly defined, or maybe there is some error in the code.

Hope you find the feedback useful.
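
A rough sketch (not the PR's code) of what the --chat-template override amounts to; the template file name here is hypothetical:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-chat-AWQ")

# Load a Jinja chat template from disk and install it on the tokenizer,
# overriding whatever the repo does (or does not) ship with.
with open("llama-2-chat.jinja") as f:  # hypothetical file name
    tokenizer.chat_template = f.read()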

@timothylimyl commented Nov 1, 2023


@AguirreNicolas, did you manage to add your own custom system message to Zephyr?

I raised an issue here: Here

@Tostino (Contributor Author) commented Nov 1, 2023

@WoosukKwon is there a chance to get a review on this?

@casper-hansen (Contributor)

One thing that I don't understand is why chat templates were integrated to begin with. It introduces a lot of opaque bugs, especially if you use a non-standard format.

How can I use vLLM without any chat template being applied, i.e. where I apply the chat template myself directly in the prompt that I pass to the engine?

@AguirreNicolas (Contributor) commented Nov 4, 2023

One thing that I don't understand is why chat templates were integrated to begin with. It introduces a lot of opaque bugs, especially if you use a non-standard format.

How can I use vLLM without any chat template being applied, i.e. where I apply the chat template myself directly in the prompt that I pass to the engine?

@casper-hansen This PR only applies to vllm/entrypoints/openai/api_server.py, specifically to /v1/chat/completions.
You could use /v1/completions, or move to vllm/entrypoints/api_server.py + /generate.
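
A short sketch of that workaround (base URL, model name, and template text are examples): render the prompt yourself and call the plain completions endpoint, so no server-side chat template is involved.

import requests

# Prompt already formatted in the model's own chat format (ChatML here).
prompt = "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n"

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "teknium/OpenHermes-2.5-Mistral-7B",
        "prompt": prompt,
        "max_tokens": 128,
        "stop": ["<|im_end|>"],
    },
)
print(resp.json()["choices"][0]["text"])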

# Conflicts:
#	vllm/entrypoints/openai/api_server.py
@dongxiaolong commented Nov 9, 2023

Hello @WoosukKwon and @zhuohan123

We are facing an urgent issue that requires immediate attention and resolution. I hope you can review and merge the associated PR to address this problem. The issue at hand is causing erroneous and extraneous data to be included in the model's input, such as:

"""

###Human: Got any creative ideas for a 10 year old’s birthday?

Assistant: Of course! Here are some creative ideas for a 10-year-old's birthday party:

  1. Treasure Hunt: Organize a treasure hunt in your backyard or nearby park. Create clues and riddles for the kids to solve, leading them to hidden treasures and surprises.
  2. Science Party: Plan a science-themed party where kids can engage in fun and interactive experiments. You can set up different stations with activities like making slime, erupting volcanoes, or creating simple chemical reactions.
  3. Outdoor Movie Night: Set up a backyard movie night with a projector and a large screen or white sheet. Create a cozy seating area with blankets and pillows, and serve popcorn and snacks while the kids enjoy a favorite movie under the stars.
  4. DIY Crafts Party: Arrange a craft party where kids can unleash their creativity. Provide a variety of craft supplies like beads, paints, and fabrics, and let them create their own unique masterpieces to take home as party favors.
  5. Sports Olympics: Host a mini Olympics event with various sports and games. Set up different stations for activities like sack races, relay races, basketball shooting, and obstacle courses. Give out medals or certificates to the participants.
  6. Cooking Party: Have a cooking-themed party where the kids can prepare their own mini pizzas, cupcakes, or cookies. Provide toppings, frosting, and decorating supplies, and let them get hands-on in the kitchen.
  7. Superhero Training Camp: Create a superhero-themed party where the kids can engage in fun training activities. Set up an obstacle course, have them design their own superhero capes or masks, and organize superhero-themed games and challenges.
  8. Outdoor Adventure: Plan an outdoor adventure party at a local park or nature reserve. Arrange activities like hiking, nature scavenger hunts, or a picnic with games. Encourage exploration and appreciation for the outdoors.
    Remember to tailor the activities to the birthday child's interests and preferences. Have a great celebration!

"""

This unnecessary information introduces significant instability to the user experience. It's critical that we address this to maintain the integrity and reliability of our model.

Thank you for your prompt action on this matter.

#1000 #1169

@aarnphm (Contributor) left a comment

LGTM, with one comment. Will need @WoosukKwon to sign off.

@dongxiaolong commented Nov 10, 2023

Hi @Tostino. Have you noticed that the current template causes some models to occasionally return tags like <|im_start|>?

@Tostino (Contributor Author) commented Nov 10, 2023

LGTM, with one comment. Will need @WoosukKwon to sign off.

I don't know if it's good yet. I looked at things a few nights ago, but I'm on vacation right now and couldn't spend enough time to verify how it works with multiple models (slow network speed where I am right now).

I don't think it should be passed as a CLI argument. Maybe as part of the API request, but I need some more time to actually test things out.

Reason being, you want to be able to have the endpoint complete responses that are partially started, and I wanted to make sure that is still possible with various templates already in use.

The other thing is, not all templates out there actually follow the spec. I'd say we should get those models to change their templates to work properly rather than working around the bugs in the templates. I was planning on documenting the primary templates in use today, and the models using them, to see what is valid and what needs more work.

@aarnphm (Contributor) commented Nov 10, 2023

I don't think it should be passed as a CLI argument. Maybe as part of the API request, but I need some more time to actually test things out.

Chat templates should be a before-inference option, right? If you are running a model that only works with one chat template, then I don't really see the point of changing the prompt template at inference time.

@Tostino (Contributor Author) commented Nov 10, 2023

I don't think it should be passed as a CLI argument. Maybe as part of the API request, but I need some more time to actually test things out.

Chat templates should be a before-inference option, right? If you are running a model that only works with one chat template, then I don't really see the point of changing the prompt template at inference time.

I'm not saying we should change the template at inference time, just that some inference requests may need to pass in add_generation_prompt=True while others will want to pass in False if they want a completion of an already-started reply from a specific role (if the model is trained to do that). I wanted to test that before locking us into a design that limits our options.

@aarnphm (Contributor) commented Nov 10, 2023

Yes, that makes sense. But I still see value in being able to set it via the CLI as well, in addition to the ability to change it at inference time.

@Tostino (Contributor Author) commented Nov 14, 2023

@adamlin120 and @aarnphm
Okay, so I have spent a good bit of time doing analysis on a few different chat templates at this point. Not all of them are bug-free; in fact, none of them are (yet). But I've determined a number of things (illustrated with a short sketch after the lists below):

The template used by mistralai/Mistral-7B-Instruct-v0.1:

  • Doesn't respect the add_generation_prompt variable.
  • The spacing for the model reply and the user reply is inconsistent between the [INST] tags, and the response often starts with a leading space in the content.
  • Adds an extra start token in the template, so there are two start tokens... Unsure whether the model was trained this way or not (if it was, then the template is fine).

The template used by teknium/OpenHermes-2.5-Mistral-7B:

  • add_generation_prompt=False:
    • Fails all cases.
    • The response contains the generated <|im_start|> assistant\n as part of the content, which needs to be parsed out.
    • The response cannot be continued correctly because <|im_end|>\n is appended to the final assistant response regardless of this setting.
    • The model then generates its own next response incorrectly, with an additional space between im_start and the assistant role name (<|im_start|> assistant\n).
  • add_generation_prompt=True:
    • Works fine if a user response was last in the chain.
    • The assistant's response cannot be continued correctly because <|im_end|>\n<|im_start|>assistant\n is appended to the final assistant response, leading to possibly reduced performance due to divergence from the training data.

The template used by Inkbot (inkbot.jinja.txt):

  • add_generation_prompt=False:
    • Works fine for all cases (last message = user, user_context, or assistant).
    • The model will complete the role's response, and then end the response without attempting to respond as another role.
  • add_generation_prompt=True:
    • Works fine for all cases (last message = user, user_context, or assistant).
    • Will properly insert a new assistant reply if the last message was not already from the assistant.
    • If the last message was from the assistant, it will properly complete that message without inserting a new reply marker.
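
The sketch referenced above: an illustration (not from the PR) of why add_generation_prompt matters for a ChatML-style template such as OpenHermes'; the exact rendering depends on the template shipped with the model.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
messages = [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello! How can"},  # partially started reply
]

# add_generation_prompt=True should end the prompt with a fresh
# "<|im_start|>assistant\n", starting a brand-new assistant turn.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))

# add_generation_prompt=False should ideally leave the prompt ending right after
# "Hello! How can" so the model continues the partial reply; the analysis above
# shows which templates actually behave this way.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=False))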

That said:

  1. The introduction of "add_generation_prompt" as a request parameter is a necessary enhancement for the API. If the template supports it, it allows users to dictate whether the response should continue from the last message in the conversation or initiate a new "assistant" response. By default it is set to true, aligning with OpenAI's chat completion API, which always generates assistant-driven responses. When set to false, it enables our API to complete the last response in the list, e.g. completing a user's prompt for them (at their request), allowing them to edit it, and then generating an assistant reply. This flexibility is essential for diverse chat scenarios where the assistant's role varies.

  2. To allow completing a response rather than generating a new assistant response, there needs to be support for returning the correct role in the response object, so I fixed that.

  3. Also, when allowing completion of a response, it makes sense for the API to (optionally) return the full response, including the part that was fed in as input from the client, so the client doesn't need to merge the partial input with the partial response. So I implemented "return_full_response" as a request parameter as well. By default it is set to false to maintain compatibility with the OpenAI API.

Edit:

  1. I think setting these things from the CLI should be taken care of in a different PR that handles all of them properly.
  2. I searched for fastchat and believe I removed it everywhere already.
  3. @dongxiaolong "Have you noticed that the current template causes some models to occasionally return tags like <|im_start|>?" Yes, that happens when the template is buggy or you are using it incorrectly. With the changes I made, that should now be fixed as long as you are using the template correctly and have your stop words set.

… parameter controls whether the prompt ends with tokens indicating the start of an assistant message. When set to False (and the template/model supports it) the model should complete the last response in the list. By default, it maintains compatibility with OpenAI's API behavior.

Fixed Role Determination in Responses: Resolved an issue where the role in responses defaulted to "assistant" regardless of context. This fix ensures the response role aligns with the intended conversational participant, enhancing the API's versatility in various chat scenarios.

Introduced 'return_full_response' Request Parameter (Default: False): This parameter, when set to True, negates the need for client-side response merging. It simplifies client integration by providing complete responses even when the client had started the response for the model.

@aarnphm (Contributor) left a comment

Also, I don't think return_full_response makes sense here. The purpose of this file is to provide an OpenAI-compatible API, and I don't think people who use OpenAI will expect the generated text to be returned along with the prompt.

My philosophy on this is that since it is a compatibility server, we shouldn't deviate from the ground truth too much.

@Tostino (Contributor Author) commented Nov 14, 2023

Also, I don't think return_full_response makes sense here. The purpose of this file is to provide an OpenAI-compatible API, and I don't think people who use OpenAI will expect the generated text to be returned along with the prompt.

My philosophy on this is that since it is a compatibility server, we shouldn't deviate from the ground truth too much.

The OpenAI chat completion API doesn't support completions at all, so you can't start a response for the assistant (or the user) and have it finished. You can do that with plenty of local models, and that is the point of the add_generation_prompt parameter to begin with. Once you add that support, it makes sense to make it easier for clients to support both the real OpenAI endpoint, which doesn't support this functionality (and thus wouldn't benefit from return_full_response), and local models which support more features.

How would you suggest packaging up parameters that are only needed for some models/templates which support them?

Edit: @aarnphm as mentioned in that other PR, since echo already exists as a parameter on the legacy completions API, reusing echo for this implementation is probably better than inventing a new return_full_response parameter. The official API still doesn't support this functionality for chat, but if they ever add it, echo is much more likely to be the name they use.
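
For illustration, a hedged sketch of how a client might use the parameters proposed in this PR (add_generation_prompt and echo are PR-specific, not part of the official OpenAI API; URL and model name are examples):

import requests

# Ask the server to continue a partially written assistant reply and to echo
# the partial input back merged with the completion.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "teknium/OpenHermes-2.5-Mistral-7B",
        "messages": [
            {"role": "user", "content": "Write a haiku about rain."},
            {"role": "assistant", "content": "Soft drops on the roof,"},
        ],
        "add_generation_prompt": False,  # complete the last message, don't start a new turn
        "echo": True,                    # return input + completion together
        "max_tokens": 64,
    },
)
print(resp.json()["choices"][0]["message"]["content"])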

@Tostino (Contributor Author) commented Nov 14, 2023

In my testing, I also found inconsistencies between how the streaming API was implemented here and the official OpenAI implementation. Parts of the JSON that should have been excluded when there was no data were always being included. I have a fix for that in the works.
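
A minimal sketch of the kind of fix described, assuming pydantic v1-style response models as vLLM used at the time: serialize streamed chunks so fields with no data are omitted rather than sent as null.

from typing import Optional

from pydantic import BaseModel


class DeltaMessage(BaseModel):
    role: Optional[str] = None
    content: Optional[str] = None


chunk = DeltaMessage(content="Hello")
# exclude_none drops the unset "role" field from the serialized chunk,
# matching how the official API omits absent fields in streaming deltas.
print(chunk.json(exclude_none=True))  # -> {"content": "Hello"}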

…streaming responses after testing how it worked with the regular OpenAI completion API.

Fixed inconsistencies between the official OpenAI API and what we were returning for streaming responses from the chat completion API.

Added error handling so that if there is an issue applying the template, it is reported to the user through an API error and logged.

@Tostino (Contributor Author) commented Nov 15, 2023

Okay, the changes I was working on today have been pushed. Error handling is in place: if you have a template that errors, it will report the error properly. Fixed issues where the streaming API didn't conform to the spec. Renamed return_full_response to echo, and implemented it for streaming (as I should have originally).
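
A minimal sketch of the error-handling idea (helper name and status code are illustrative, not the PR's actual implementation):

import logging

from fastapi import HTTPException

logger = logging.getLogger(__name__)


def render_prompt(tokenizer, messages, add_generation_prompt=True):
    """Apply the tokenizer's chat template, surfacing failures as an API error."""
    try:
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=add_generation_prompt)
    except Exception as e:  # broken Jinja, missing variables, etc.
        logger.error("Error applying chat template: %s", e)
        raise HTTPException(status_code=400,
                            detail=f"Error applying chat template: {e}")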

@simon-mo (Collaborator)

@Tostino thank you for this PR. Do you know whether there are default chat templates for all the model architectures vLLM supports today?

@aarnphm (Contributor) commented Nov 16, 2023

@Tostino thank you for this PR. Do you know whether there are default chat templates for all the model architectures vLLM supports today?

AFAIK only Mistral, Llama, Baichuan, StableLM, and Phi have default chat templates.

@simon-mo (Collaborator)

In that case, maybe we can use fastchat as a fallback?

@aarnphm (Contributor) commented Nov 17, 2023

In that case, maybe we can use fastchat as a fallback?

I think that if a model doesn't have a chat template, transformers>4.35.0 will by default fall back to a default one (IIRC it is the ChatML template, or the tokenizer class's default chat template; see https://github.com/huggingface/transformers/blob/b074461ef0f54ce37c5239d30ee960ece28d11ec/src/transformers/tokenization_utils_base.py#L1736)
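
A small sketch of the fallback chain being discussed, using the tokenizer attributes available in the transformers versions of that era (default_chat_template has since been deprecated):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TheBloke/zephyr-7B-beta-AWQ")

# Prefer the template stored with the tokenizer; otherwise fall back to the
# tokenizer class's default (a ChatML-style template in transformers >= 4.35).
template = tokenizer.chat_template or tokenizer.default_chat_template
print(template[:200] if template else "no chat template available")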

@Tostino (Contributor Author) commented Nov 19, 2023

@simon-mo here are all the models that have their own default chat templates through HF: https://github.com/search?q=repo%3Ahuggingface/transformers%20default_chat_template&type=code

@Tostino (Contributor Author) commented Nov 20, 2023

Looks like there are now conflicts to resolve after recent merges/release. I'll get that done as soon as I can...

@simon-mo (Collaborator) left a comment

I believe there's a conflict with the recently merged usage PR. Thank you again for doing this; I think this definitely helps simplify things.

Comment on lines +655 to +664
if chat_template is not None:
    tokenizer.chat_template = chat_template

tmp_template = tokenizer.chat_template or tokenizer.default_chat_template
if tmp_template:
    logger.info(f"Chat template:\n{tmp_template}")
else:
    logger.warning(
        "No chat template loaded, the chat endpoint will not work.")

@simon-mo (Collaborator):

Suggested change (replacing the warning branch above with a hard failure):

if chat_template is not None:
    tokenizer.chat_template = chat_template

final_template = tokenizer.chat_template or tokenizer.default_chat_template
if final_template:
    logger.info(f"Using chat template:\n{final_template}")
else:
    raise ValueError(
        "No chat template loaded; the chat endpoint will not work. "
        "There is no default template from the tokenizer. "
        "Please explicitly provide one through `--chat-template`.")

I would prefer hard failing here to make sure there's a consistent experience, as before when FastChat was not found. However, this does mean that users who want to use only the completion API won't be able to proceed. Perhaps we can add another flag to enable the completion API without a chat template?

@Tostino (Contributor Author):

If you prefer a hard fail, that's fine with me. I was just going with the approach of returning an error to the user, logging it, and allowing the other endpoints to work because it seemed like a better user experience. I'll go with a hard fail instead.

@simon-mo (Collaborator):

Oh, maybe hard fail plus a --completion-api-only flag. What do you think? My thinking is that failing fast on the server side is better than users getting errors at runtime from the chat endpoint.

@Tostino (Contributor Author):

I'll do an "endpoints" list as input, with an empty list meaning "all". I figure more endpoints will be added eventually, considering the OpenAI API expansion with agents and all that.

Any opposition?

@simon-mo (Collaborator):

Sounds good!


@Tostino (Contributor Author):

@simon-mo So hard-failing here is actually harder than I was hoping. We don't know that the template is broken until we go to apply it at runtime due to a user's request. So the template can be totally broken Jinja, and we won't know it until a user makes a request against it.

Do you want to check template validity at startup using the jinja library directly?
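
If startup validation is wanted, a minimal sketch using jinja2 directly (this only catches syntax errors; logic errors still surface at render time):

import jinja2


def validate_chat_template(template_str: str) -> None:
    """Fail fast on malformed Jinja instead of erroring on the first request."""
    try:
        jinja2.Environment().parse(template_str)
    except jinja2.TemplateSyntaxError as e:
        raise ValueError(f"Invalid chat template: {e}") from e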

With the changes to transformers that add a default template in tokenization_utils_base, there really isn't a case where I could get it to crash due to a missing template. I am just going to remove the "else" entirely at this point.

I suppose I'll leave the code in place for:

parser.add_argument("--disable-endpoints",
                        type=json.loads,
                        default=[],
                        help="List of disabled endpoints")

@zhuohan123 (Member)

Do you think it is possible/necessary to keep FastChat as an optional dependency for now? Sometimes people may want to run chatbots like Vicuna, which does not have a template on Hugging Face.

@simon-mo (Collaborator)

@zhuohan123 I think alternatively we can provide the Vicuna template in the examples directory so people can download it easily?

@Tostino (Contributor Author) commented Nov 21, 2023

@zhuohan123 I think alternatively we can provide the Vicuna template in the examples directory so people can download it easily?

Completely agree here. I really don't like keeping FastChat as a dependency if we can help it.

If anyone wants to add templates like Vicuna's to this thread as Jinja templates, I'll add them to the examples folder (a rough sketch follows below). Ideally, you would open a PR on the HF repo to add the template to the model itself; that is the long-term solution.

Edit: a word
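
For reference, a rough, untested approximation of a Vicuna-v1.1-style template expressed as a Jinja string; the exact wording, spacing, and stop tokens should be checked against FastChat's conversation definitions before adding anything like this to the examples folder.

# Hypothetical sketch only; verify against FastChat's vicuna_v1.1 conversation
# template before use.
VICUNA_TEMPLATE = (
    "{% if messages[0]['role'] == 'system' %}{{ messages[0]['content'] + ' ' }}{% endif %}"
    "{% for m in messages %}"
    "{% if m['role'] == 'user' %}USER: {{ m['content'] }} "
    "{% elif m['role'] == 'assistant' %}ASSISTANT: {{ m['content'] }}</s>"
    "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}ASSISTANT:{% endif %}"
)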

@Tostino Tostino closed this Nov 22, 2023
@Tostino Tostino deleted the chat_templates branch November 22, 2023 16:34
@Tostino Tostino restored the chat_templates branch November 22, 2023 16:38
@Tostino Tostino deleted the chat_templates branch November 22, 2023 16:38
@Tostino (Contributor Author) commented Nov 22, 2023

Heh, well... that didn't work.
I had to do my rebase work in another, freshly created branch (chat_templates_redux). It was simply way too confusing on the existing branch to resolve the conflicts multiple times for the different commits that had conflicts.
I was going to try renaming the old chat_templates branch -> chat_templates_old and then chat_templates_redux -> chat_templates... it ended up closing this PR.

Give me a few... I'm trying to get it fixed if I can; if not, I will just open a new PR =(.

@Tostino (Contributor Author) commented Nov 22, 2023

Well, crap. From what I read, I can't fix it at this point and need to open a new PR: #1756

@simon-mo and @aarnphm ^^^

@@ -70,50 +63,13 @@ async def check_model(request) -> Optional[JSONResponse]:


async def get_gen_prompt(request) -> str:
    if not _fastchat_available:

Contributor:

cc @Tostino this is the previous behaviour.

@Stealthwriter

I've been following this for a while. I'm using a fine-tuned Mistral model.

Every time I start the OpenAI server, any request gets the 10-year-old's birthday prompt automatically injected.

It's very annoying. Did anyone find a solution for this?

@Tostino (Contributor Author) commented Dec 1, 2023

@Stealthwriter
It got merged yesterday (the newer PR).

It'll be in the next release, or you can build/install from main now.
