-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Chat Template Support to vLLM #1493
Conversation
… better documentation as well.
I've not touched any of the code that is failing the pylint error. Edit: I see what happened, fixed. |
This would be so good. I used Mistral with vLLM and Chat Completion and it was using horribly wrong template from fastchat. It must have been ChatML, it was something with hashtags and roles. |
I found this functionality really necesary to all. Please consider to support it. So far I tested with
TheBloke/Llama-2-13B-chat-AWQThe chat_template wasn't in the TheBloke/Mistral-7B-OpenOrca-AWQThe TheBloke/zephyr-7B-beta-AWQThe Hope you find usefull the feedback. |
did you manage to add your own custom system message to Zephyr? I raised an issue here: Here |
@WoosukKwon is there a chance to get a review on this? |
One thing that I don’t understand is why chat templates were integrated to begin with. It introduces a lot of untransparent bugs, especially if you use a non-standard format. How can I use vLLM without any chat template applied where the chat template is applied directly in the prompt that I pass to the engine? |
@casper-hansen This PR only apply to |
# Conflicts: # vllm/entrypoints/openai/api_server.py
Hello @WoosukKwon and @zhuohan123 We are facing an urgent issue that requires immediate attention and resolution. I hope you can review and merge the associated PR to address this problem. The issue at hand is causing erroneous and extraneous data to be included in the model's input, such as: """
""" This unnecessary information introduces significant instability to the user experience. It's critical that we address this to maintain the integrity and reliability of our model. Thank you for your prompt action on this matter. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, with one comment. Will need @WoosukKwon to signoff
Hi@Tostino. Have you noticed that the current template causes some models to occasionally return tags like <|im_start|>? |
I don't know if it's good yet. I looked at things a few nights ago, but am on vacation right now and couldn't spend enough time to verify how it works with multiple models (slow net speed where I am right now). I don't think it should be passed as a CLI argument. Maybe as part of the API request, but I need some more time to actually test things out. Reason being, you want to be able to have the endpoint complete responses that are partially started, and I wanted to make sure that is still possible with various templates already in use. The other thing is, not all templates out there actually follow the spec. I'd say we should make those models change their template to work properly rather than working around the bugs in the template. I was planning on documenting the primary templates used today and the models using them to see what is valid, and what needs more work. |
Chat template should be a before-inference options right? If you are running the models that only works with one chat template then I don't really see the point of changing prompt template on inference |
I'm not saying to change the template at inference time, just that it may make sense for some inference requests to need to pass in add_generation_prompt=True and others will want to pass in False if they are wanting a completion of an already-started reply of a specific role (if the model is trained to do that). I wanted to test that prior to locking us into a design that limited our options. |
Yes that makes sense. But I still see the value of setting it at CLI as well, in additional to the ability to change at inference time |
@adamlin120 and @aarnphm The template used by mistralai/Mistral-7B-Instruct-v0.1:
The template used by teknium/OpenHermes-2.5-Mistral-7B:
The template used by Inkbot (inkbot.jinja.txt):
That said:
Edit:
|
… parameter controls whether the prompt ends with tokens indicating the start of an assistant message. When set to False (and the template/model supports it) the model should complete the last response in the list. By default, it maintains compatibility with OpenAI's API behavior. Fixed Role Determination in Responses: Resolved an issue where the role in responses defaulted to "assistant" regardless of context. This fix ensures the response role aligns with the intended conversational participant, enhancing the API's versatility in various chat scenarios. Introduced 'return_full_response' Request Parameter (Default: False): This parameter, when set to True, negates the need for client-side response merging. It simplifies client integration by providing complete responses even when the client had started the response for the model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I don't think return_full_response makes sense here. The purpose of this file is to makes OpenAI compatible API. I don't think people who uses OpenAI will have the expectation to return the generated text with prompt.
My philosophy on this is since it is a compatible server, we shouldn't deviate from the ground truth too much
The OpenAI chat completion API doesn't support completions at all, so you can't start a response for the assistant (or the user) and have it finish it. You can do that with plenty of local models, and that is the point of the How would you suggest packing up parameters which are only needed for some models/templates which support them? Edit: @aarnphm as mentioned in that other PR, since there is |
In my testing, I also found inconsistencies in how the streaming api was implemented with the OpenAI implementation. Parts of the json that should have been excluded if there was no data were always being included. I have a fix for that in the works. |
…streaming responses after testing how it worked with the regular OpenAI completion API. Fixed inconsistencies with official OpenAI API, and what we were returning for streaming responses for the chat completion API. Added error handling so if there is an issue with applying the template, it is reported to the user through an API error, and logged.
Okay, changes I was working on today were just pushed. Error handling in place. If you have a template that errors, it will report the error properly. Fixed issues with the streaming API where it didn't conform to the spec. Renamed |
@Tostino thank you for this PR. Do you know whether there are default ChatTemplate for all the model architecture vLLM supports today? |
afaik only mistral, llama, baichuan, stablelm, phi has a default chat templates |
In that case, maybe we can use fastchat as a fallback? |
I think by default if any models doesn't have a default chat templates, transformers>4.35.0 will fallback to a default one (IIRC it is the chatml template, or the tokenizer class default chat templates, see https://github.com/huggingface/transformers/blob/b074461ef0f54ce37c5239d30ee960ece28d11ec/src/transformers/tokenization_utils_base.py#L1736) |
@simon-mo here are all the models that have their own default chat templates through HF: https://github.com/search?q=repo%3Ahuggingface/transformers%20default_chat_template&type=code |
Looks like there are now conflicts to resolve after recent merges/release. I'll get that done as soon as I can... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there's conflict with the recently merged usage
PR. Thank you again for doing this and I think this definitely helps simplify things.
if chat_template is not None: | ||
tokenizer.chat_template = chat_template | ||
|
||
tmp_template = tokenizer.chat_template or tokenizer.default_chat_template | ||
if tmp_template: | ||
logger.info(f"Chat template:\n{tmp_template}") | ||
else: | ||
logger.warning( | ||
"No chat template loaded, the chat endpoint will be not work.") | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if chat_template is not None: | |
tokenizer.chat_template = chat_template | |
tmp_template = tokenizer.chat_template or tokenizer.default_chat_template | |
if tmp_template: | |
logger.info(f"Chat template:\n{tmp_template}") | |
else: | |
logger.warning( | |
"No chat template loaded, the chat endpoint will be not work.") | |
if chat_template is not None: | |
tokenizer.chat_template = chat_template | |
final_template = tokenizer.chat_template or tokenizer.default_chat_template | |
if final_template: | |
logger.info(f"Using chat template:\n{tmp_template}") | |
else: | |
raise ValueError("No chat template loaded, the chat endpoint will be not work. There is no default template from the tokenizer. Please explicitly provide one through `--chat-template`." | |
I would prefer hard failing here to make sure there's a consistent experience as before when fastchat is not found. However this does mean if users want to use completion API only they won't be proceed forward. Perhaps we can add another flag to enable completion API with chat template?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you prefer a hard-fail, that's fine with me. I was just going with the approach of returning an error to the user, logging the error, and allowing the other endpoints to work because it seemed like a better user experience. Will go with a hard-fail instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh maybe hard fail and --completion-api-only
flag. What do you think? My thinking is that fast failing on the server side is better than users getting errors at runtime for Chat endpoint.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll do an "endpoints" list as input, with an empty list meaning "all". I figure more endpoints will be added eventually, considering the OpenAI API expansion with agents and all that.
Any opposition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simon-mo So hard-fail here is actually harder than I was hoping. We don't know that the template is broken until we go to apply it at runtime due to a users request. So the template can be totally broken Jinja, and we won't know it until the user is making requests against it.
Do you want to check the template at startup validity using the jinja library directly?
With the changes to transformers to add a default template to the tokenization_utils_base, there really isn't a case I could get it to crash due to a missing tokenizer. I am just going to remove the "else" entirely at this point.
I suppose i'll leave the code in place for:
parser.add_argument("--disable-endpoints",
type=json.loads,
default=[],
help="List of disabled endpoints")
Do you think it is possible/necessary to keep fastchat as an optional dependency for now? Sometimes people may wanna run chatbots like vicuna, which does not have a template on huggingface. |
@zhuohan123 I think alternatively we can provide the Vicuna template in the examples directories so people can download them easily? |
Completely agree here. I really don't like keeping FastChat as a dependency if we can help it. If anyone wants to add any like Vicuna to this thread as jinja templates, I'll add them to the examples folder. Ideally, you would open a PR on the HF repo to add the template to the models. That is the long term solve for this. Edit: a word |
Heh, well...that didn't work. Give me a few... trying to get it fixed if I can, if not I will just open a new PR =(. |
@@ -70,50 +63,13 @@ async def check_model(request) -> Optional[JSONResponse]: | |||
|
|||
|
|||
async def get_gen_prompt(request) -> str: | |||
if not _fastchat_available: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @Tostino this is the previous behaviour.
I've been following this for a while, I'm using a fine tuned mistral model. Everytime I start the open ai server, with any request I get the 10 years old birthday prompt automatically injected. It's very annoying, did anyone find a solution for this? |
@Stealthwriter It'll be in the next release, or you can build/install from main now. |
This pull request introduces the chat template feature to vLLM, utilizing the template stored in the tokenizer, enhancing its compatibility with the OpenAI Chat API.
https://huggingface.co/blog/chat-templates
Key Changes:
Benefits:
Would appreciate reviews and feedback on this new feature integration.