-
-
Notifications
You must be signed in to change notification settings - Fork 5.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
"/v1/chat/completions" tokenization issue #2012
Comments
I did document this issue with the Mistral template in one of the PRs: #1493 (comment) |
The issue seems to be that vllm/vllm/engine/llm_engine.py Line 280 in f375ec8
However, like mentioned above, the vllm/vllm/entrypoints/openai/api_server.py Line 234 in f375ec8
|
I don't see any use in having |
Note that this not only impacts the "/v1/chat/completions" endpoint, but also when using an embedded LLMEngine which gets prompts passed in which have been formatted using |
I just want to note that the special tokens being in the template or not is totally upto the author of the model/template. Most templates don't add special tokens that i've seen. But maybe that is a mistake, and the templates really should be forced to handle the special tokens. I wonder what the best way forward is... That obviously conflicts with the HF default of adding the tokens though. |
True! For the model I was working with (OpenChat 3.5), they had However, the reason I think "it just works" with vanilla HF is because they explicitly set So to be compatible with HF we would have to do something similar... |
It appears that the completion endpoint, The remedy for me was to use the |
Context
The "/v1/chat/completions" endpoint uses the
apply_chat_template
method of the HF tokenizers. It seems to us that these templates take care of adding special tokens (cf. this line from Llama's default template). However, tokenization in vLLM also seems to add special token(s) if this is the tokenizer's default behavior - in particular, the Llama tokenizer adds a BOS token at the start of its tokenization.There are therefore configurations in which the final tokenization will contain more special tokens than necessary.
Repro
In a terminal, launch a vLLM server. For example:
In another terminal, request this server:
Output:
We can see that the prompt token ids start with two 1s instead of one.
This issue also impacts the new
mistralai/Mixtral-8x7B-Instruct-v0.1
model added in the PR #2011The text was updated successfully, but these errors were encountered: