Server completion streaming returns special tokens as empty strings in chunks #7106
Comments
Actually, the special tokens are not output to the |
This may be related to #6860.
Did you generate the GGUF yourself, or download it? How old is it?
It's the reuploaded version, with fixed BPE tokenizer.
Yes, special tokens are not rendered in |
I would argue that rendering them should be mandatory for the Completion API, since it deals with token generation at a lower level than the Chat API. Therefore, if the model generates a sequence of tokens, these tokens should be visible to the API client.
I agree. This is especially true for training, finetuning, and testing.
I think making this user-configurable is a good compromise.
If you render special tokens as text, it will be difficult to distinguish a special token from regular text that happens to match the token's name/string. When streaming, if the whole name arrives in one event, it's probably the special token, and if it's broken across multiple events, it's regular text; without streaming, I don't see any way to distinguish the two cases. A better approach would be to return special tokens in a separate field. For streaming, we can add a `tokens` field with an array of the tokens that correspond to the text in `content`. When `content` is empty and `tokens` is non-empty, the client will know it's a special token. When not streaming, we can use the same format that is accepted for prompts – an array of token identifiers and strings. The response to the example in the original report would be:
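A rough sketch of what such a streamed response could look like, shown here as Python dicts (the field names and token ids are illustrative placeholders, not an implemented llama.cpp API):

```python
# Sketch of the proposed streaming format: each event carries the detokenized
# text in "content" plus the raw token data in a separate "tokens" field.
# Field names and token ids are illustrative placeholders.
proposed_stream_events = [
    {"content": "", "tokens": [{"id": 128006, "piece": "<|start_header_id|>"}]},
    {"content": "assistant", "tokens": [{"id": 78191, "piece": "assistant"}]},
    {"content": "", "tokens": [{"id": 128007, "piece": "<|end_header_id|>"}]},
    {"content": "\n\n12 + 19 = 31", "tokens": []},  # regular text
]

# A client that wants special tokens looks for events where "content" is
# empty but "tokens" is not.
for event in proposed_stream_events:
    if not event["content"] and event["tokens"]:
        print("special token:", event["tokens"][0]["piece"])
```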
A client that expects special tokens to be generated should ignore `content` and process the `generated` field, or however it ends up being named. Also, I think you are not supposed to ask Llama 3 to generate special tokens other than |
This issue was closed because it has been inactive for 14 days since being marked as stale.
Original report:
Version: b2794.
Model: Meta-Llama-3-8B-Instruct-Q8_0.gguf (updated)
Prompt: "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>"
When I run the server and send a completion request with streaming, I can see in the verbose logs that the server generates the tokens "<|start_header_id|>", "assistant", and "<|end_header_id|>", followed by "\n\n12 + 19 = 31".
However, the streaming chunks sent by the server for "<|start_header_id|>" and "<|end_header_id|>" have empty strings as `content` in `data`. I couldn't find a config parameter, either in the server or in the request, that could change this behavior.
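For reference, a minimal reproduction sketch, assuming a llama.cpp server listening on localhost:8080 and its SSE-style /completion streaming (exact endpoint and response fields may vary between server versions):

```python
# Send the prompt from the report with streaming enabled and print the
# "content" of each streamed chunk; the chunks corresponding to
# <|start_header_id|> and <|end_header_id|> arrive with empty strings.
import json
import requests

payload = {
    "prompt": "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>",
    "n_predict": 64,
    "stream": True,
}

with requests.post("http://localhost:8080/completion", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alive/empty lines
        chunk = json.loads(line[len(b"data: "):])
        print(repr(chunk.get("content", "")))
```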