Server completion streaming returns special tokens as empty strings in chunks #7106
Comments
Actually, the special tokens are not output to the |
This may be related to #6860.
Did you generate the GGUF yourself, or download it? How old is it?
It's the reuploaded version, with fixed BPE tokenizer.
Yes, special tokens are not rendered in |
I would argue that rendering them should be mandatory for the Completion API, since it deals with token generation at a lower level than the Chat API. Therefore, if the model generates a sequence of tokens, these tokens should be visible to the API client.
I agree. This is especially true for training, finetuning, and testing.
I think making this user-configurable is a good compromise.
If you render special tokens as text, it will be difficult to distinguish a special token from regular text that happens to match the token's name/string. When streaming, if the whole name arrives in one event, it's probably the special token, and if it's broken across multiple events, it's regular text; without streaming, I don't see any way to distinguish the two cases. A better approach would be to return special tokens in a separate field. For streaming, we can add a `tokens` field with an array of the tokens that correspond to the text in `content`. When `content` is empty and `tokens` is non-empty, the client will know it's a special token. When not streaming, we can use the same format that is accepted for prompts – an array of token identifiers and strings. The response to the example in the original report would be:
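A rough sketch of what such a streamed response could look like, shown here as Python dicts (the field names and token ids are illustrative placeholders, not an implemented llama.cpp API):

```python
# Sketch of the proposed streaming format: each event carries the detokenized
# text in "content" plus the raw token data in a separate "tokens" field.
# Field names and token ids are illustrative placeholders.
proposed_stream_events = [
    {"content": "", "tokens": [{"id": 128006, "piece": "<|start_header_id|>"}]},
    {"content": "assistant", "tokens": [{"id": 78191, "piece": "assistant"}]},
    {"content": "", "tokens": [{"id": 128007, "piece": "<|end_header_id|>"}]},
    {"content": "\n\n12 + 19 = 31", "tokens": []},  # regular text
]

# A client that wants special tokens looks for events where "content" is
# empty but "tokens" is not.
for event in proposed_stream_events:
    if not event["content"] and event["tokens"]:
        print("special token:", event["tokens"][0]["piece"])
```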
A client that expects special tokens to be generated should ignore `content` and process the `generated` field, or however it ends up being named. Also, I think you are not supposed to ask Llama 3 to generate special tokens other than |
This issue was closed because it has been inactive for 14 days since being marked as stale.
Original report:
Version: b2794.
Model: Meta-Llama-3-8B-Instruct-Q8_0.gguf (updated)
Prompt: "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>"
When I run the server and send a completion request with streaming, I can see in the verbose logs that the server generates the tokens "<|start_header_id|>", "assistant", and "<|end_header_id|>", followed by "\n\n12 + 19 = 31".
However, the streaming chunks sent by the server for "<|start_header_id|>" and "<|end_header_id|>" have empty strings as `content` in `data`. I couldn't find a config parameter, either in the server or in the request, that could change this behavior.
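For reference, a minimal reproduction sketch, assuming a llama.cpp server listening on localhost:8080 and its SSE-style /completion streaming (exact endpoint and response fields may vary between server versions):

```python
# Send the prompt from the report with streaming enabled and print the
# "content" of each streamed chunk; the chunks corresponding to
# <|start_header_id|> and <|end_header_id|> arrive with empty strings.
import json
import requests

payload = {
    "prompt": "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>",
    "n_predict": 64,
    "stream": True,
}

with requests.post("http://localhost:8080/completion", json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue  # skip keep-alive/empty lines
        chunk = json.loads(line[len(b"data: "):])
        print(repr(chunk.get("content", "")))
```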