Description
Version: b2794.
Model: Meta-Llama-3-8B-Instruct-Q8_0.gguf (updated)
Prompt: "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>"
When I run the server and send a completion request with streaming, the verbose logs show that the server generates "<|start_header_id|>", "assistant", and "<|end_header_id|>", followed by "\n\n12 + 19 = 31".
However, the streaming chunks the server sends for <|start_header_id|> and <|end_header_id|> have empty strings as the content field of the data payload.
I couldn't find a config parameter either in the server or in the request that could change this behavior.
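For reference, a minimal reproduction sketch. It assumes the server's default /completion endpoint on localhost:8080 and uses the Python requests library; adjust host, port, and parameters as needed.

```python
import json
import requests  # assumption: using the 'requests' library for the HTTP call

# Prompt from the report above.
prompt = "<|start_header_id|>user<|end_header_id|>How much is 12 plus 19?<|eot_id|>"

# Send a streaming completion request (endpoint and port are assumptions
# based on the default llama.cpp server configuration).
resp = requests.post(
    "http://127.0.0.1:8080/completion",
    json={"prompt": prompt, "stream": True, "n_predict": 64},
    stream=True,
)

for line in resp.iter_lines():
    if not line:
        continue
    # Streamed responses arrive as server-sent events: b'data: {...}'
    if line.startswith(b"data: "):
        chunk = json.loads(line[len(b"data: "):])
        # The header/special tokens show up as chunks whose "content" is "".
        print(repr(chunk.get("content", "")))
```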