feat: add eot_tokens and train_on_eot for chat_template EOT parsing #2364
Description
Introduce new `eot_tokens` and `train_on_eot` fields to extend the capabilities of `chat_template` prompting to support an arbitrary number of delimiters. Also adds a token check within the jinja template to ensure that the `eos_token` and `eot_token` exist.
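A minimal sketch of what that template check might look like (the helper name `check_template_tokens` and its argument shapes are illustrative assumptions, not the actual implementation):

```python
import logging

LOG = logging.getLogger(__name__)

def check_template_tokens(chat_template: str, eos_token: str, eot_tokens: list[str]) -> None:
    """Warn when configured tokens never appear in the jinja chat_template."""
    # EOS may appear hardcoded (e.g. "</s>") or via the template variable eos_token.
    if eos_token not in chat_template and "eos_token" not in chat_template:
        LOG.warning("EOS token not found in chat_template")
    for token in eot_tokens:
        if token not in chat_template:
            LOG.warning("EOT token %s not found in chat_template", token)
```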
Motivation and Context
Newer templates use separate tokens for EOT and EOS, which causes confusion with the naming and usage of the prior fields. For example, the MistralV7-Tekken formatted prompt below uses `[/SYSTEM_PROMPT]`, `[/INST]`, and `</s>` to delimit different sections of the prompt, while the current chat_template parser expects only one delimiter. Furthermore, this will allow users to keep the original EOS token when finetuning instead of overwriting it with the EOT tokens.
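For reference, a rendered V7-Tekken style conversation looks roughly like the string below (an illustrative reconstruction; exact whitespace and token placement may differ from the real template):

```python
# Illustrative MistralV7-Tekken style prompt; note the three distinct delimiters.
prompt = (
    "<s>[SYSTEM_PROMPT]You are a helpful assistant.[/SYSTEM_PROMPT]"  # system turn ends with [/SYSTEM_PROMPT]
    "[INST]Hello![/INST]"   # user turn ends with [/INST]
    "Hi there!</s>"         # assistant turn ends with </s> (EOS)
)
```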
Implementation
The new code follows this order:

1. Check that `eot_tokens` exist in the template. If not provided, check whether EOS is in the template, either hardcoded or as the variable `eos_token`.
2. `train_on_eot` -> checks for all `eot_tokens` if they are provided, else it'll check for the EOS token.
3. `train_on_eos` -> only checks for the EOS token to mask or unmask.

It is assumed that there is only one `eot_token`
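A hedged sketch of that resolution order, simplified to booleans (the helper and the `cfg` field access are assumptions; the `train_on_*` options may carry more values than shown here):

```python
def should_unmask_delimiter(token: str, cfg) -> bool:
    # Hypothetical condensed view of the masking decision described above.
    # cfg.eot_tokens is assumed to already hold the EOS fallback when
    # eot_tokens was not explicitly configured.
    if token in cfg.eot_tokens:
        return bool(cfg.train_on_eot)  # EOT delimiters follow train_on_eot
    if token == cfg.eos_token:
        return bool(cfg.train_on_eos)  # plain EOS follows train_on_eos
    return False  # any other token keeps its default masking
```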
per turn.
Backwards Compatibility
If `train_on_eot` and `eot_tokens` aren't provided, the values will be loaded from `train_on_eos` and `tokenizer.eos_token` respectively. This preserves the legacy behavior. If either `train_on_eot` or `eot_tokens` is provided, it will not load from the respective variable.
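In code, the fallback might look like the following sketch (field names follow the PR text; the `cfg` and `tokenizer` objects are assumed context):

```python
# Legacy fallback: only fill in the new fields when they were not provided.
if cfg.eot_tokens is None:
    cfg.eot_tokens = [tokenizer.eos_token]
if cfg.train_on_eot is None:
    cfg.train_on_eot = cfg.train_on_eos
```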
New Errors Thrown
- `eot_tokens` contains `eos_token` and there is a conflict between `train_on_eot` and `train_on_eos` <- ValueError
- EOS token not found in chat_template <- Log Warning
- EOT tokens not found in chat_template <- Log Warning
- EOT tokens not added to tokenizer <- ValueError
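A sketch of how those checks could be wired up (the function name and exact messages are illustrative; `tokenizer.get_vocab()` is the standard Hugging Face accessor):

```python
def validate_eot_config(cfg, tokenizer) -> None:
    # ValueError: eot_tokens contains eos_token while the train_on_* flags conflict.
    if tokenizer.eos_token in cfg.eot_tokens and cfg.train_on_eot != cfg.train_on_eos:
        raise ValueError(
            "eot_tokens contains eos_token but train_on_eot conflicts with train_on_eos"
        )
    # ValueError: every EOT token must already exist in the tokenizer vocab.
    missing = [t for t in cfg.eot_tokens if t not in tokenizer.get_vocab()]
    if missing:
        raise ValueError(f"EOT tokens not added to tokenizer: {missing}")
```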
Alternatives
There was an initial discussion to use `tokenizer.eos_token: List[str]`, which applies for generation; however, it is only a `str` in other modes.
Discussion / ToDos
Questions:
- Q: What if the `system` turn doesn't exist? A: Since this is an uncommon case, we can ignore it.
- The defaults should cover `train_on___` (at most `eot_tokens:` needs to be set), so normal users won't need to mess with this.

Docs:
Tests to add: