feat: add eot_tokens and train_on_eot for chat_template EOT parsing #2364

Draft · wants to merge 6 commits into main

@NanoCode012 (Collaborator) commented Feb 26, 2025

Description

Introduce new eot_tokens and train_on_eot fields to extend the capabilities of chat_template prompting to support an arbitrary number of delimiters.

Adds a token check within the Jinja template to ensure that the eos_token and eot_tokens exist.

Motivation and Context

Newer templates use separate tokens for EOT (end of turn) and EOS (end of sequence), which causes confusion around the naming and usage of the prior fields.

For example, the MistralV7-Tekken formatted prompt below uses [/SYSTEM_PROMPT], [/INST], and </s> to delimit different sections of the prompt.

<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]

The current chat_template parser expects only one delimiter.

Furthermore, this will allow users to keep the original EOS token when fine-tuning instead of overwriting it with the EOT token.
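To make the new fields concrete, here is a hypothetical config sketch for the Tekken-style template above, written as a Python dict. The field names eot_tokens and train_on_eot come from this PR; the concrete values and the "turn" mode name are illustrative assumptions, not the PR's actual defaults.

```python
# Hypothetical config sketch for the MistralV7-Tekken example above.
# `eot_tokens` and `train_on_eot` are the new fields; values are illustrative.
config = {
    "chat_template": "tokenizer_default",           # assumed: use the tokenizer's template
    "eot_tokens": ["[/SYSTEM_PROMPT]", "[/INST]"],  # per-section delimiters
    "train_on_eot": "turn",                         # assumed mode name, mirrors train_on_eos
    # EOS keeps its original end-of-sequence meaning instead of being
    # overwritten with an EOT token:
    "special_tokens": {"eos_token": "</s>"},
}

# The EOS token is deliberately not listed in eot_tokens, so train_on_eos
# and train_on_eot can be set independently without conflict.
assert config["special_tokens"]["eos_token"] not in config["eot_tokens"]
```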

Implementation

The new code follows this order:

  1. Check that the eot_tokens exist in the template. If not provided, check whether EOS appears in the template, either hardcoded or as the eos_token variable.

  2. train_on_eot -> applies to all eot_tokens if they are provided; otherwise it falls back to checking the EOS token.

  3. train_on_eos -> only checks the EOS token to mask or unmask.

It is assumed that there is only one eot_token per turn.
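The resolution order above can be sketched as follows. This is a minimal illustration, assuming the train-on values are simple strings such as "all" or "turn"; the helper name resolve_eot_settings is hypothetical, not the PR's actual code.

```python
def resolve_eot_settings(eot_tokens, train_on_eot, train_on_eos, eos_token):
    """Sketch of the resolution order described above (names illustrative).

    Returns the delimiter tokens to search for and the masking mode to apply.
    """
    # Step 1: prefer explicit eot_tokens; otherwise fall back to the EOS token.
    tokens = list(eot_tokens) if eot_tokens else [eos_token]
    # Step 2: train_on_eot governs all eot_tokens; if unset, inherit the
    # behavior of train_on_eos (step 3, which only ever applied to EOS).
    mode = train_on_eot if train_on_eot is not None else train_on_eos
    return tokens, mode
```

With neither new field set, resolve_eot_settings(None, None, "turn", "</s>") yields (["</s>"], "turn"), which matches the legacy behavior described below.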

Backwards Compatibility

If train_on_eot and eot_tokens aren't provided, their values are loaded from train_on_eos and tokenizer.eos_token respectively.

This preserves the legacy behavior.

If either train_on_eot or eot_tokens is provided, it will not be loaded from the respective legacy variable.
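The fallback rule can be shown in two cases. This is an illustrative sketch of the inheritance logic only (the field names come from this PR; the expressions are not the actual implementation):

```python
# Legacy fallback sketch: the new fields inherit from the old ones only
# when they are left unset.
train_on_eos = "turn"    # pre-existing field
tokenizer_eos = "</s>"

# Case 1: neither new field is set -> full legacy behavior.
eot_tokens, train_on_eot = None, None
assert (eot_tokens or [tokenizer_eos]) == ["</s>"]
assert (train_on_eot if train_on_eot is not None else train_on_eos) == "turn"

# Case 2: eot_tokens is set -> it is used as-is with no EOS fallback,
# while train_on_eot still inherits from train_on_eos independently.
eot_tokens = ["[/INST]"]
assert (eot_tokens or [tokenizer_eos]) == ["[/INST]"]
```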

New Errors Thrown

  • eot_tokens contains eos_token and there is a conflict between train_on_eot and train_on_eos <- ValueError
  • EOS token not found in chat_template <- log warning
  • EOT tokens not found in chat_template <- log warning
  • EOT tokens not added to the tokenizer <- ValueError
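A minimal sketch of the validation these errors imply (the function and parameter names here are hypothetical, not the PR's actual code):

```python
import logging

log = logging.getLogger(__name__)

def validate_eot_config(eot_tokens, train_on_eot, train_on_eos,
                        eos_token, chat_template, tokenizer_vocab):
    # ValueError: eos_token appears in eot_tokens while the two
    # train_on_* settings disagree about how to treat it.
    if eos_token in eot_tokens and train_on_eot != train_on_eos:
        raise ValueError(
            "eot_tokens contains eos_token, but train_on_eot conflicts "
            "with train_on_eos"
        )
    # Log warnings: tokens absent from the template can never be matched.
    if eos_token not in chat_template:
        log.warning("EOS token %r not found in chat_template", eos_token)
    for tok in eot_tokens:
        if tok not in chat_template:
            log.warning("EOT token %r not found in chat_template", tok)
    # ValueError: every EOT token must already exist in the tokenizer.
    missing = [tok for tok in eot_tokens if tok not in tokenizer_vocab]
    if missing:
        raise ValueError(f"EOT tokens not added to tokenizer: {missing}")
```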

Alternatives

There was an initial discussion about using tokenizer.eos_token: List[str], which applies during generation; however, it is only a str in other modes.

Discussion / ToDos

Questions:

  • Should we support multiple eot_tokens per turn? A: No, single, as that's the common case.
  • Do we need to enforce single-token eot_tokens only? A: Yes; keep as-is.
  • Should an EOT token missing from chat_template produce a warning or just a debug log? A: Keep as a warning for now; no need to escalate to an error.
  • How do we handle chat_templates (like Mistral's) with a default system prompt plus training on the system role? We won't be able to detect that role during turn building if the system turn doesn't exist. A: Since this is an uncommon case, we can ignore it.
  • Automatic EOT token detection, or pre-populating with common delimiters? A: No, let users specify.
  • The current implementation feels too heavy / has too many extra fields to handle this masking. Could we make it simpler? A: The defaults do not require specifying train_on___ (at most eot_tokens:), so normal users won't need to touch this.

Docs:

  • Update to clarify new fields with examples

Tests to add:

  • Check the conflict where eos_token is in eot_tokens and train_on_eos / train_on_eot mismatch
  • Test backward compatibility: eot_tokens inherits eos_token
  • Check tokens that do not exist in the template
  • ~~Test trainable with system role~~ Unneeded, as it is an edge case we don't need to worry about.
