feat: add eot_tokens and train_on_eot for chat_template EOT parsing #2364

Draft · wants to merge 6 commits into main

@NanoCode012 (Collaborator) commented Feb 26, 2025

Description

Introduce new eot_tokens and train_on_eot fields to extend the capabilities of chat_template prompting to support an arbitrary number of delimiters.

Adds a token check within the Jinja template to ensure that the eos_token and eot_tokens exist.

Motivation and Context

Newer templates use separate tokens for EOT (end of turn) and EOS (end of sequence), which causes confusion around the naming and usage of the prior fields.

For example, the MistralV7-Tekken formatted prompt below uses [/SYSTEM_PROMPT], [/INST], and </s> to delimit different sections of the prompt.

<s>[SYSTEM_PROMPT]<system prompt>[/SYSTEM_PROMPT][INST]<user message>[/INST]<assistant response></s>[INST]<user message>[/INST]

The current chat_template parser expects only one delimiter.

Furthermore, this will allow users to keep the original EOS token when fine-tuning instead of overwriting it with the EOT token.
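To make the new fields concrete, here is a hypothetical config sketch for the Tekken-style template above, written as a Python dict. The field names eot_tokens and train_on_eot come from this PR; the concrete values and the "turn" mode name are illustrative assumptions, not the PR's actual defaults.

```python
# Hypothetical config sketch for the MistralV7-Tekken example above.
# `eot_tokens` and `train_on_eot` are the new fields; values are illustrative.
config = {
    "chat_template": "tokenizer_default",           # assumed: use the tokenizer's template
    "eot_tokens": ["[/SYSTEM_PROMPT]", "[/INST]"],  # per-section delimiters
    "train_on_eot": "turn",                         # assumed mode name, mirrors train_on_eos
    # EOS keeps its original end-of-sequence meaning instead of being
    # overwritten with an EOT token:
    "special_tokens": {"eos_token": "</s>"},
}

# The EOS token is deliberately not listed in eot_tokens, so train_on_eos
# and train_on_eot can be set independently without conflict.
assert config["special_tokens"]["eos_token"] not in config["eot_tokens"]
```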

Implementation

The new code follows this order:

  1. Check that the eot_tokens exist in the template. If not provided, check whether EOS appears in the template, either hardcoded or as the eos_token variable.

  2. train_on_eot -> applies to all eot_tokens if they are provided; otherwise it falls back to checking the EOS token.

  3. train_on_eos -> only checks the EOS token to mask or unmask.

It is assumed that there is only one eot_token per turn.
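The resolution order above can be sketched as follows. This is a minimal illustration, assuming the train-on values are simple strings such as "all" or "turn"; the helper name resolve_eot_settings is hypothetical, not the PR's actual code.

```python
def resolve_eot_settings(eot_tokens, train_on_eot, train_on_eos, eos_token):
    """Sketch of the resolution order described above (names illustrative).

    Returns the delimiter tokens to search for and the masking mode to apply.
    """
    # Step 1: prefer explicit eot_tokens; otherwise fall back to the EOS token.
    tokens = list(eot_tokens) if eot_tokens else [eos_token]
    # Step 2: train_on_eot governs all eot_tokens; if unset, inherit the
    # behavior of train_on_eos (step 3, which only ever applied to EOS).
    mode = train_on_eot if train_on_eot is not None else train_on_eos
    return tokens, mode
```

With neither new field set, resolve_eot_settings(None, None, "turn", "</s>") yields (["</s>"], "turn"), which matches the legacy behavior described below.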

Backwards Compatibility

If train_on_eot and eot_tokens aren't provided, their values are loaded from train_on_eos and tokenizer.eos_token respectively.

This preserves the legacy behavior.

If either train_on_eot or eot_tokens is provided, it will not be loaded from the respective legacy variable.
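The fallback rule can be shown in two cases. This is an illustrative sketch of the inheritance logic only (the field names come from this PR; the expressions are not the actual implementation):

```python
# Legacy fallback sketch: the new fields inherit from the old ones only
# when they are left unset.
train_on_eos = "turn"    # pre-existing field
tokenizer_eos = "</s>"

# Case 1: neither new field is set -> full legacy behavior.
eot_tokens, train_on_eot = None, None
assert (eot_tokens or [tokenizer_eos]) == ["</s>"]
assert (train_on_eot if train_on_eot is not None else train_on_eos) == "turn"

# Case 2: eot_tokens is set -> it is used as-is with no EOS fallback,
# while train_on_eot still inherits from train_on_eos independently.
eot_tokens = ["[/INST]"]
assert (eot_tokens or [tokenizer_eos]) == ["[/INST]"]
```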

New Errors Thrown

  • eot_tokens contains eos_token and there is a conflict between train_on_eot and train_on_eos <- ValueError
  • EOS token not found in chat_template <- log warning
  • EOT tokens not found in chat_template <- log warning
  • EOT tokens not added to the tokenizer <- ValueError
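A minimal sketch of the validation these errors imply (the function and parameter names here are hypothetical, not the PR's actual code):

```python
import logging

log = logging.getLogger(__name__)

def validate_eot_config(eot_tokens, train_on_eot, train_on_eos,
                        eos_token, chat_template, tokenizer_vocab):
    # ValueError: eos_token appears in eot_tokens while the two
    # train_on_* settings disagree about how to treat it.
    if eos_token in eot_tokens and train_on_eot != train_on_eos:
        raise ValueError(
            "eot_tokens contains eos_token, but train_on_eot conflicts "
            "with train_on_eos"
        )
    # Log warnings: tokens absent from the template can never be matched.
    if eos_token not in chat_template:
        log.warning("EOS token %r not found in chat_template", eos_token)
    for tok in eot_tokens:
        if tok not in chat_template:
            log.warning("EOT token %r not found in chat_template", tok)
    # ValueError: every EOT token must already exist in the tokenizer.
    missing = [tok for tok in eot_tokens if tok not in tokenizer_vocab]
    if missing:
        raise ValueError(f"EOT tokens not added to tokenizer: {missing}")
```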

Alternatives

There was an initial discussion about using tokenizer.eos_token: List[str], which applies during generation; however, it is only a str in other modes.

Discussion / ToDos

Questions:

  • Should we support multiple eot_tokens per turn? A: No, single, as that's the common case.
  • Do we need to enforce single-token eot_tokens only? A: Yes; keep as-is.
  • Should an EOT token missing from chat_template produce a warning or just a debug log? A: Keep as a warning for now; no need to escalate to an error.
  • How do we handle chat_templates (like Mistral's) with a default system prompt plus training on the system role? We won't be able to detect that role during turn building if the system turn doesn't exist. A: Since this is an uncommon case, we can ignore it.
  • Automatic EOT token detection, or pre-populating with common delimiters? A: No, let users specify.
  • The current implementation feels too heavy / has too many extra fields to handle this masking. Could we make it simpler? A: The defaults do not require specifying train_on___ (at most eot_tokens:), so normal users won't need to touch this.

Docs:

  • Update to clarify new fields with examples

Tests to add:

  • Check the conflict where eos_token is in eot_tokens and train_on_eos / train_on_eot mismatch
  • Test backward compatibility: eot_tokens inherits eos_token
  • Check tokens that do not exist in the template
  • ~~Test trainable with system role~~ Unneeded, as it is an edge case we don't need to worry about.
