
Extend Chat Template Tokenization for Training/Finetuning #27609

Open
siddk opened this issue Nov 20, 2023 · 10 comments
Labels
Feature request Request for a new feature

Comments

@siddk
Contributor

siddk commented Nov 20, 2023

Feature request

Extend tokenizer.apply_chat_template with functionality for training/finetuning, returning attention_masks and (optional) labels (for ignoring "System" and "User" messages during loss computation).

I think this requires the following steps:

  • Adding support for taking in a batch of conversations (e.g., List[Conversation], where Conversation := List[Dict[str, str]]).
  • Invoking the native tokenizer.__call__() after applying the template to each example (passing through padding, truncation, and any other parameters).
  • Important: Adding an optional output for labels -- a "masked" version of the returned input_ids with tokens corresponding to the System/User roles set to be ignored for loss computation (e.g., set to IGNORE_INDEX = -100); see the sketch after this list.
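
To make the third step concrete, here is a rough sketch of how the labels output could be approximated today, outside the tokenizer (the helper name and IGNORE_INDEX are illustrative, not an existing API, and it assumes the rendered conversation prefix is a string prefix of the full rendering, which holds for most templates):

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss

def build_training_example(tokenizer, conversation):
    """Tokenize a rendered chat and keep labels only for tokens produced by
    assistant turns, masking everything else with IGNORE_INDEX."""
    full_ids = tokenizer.apply_chat_template(conversation, tokenize=True)
    labels = [IGNORE_INDEX] * len(full_ids)

    # Render the conversation prefix by prefix; the tokens each assistant
    # message adds are the ones that should contribute to the loss.
    prev_len = 0
    for i, message in enumerate(conversation):
        prefix_ids = tokenizer.apply_chat_template(conversation[: i + 1], tokenize=True)
        end = min(len(prefix_ids), len(full_ids))
        if message["role"] == "assistant":
            labels[prev_len:end] = full_ids[prev_len:end]
        prev_len = end

    return {"input_ids": full_ids, "labels": labels}
```

Batching (step 1) and attention_masks (step 2) could then be handled by padding these per-example outputs with the tokenizer's usual padding machinery.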

Motivation

The new tokenizer.apply_chat_template feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.

However, right now it's geared for inference-time usage, only taking a single "conversation" and outputting the input_ids (tokens) after applying the chat template.

When finetuning models on chat-based data, it would be really nice to unify the apply_chat_template API with the tokenizer.__call__() API, returning attention_masks and (optionally) labels (with "System" and "User" role text automatically ignored for loss computation).

Your contribution

I can try building a proof-of-concept for a "standard" workflow and Draft PR; I think there'd need to be a few discussions about the actual implementation details though!

@ArthurZucker added the Feature request label on Nov 21, 2023
@ArthurZucker
Collaborator

FYI @Rocketknight1

This would also need support for chat templates in tokenizers IMO

@Rocketknight1
Member

Hey @siddk, this definitely seems like a good suggestion, and mirrors suggestions I got from e.g. @philschmid!

The first step is relatively easy: we could just check the first element of the input to figure out whether it's a single conversation or a list of them, and the second step is similar, although we might have to consider backward compatibility.
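
For the record, that check could be as small as something like this (purely illustrative, not existing code):

```python
def _is_batch_of_conversations(chat):
    # A single conversation is a list of message dicts; a batch is a list of
    # such lists, so peeking at the first element is enough to tell them apart.
    return bool(chat) and isinstance(chat[0], (list, tuple))
```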

The third is tricky, though - I definitely understand why it's important, but given that the templates can be arbitrary, I'm not sure how we can do that automatically for any template!

@khaimt
Contributor

khaimt commented Nov 30, 2023

For the third step, I think we need to define an assistant_prefix & assistant_stop. The chat_template alone is not enough to detect which parts of the prompt are assistant content. If we know assistant_prefix & assistant_stop, we can unmask all tokens between them.
For example, assume that assistant_prefix="\nAssistant:\n" and assistant_stop=""
prompt = "...\nAssistant:\nHi, I am here to help you" --> unmask the tokens of "Hi, I am here to help you" and mask all other tokens with -100
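
A rough sketch of that idea, using a fast tokenizer's offset mapping to map those character spans back onto tokens (the helper name and the default prefix/stop strings are only illustrative, and a slow tokenizer won't return offsets):

```python
IGNORE_INDEX = -100

def mask_non_assistant(tokenizer, rendered_prompt,
                       assistant_prefix="\nAssistant:\n", assistant_stop="</s>"):
    """Keep labels only for tokens whose character span falls inside an
    assistant response delimited by assistant_prefix / assistant_stop."""
    # Requires a fast tokenizer so character offsets are available.
    enc = tokenizer(rendered_prompt, return_offsets_mapping=True, add_special_tokens=False)

    # Collect the character ranges covered by assistant responses.
    spans, cursor = [], 0
    while (start := rendered_prompt.find(assistant_prefix, cursor)) != -1:
        content_start = start + len(assistant_prefix)
        end = rendered_prompt.find(assistant_stop, content_start)
        end = end if end != -1 else len(rendered_prompt)
        spans.append((content_start, end))
        cursor = end

    labels = [
        tok if any(s <= a and b <= e for s, e in spans) else IGNORE_INDEX
        for tok, (a, b) in zip(enc["input_ids"], enc["offset_mapping"])
    ]
    return enc["input_ids"], labels
```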

@Rocketknight1
Member

Lost track of this over the holidays, bumping it and putting it back on my list to deal with soon

@Rocketknight1
Member

Quick update on this one - after #28945 you can now set return_dict=True when calling apply_chat_template to get other tokenizer outputs like attention mask. I'll add batch support soon, but automatic 'label' masking is trickier. I'll see if I can figure something out!
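
For anyone following along, usage after that change looks roughly like this (model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
encoded = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True)
print(encoded.keys())  # expected to include "input_ids" and "attention_mask"
```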

@Rocketknight1
Member

A more complete fix is open at #29222

@haochen806

Hi @Rocketknight1, have you figured out the label masking yet?

@radulescupetru

@haochen806 , @Rocketknight1 There's a PR for that here: #30650
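
If I'm reading that PR right, it adds a return_assistant_tokens_mask flag to apply_chat_template, which only works when the chat template marks assistant turns with {% generation %} ... {% endgeneration %} blocks. Usage would look roughly like the following (model id is a placeholder and the details are based on the PR as proposed, so they may differ):

```python
from transformers import AutoTokenizer

# Placeholder model id: the flag needs a chat template with {% generation %} blocks.
tokenizer = AutoTokenizer.from_pretrained("some-org/model-with-generation-keyword")
chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
encoded = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # flag proposed in #30650
)
# encoded["assistant_masks"] is 1 for tokens produced by assistant turns and 0
# elsewhere, so training labels can be built by masking the zeros with -100.
labels = [
    tok if keep else -100
    for tok, keep in zip(encoded["input_ids"], encoded["assistant_masks"])
]
```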

@sadra-barikbin
Contributor

sadra-barikbin commented Aug 31, 2024

Hi @Rocketknight1, regarding truncation: because it takes place after the template has been rendered, the chat special tokens at the end are discarded when the input is truncated. Is this the desired behavior for finetuning?

@Rocketknight1
Member

Hi @sadra-barikbin, we don't have automatic 'smart' truncation for chats. I guess this would look like discarding whole messages earlier in the chat or something, but there's no clean way to do that that doesn't make assumptions about the data, which we prefer to avoid.

Right now, if you truncate your tokenization after applying a chat template, your output will probably be missing end-of-sequence tokens and the model will generate a continuation of the final truncated message. I advise being very careful with combining truncation and chat templating as a result!
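
To make the failure mode concrete, here's roughly what that looks like (model name and length limit are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [
    {"role": "user", "content": "Summarize the plot of Hamlet."},
    {"role": "assistant", "content": "Hamlet is a tragedy about a Danish prince..."},
]
text = tokenizer.apply_chat_template(chat, tokenize=False)
truncated = tokenizer(text, truncation=True, max_length=24, add_special_tokens=False)
# Truncation happens on the already-rendered string, so the final tokens
# (including the template's end-of-turn / EOS markers) are simply cut off,
# and a model trained or prompted on this will tend to keep generating past
# the assistant message instead of stopping.
```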
