
Extend Chat Template Tokenization for Training/Finetuning #27609

Open
siddk opened this issue Nov 20, 2023 · 10 comments
Labels
Feature request Request for a new feature

Comments

@siddk
Contributor

siddk commented Nov 20, 2023

Feature request

Extend tokenizer.apply_chat_template with functionality for training/finetuning, returning attention_masks and (optional) labels (for ignoring "System" and "User" messages during loss computation).

I think this requires the following steps:

  • Adding support for taking in a batch of conversations (e.g., List[Conversation], where Conversation := List[Dict[str, str]]).
  • Invoking the native tokenizer.__call__() after applying the template to each example (passing through padding, truncation, and any other parameters).
  • Important: Adding an optional output for labels -- a "masked" version of the returned input_ids with tokens corresponding to the System/User roles set to be ignored for loss computation (e.g., set to IGNORE_INDEX = -100); see the sketch after this list.
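
To make the third step concrete, here is a rough sketch of how the labels output could be approximated today, outside the tokenizer (the helper name and IGNORE_INDEX are illustrative, not an existing API, and it assumes the rendered conversation prefix is a string prefix of the full rendering, which holds for most templates):

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's CrossEntropyLoss

def build_training_example(tokenizer, conversation):
    """Tokenize a rendered chat and keep labels only for tokens produced by
    assistant turns, masking everything else with IGNORE_INDEX."""
    full_ids = tokenizer.apply_chat_template(conversation, tokenize=True)
    labels = [IGNORE_INDEX] * len(full_ids)

    # Render the conversation prefix by prefix; the tokens each assistant
    # message adds are the ones that should contribute to the loss.
    prev_len = 0
    for i, message in enumerate(conversation):
        prefix_ids = tokenizer.apply_chat_template(conversation[: i + 1], tokenize=True)
        end = min(len(prefix_ids), len(full_ids))
        if message["role"] == "assistant":
            labels[prev_len:end] = full_ids[prev_len:end]
        prev_len = end

    return {"input_ids": full_ids, "labels": labels}
```

Batching (step 1) and attention_masks (step 2) could then be handled by padding these per-example outputs with the tokenizer's usual padding machinery.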

Motivation

The new tokenizer.apply_chat_template feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.

However, right now it's geared for inference-time usage, only taking a single "conversation" and outputting the input_ids (tokens) after applying the chat template.

When finetuning models on chat-based data, it would be really nice to unify the apply_chat_template API with the tokenizer.__call__() API, returning attention_masks and (optionally) labels (with "System" and "User" role text automatically ignored for loss computation).

Your contribution

I can try building a proof-of-concept for a "standard" workflow and Draft PR; I think there'd need to be a few discussions about the actual implementation details though!

@ArthurZucker added the Feature request label on Nov 21, 2023
@ArthurZucker
Collaborator

FYI @Rocketknight1

This would also need support for chat templates in tokenizers IMO

@Rocketknight1
Member

Hey @siddk, this definitely seems like a good suggestion, and mirrors suggestions I got from e.g. @philschmid!

The first step is relatively easy: we could just check the first element of the input to figure out whether it's a single conversation or a list of them, and the second step is similar, although we might have to consider backward compatibility.
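
For the record, that check could be as small as something like this (purely illustrative, not existing code):

```python
def _is_batch_of_conversations(chat):
    # A single conversation is a list of message dicts; a batch is a list of
    # such lists, so peeking at the first element is enough to tell them apart.
    return bool(chat) and isinstance(chat[0], (list, tuple))
```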

The third is tricky, though - I definitely understand why it's important, but given that the templates can be arbitrary, I'm not sure how we can do that automatically for any template!

@khaimt
Contributor

khaimt commented Nov 30, 2023

For the third step, I think we need to define an assistant_prefix & assistant_stop. The chat_template alone is not enough to detect which parts of the prompt are assistant content. If we know assistant_prefix & assistant_stop, we can unmask all tokens between them.
For example, assume that assistant_prefix="\nAssistant:\n" and assistant_stop=""
prompt = "...\nAssistant:\nHi, I am here to help you" --> unmask the tokens of "Hi, I am here to help you" and mask all other tokens with -100
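
A rough sketch of that idea, using a fast tokenizer's offset mapping to map those character spans back onto tokens (the helper name and the default prefix/stop strings are only illustrative, and a slow tokenizer won't return offsets):

```python
IGNORE_INDEX = -100

def mask_non_assistant(tokenizer, rendered_prompt,
                       assistant_prefix="\nAssistant:\n", assistant_stop="</s>"):
    """Keep labels only for tokens whose character span falls inside an
    assistant response delimited by assistant_prefix / assistant_stop."""
    # Requires a fast tokenizer so character offsets are available.
    enc = tokenizer(rendered_prompt, return_offsets_mapping=True, add_special_tokens=False)

    # Collect the character ranges covered by assistant responses.
    spans, cursor = [], 0
    while (start := rendered_prompt.find(assistant_prefix, cursor)) != -1:
        content_start = start + len(assistant_prefix)
        end = rendered_prompt.find(assistant_stop, content_start)
        end = end if end != -1 else len(rendered_prompt)
        spans.append((content_start, end))
        cursor = end

    labels = [
        tok if any(s <= a and b <= e for s, e in spans) else IGNORE_INDEX
        for tok, (a, b) in zip(enc["input_ids"], enc["offset_mapping"])
    ]
    return enc["input_ids"], labels
```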

@Rocketknight1
Member

Lost track of this over the holidays, bumping it and putting it back on my list to deal with soon

@Rocketknight1
Member

Quick update on this one - after #28945 you can now set return_dict=True when calling apply_chat_template to get other tokenizer outputs like attention mask. I'll add batch support soon, but automatic 'label' masking is trickier. I'll see if I can figure something out!
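
For anyone following along, usage after that change looks roughly like this (model name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
encoded = tokenizer.apply_chat_template(chat, tokenize=True, return_dict=True)
print(encoded.keys())  # expected to include "input_ids" and "attention_mask"
```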

@Rocketknight1
Member

A more complete fix is open at #29222

@haochen806

Hi @Rocketknight1, have you figured out the label masking yet?

@radulescupetru

@haochen806 , @Rocketknight1 There's a PR for that here: #30650
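
If I'm reading that PR right, it adds a return_assistant_tokens_mask flag to apply_chat_template, which only works when the chat template marks assistant turns with {% generation %} ... {% endgeneration %} blocks. Usage would look roughly like the following (model id is a placeholder and the details are based on the PR as proposed, so they may differ):

```python
from transformers import AutoTokenizer

# Placeholder model id: the flag needs a chat template with {% generation %} blocks.
tokenizer = AutoTokenizer.from_pretrained("some-org/model-with-generation-keyword")
chat = [
    {"role": "user", "content": "Hi there!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
encoded = tokenizer.apply_chat_template(
    chat,
    tokenize=True,
    return_dict=True,
    return_assistant_tokens_mask=True,  # flag proposed in #30650
)
# encoded["assistant_masks"] is 1 for tokens produced by assistant turns and 0
# elsewhere, so training labels can be built by masking the zeros with -100.
labels = [
    tok if keep else -100
    for tok, keep in zip(encoded["input_ids"], encoded["assistant_masks"])
]
```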

@sadra-barikbin
Contributor

sadra-barikbin commented Aug 31, 2024

Hi @Rocketknight1, regarding truncation: because it takes place after the template has been rendered, the chat special tokens at the end are discarded when the input is truncated. Is this the desired behavior for finetuning?

@Rocketknight1
Member

Hi @sadra-barikbin, we don't have automatic 'smart' truncation for chats. I guess this would look like discarding whole messages earlier in the chat or something, but there's no clean way to do that that doesn't make assumptions about the data, which we prefer to avoid.

Right now, if you truncate your tokenization after applying a chat template, your output will probably be missing end-of-sequence tokens and the model will generate a continuation of the final truncated message. I advise being very careful with combining truncation and chat templating as a result!
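
To make the failure mode concrete, here's roughly what that looks like (model name and length limit are arbitrary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat = [
    {"role": "user", "content": "Summarize the plot of Hamlet."},
    {"role": "assistant", "content": "Hamlet is a tragedy about a Danish prince..."},
]
text = tokenizer.apply_chat_template(chat, tokenize=False)
truncated = tokenizer(text, truncation=True, max_length=24, add_special_tokens=False)
# Truncation happens on the already-rendered string, so the final tokens
# (including the template's end-of-turn / EOS markers) are simply cut off,
# and a model trained or prompted on this will tend to keep generating past
# the assistant message instead of stopping.
```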
