Extend Chat Template Tokenization for Training/Finetuning #27609
Comments
FYI @Rocketknight1: this would also need support for chat templates in tokenizers, IMO.
Hey @siddk, this definitely seems like a good suggestion, and mirrors suggestions I got from e.g. @philschmid! The first step is relatively easy - we could just check the first element of the input to figure out if it's a single conversation or a list of them, and the same for the second, although we might have to consider backward compatibility. The third is tricky, though - I definitely understand why it's important, but given that the templates can be arbitrary, I'm not sure how we can do that automatically for any template!
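For illustration, the "check the first element" detection could look roughly like this. This is only a sketch of the idea, not the actual transformers implementation:

```python
def is_batched(conversation):
    # [{"role": ..., "content": ...}, ...]        -> a single conversation
    # [[{"role": ..., "content": ...}, ...], ...] -> a batch of conversations
    return bool(conversation) and isinstance(conversation[0], (list, tuple))
```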
For the third step, I think we need to define the assistant_start_prefix & assistant_stop. The chat_template alone is not enough to detect which part of the prompt is the assistant's content. If we know the assistant_start_prefix & assistant_stop, we can unmask all tokens inside (assistant_prefix, assistant_stop).
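A minimal sketch of that idea, assuming hypothetical assistant_start_prefix / assistant_stop marker strings (these are not existing transformers parameters) and plain Python lists of token ids:

```python
IGNORE_INDEX = -100

def build_labels(input_ids, tokenizer, assistant_start_prefix, assistant_stop):
    """Return labels where only tokens between an assistant start prefix and the
    following stop marker keep their ids; everything else is set to IGNORE_INDEX.
    Caveat: marker strings may tokenize differently in context, so a robust
    version would need offset mappings rather than raw id matching."""
    start_ids = tokenizer.encode(assistant_start_prefix, add_special_tokens=False)
    stop_ids = tokenizer.encode(assistant_stop, add_special_tokens=False)

    labels = [IGNORE_INDEX] * len(input_ids)
    i = 0
    while i < len(input_ids):
        # Look for the start of an assistant turn.
        if input_ids[i : i + len(start_ids)] == start_ids:
            j = i + len(start_ids)
            while j < len(input_ids):
                labels[j] = input_ids[j]  # unmask assistant tokens (and the stop marker)
                if j + 1 >= len(stop_ids) and input_ids[j - len(stop_ids) + 1 : j + 1] == stop_ids:
                    break
                j += 1
            i = j
        i += 1
    return labels
```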
Lost track of this over the holidays, bumping it and putting it back on my list to deal with soon.
Quick update on this one - after #28945 you can now set
A more complete fix is open at #29222
Hi @Rocketknight1, have you figured out
@haochen806, @Rocketknight1 There's a PR for that here: #30650
Hi @Rocketknight1, regarding truncation: the chat special tokens at the end are discarded when the input is truncated, since truncation takes place after the template is rendered. Is this the desired behavior for finetuning?
Hi @sadra-barikbin, we don't have automatic 'smart' truncation for chats. I guess this would look like discarding whole messages earlier in the chat or something, but there's no clean way to do that that doesn't make assumptions about the data, which we prefer to avoid. Right now, if you truncate your tokenization after applying a chat template, your output will probably be missing end-of-sequence tokens and the model will generate a continuation of the final truncated message. I advise being very careful with combining truncation and chat templating as a result!
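To illustrate the caveat, the snippet below shows that truncating after templating cuts tokens from the end, so the template's closing end-of-turn/EOS tokens are the first to disappear. The checkpoint and max_length are placeholders chosen for the example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint
chat = [
    {"role": "user", "content": "Summarize the plot of Hamlet."},
    {"role": "assistant", "content": "A Danish prince avenges his father."},
]

text = tokenizer.apply_chat_template(chat, tokenize=False)
full = tokenizer(text, add_special_tokens=False)["input_ids"]
truncated = tokenizer(text, add_special_tokens=False, truncation=True, max_length=24)["input_ids"]

# The truncated sequence no longer ends with the template's closing tokens, so a
# model finetuned on it would learn to continue past the last (cut-off) turn.
print(tokenizer.decode(full[-5:]))
print(tokenizer.decode(truncated[-5:]))
```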
Feature request

Extend `tokenizer.apply_chat_template` with functionality for training/finetuning, returning `attention_masks` and (optional) `labels` (for ignoring "System" and "User" messages during loss computation). I think this requires the following steps:

1. Accepting batches of conversations (i.e., `List[Conversation := List[Dict[str, str]]]`).
2. Calling `tokenizer.__call__()` after applying the template to each example (passing through padding, truncation, and any other parameters) -- steps 1-2 are sketched just after this list.
3. Returning `labels` -- a "masked" version of the returned `input_ids` with tokens corresponding to the System/User roles set to be ignored for loss computation (e.g., set to `IGNORE_INDEX = -100`).
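A minimal sketch of what steps 1-2 could look like with the current API: render each conversation with `tokenize=False`, then batch-tokenize via `__call__`. The checkpoint name is only an example, and step 3's `labels` is exactly the piece that still has to be built by hand:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # example checkpoint

conversations = [
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": "4."},
    ],
    [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7."},
    ],
]

# Step 1-2: apply the template per conversation, then batch-tokenize with the usual options.
texts = [tokenizer.apply_chat_template(c, tokenize=False) for c in conversations]
batch = tokenizer(texts, padding=True, truncation=True, add_special_tokens=False, return_tensors="pt")

# batch["input_ids"] and batch["attention_mask"] are ready for training;
# `labels` (with system/user tokens set to -100) would still need to be constructed manually.
```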
Motivation

The new `tokenizer.apply_chat_template` feature is great, and resolves a lot of ambiguity when it comes to formatting inputs for chat-based LLMs.

However, right now it's geared for inference-time usage, only taking a single "conversation" and outputting the `input_ids` (tokens) after applying the chat template.

When finetuning models on chat-based data, it would be really nice to unify the `apply_chat_template` API with the `tokenizer.__call__()` API, returning `attention_masks` and (optionally) `labels` (with "System" and "User" role text automatically ignored for loss computation).

Your contribution
I can try building a proof-of-concept for a "standard" workflow and open a Draft PR; I think there'd need to be a few discussions about the actual implementation details, though!