Skip to content

Commit

Permalink
Repeating an important warning in the chat template docs (huggingface…
Browse files Browse the repository at this point in the history
…#31796)

* Repeating an important warning in the chat template docs

* Update docs/source/en/chat_templating.md

Co-authored-by: Lysandre Debut <hi@lysand.re>

* Reword for clarity

* Reword for clarity

---------

Co-authored-by: Lysandre Debut <hi@lysand.re>
  • Loading branch information
Rocketknight1 and LysandreJik authored Jul 5, 2024
1 parent 1d3eaa6 commit e786844
Showing 1 changed file with 12 additions and 1 deletion.
13 changes: 12 additions & 1 deletion docs/source/en/chat_templating.md
Original file line number Diff line number Diff line change
Expand Up @@ -199,7 +199,8 @@ effect that `add_generation_prompt` has will depend on the template being used.

## Can I use chat templates in training?

Yes! We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
Yes! This is a good way to ensure that the chat template matches the tokens the model sees during training.
We recommend that you apply the chat template as a preprocessing step for your dataset. After this, you
can simply continue like any other language model training task. When training, you should usually set
`add_generation_prompt=False`, because the added tokens to prompt an assistant response will not be helpful during
training. Let's see an example:
Expand Down Expand Up @@ -233,6 +234,16 @@ The sun.</s>

From here, just continue training like you would with a standard language modelling task, using the `formatted_chat` column.

<Tip>
If you format text with `apply_chat_template(tokenize=False)` and then tokenize it in a separate step, you should set the argument
`add_special_tokens=False`. If you use `apply_chat_template(tokenize=True)`, you don't need to worry about this!

By default, some tokenizers add special tokens like `<bos>` and `<eos>` to text they tokenize. Chat templates should
always include all of the special tokens they need, and so adding extra special tokens with
the default `add_special_tokens=True` can result in incorrect or duplicated special tokens, which will hurt model
performance.
</Tip>

## Advanced: Extra inputs to chat templates

The only argument that `apply_chat_template` requires is `messages`. However, you can pass any keyword
Expand Down

0 comments on commit e786844

Please sign in to comment.