Add `setup_chat_format` for adding new special tokens to model for training chat models #1242

philschmid · 2024-01-17T16:11:41Z

What does this PR do?

This PR adds a new util method setup_chat_format, which automatically defines the chat_template for a tokenizer, adds special tokens, resizes the model embedding layer (optional to a multiple of 64)

It also introduces the ChatMlSpecialTokens dataclass, which is used in the setup_chat_format. This will make it easy to extend to different formats in the future, e.g., llama, but for now, we only add chatml.

Open Discussions

~~Should we add more dummy tokens to the tokenizer when the embedding layer is extended to a multiple of x? This can lead to downstream issues with llama.cpp~~

HuggingFaceDocBuilderDev · 2024-01-17T16:15:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

younesbelkada

Thanks a lot ! I left some early comments, in addition can you:

document its usage in the SFTTrainer docs?
and also add a slow test here: https://github.com/huggingface/trl/blob/main/tests/slow/test_sft_slow.py#L36 with an end-to-end example
Thanks!

tests/test_model_utils.py

trl/models/utils.py

tests/test_model_utils.py

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

younesbelkada

Thanks a lot !

docs/source/sft_trainer.mdx

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

…aining chat models (huggingface#1242) * first draft * 64 * sourabs suggestion * wip tests * make style happy * add check * docstring * fix docstring * Update tests/test_model_utils.py Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> * move tests * add todo for abstract class * make style happy * add slow tests and imports * add documentation * sft_trainer.mdx aktualisieren Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> --------- Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

philschmid added 6 commits January 17, 2024 12:58

first draft

009b413

64

c8791e9

sourabs suggestion

47cfc4b

wip tests

3e12ef0

make style happy

1aec827

add check

0506440

younesbelkada reviewed Jan 17, 2024

View reviewed changes

tests/test_model_utils.py Outdated Show resolved Hide resolved

trl/models/utils.py Show resolved Hide resolved

trl/models/utils.py Outdated Show resolved Hide resolved

docstring

ee2a6db

younesbelkada reviewed Jan 17, 2024

View reviewed changes

tests/test_model_utils.py Outdated Show resolved Hide resolved

philschmid and others added 6 commits January 17, 2024 16:20

fix docstring

e4f77a3

Update tests/test_model_utils.py

02f44b8

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

move tests

91421db

add todo for abstract class

bff6e7f

make style happy

2644d74

add slow tests and imports

98ed601

philschmid mentioned this pull request Jan 17, 2024

How does SFTTrainer handle instruction formatted datasets when a tokenizer has no chat_template? #1233

Closed

add documentation

d585834

younesbelkada approved these changes Jan 18, 2024

View reviewed changes

docs/source/sft_trainer.mdx Outdated Show resolved Hide resolved

sft_trainer.mdx aktualisieren

b84b6ca

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

younesbelkada merged commit 928d144 into huggingface:main Jan 18, 2024
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add `setup_chat_format` for adding new special tokens to model for training chat models #1242

Add `setup_chat_format` for adding new special tokens to model for training chat models #1242

philschmid commented Jan 17, 2024 •

edited

Loading

HuggingFaceDocBuilderDev commented Jan 17, 2024

younesbelkada left a comment

younesbelkada left a comment

Add setup_chat_format for adding new special tokens to model for training chat models #1242

Add setup_chat_format for adding new special tokens to model for training chat models #1242

Conversation

philschmid commented Jan 17, 2024 • edited Loading

What does this PR do?

Open Discussions

HuggingFaceDocBuilderDev commented Jan 17, 2024

younesbelkada left a comment

Choose a reason for hiding this comment

younesbelkada left a comment

Choose a reason for hiding this comment

Add `setup_chat_format` for adding new special tokens to model for training chat models #1242

Add `setup_chat_format` for adding new special tokens to model for training chat models #1242

philschmid commented Jan 17, 2024 •

edited

Loading