
Update incorrect data processing in DataCollatorForChatML #2172

Merged
merged 31 commits into huggingface:main on Oct 10, 2024

Conversation

ruijunfeng
Contributor

What does this PR do?

Fix the extra BOS token and the missing EOS token in the returned input_ids, and the potentially missing target string in the returned labels.

Fixes #2169
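For context, the bug can be sketched with a toy tokenizer (all names and token ids here are illustrative, not the actual TRL code): the chat template output already starts with BOS, so tokenizing it again with special tokens enabled duplicates the BOS token.

```python
# Toy sketch of the BOS/EOS issue (illustrative ids, not the real tokenizer).
BOS, EOS = 1, 2

def encode(ids, add_special_tokens=True):
    # Stand-in for tokenizer(...): optionally prepends a BOS token.
    return ([BOS] if add_special_tokens else []) + list(ids)

# apply_chat_template(...) output already starts with BOS and ends with EOS.
formatted = [BOS, 10, 11, 12, EOS]

buggy = encode(formatted)                            # duplicates BOS
fixed = encode(formatted, add_special_tokens=False)  # keeps exactly one BOS

assert buggy[:2] == [BOS, BOS]               # the bug: two BOS tokens
assert fixed[0] == BOS and fixed[-1] == EOS  # the fix: one BOS, trailing EOS
```

The fix in the PR follows the same idea: tokenize the already-templated string without adding special tokens again.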

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@kashif
Collaborator

kashif commented Oct 4, 2024

awesome @ruijunfeng can we also have a test for this?

@ruijunfeng
Contributor Author

ruijunfeng commented Oct 4, 2024

awesome @ruijunfeng can we also have a test for this?

Sure thing. I have tested it on the instruct-tuned versions of the Llama 2 and Gemma 1 series with my own dataset, and it seems to work well. Let me know if you need me to provide anything else.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@kashif
Collaborator

kashif commented Oct 6, 2024

sorry for the misunderstanding, I meant something like:

import unittest

from transformers import AutoTokenizer

from trl.trainer.utils import DataCollatorForChatML


class TestDataCollatorForChatML(unittest.TestCase):
    def setUp(self):
        self.tokenizer = AutoTokenizer.from_pretrained("gpt2")
        self.tokenizer.pad_token = self.tokenizer.eos_token
        self.collator = DataCollatorForChatML(tokenizer=self.tokenizer, max_length=20)

    def test_data_collator(self):
        examples = [
            {
                "messages": [
                    {"role": "user", "content": "Hello!"},
                    {"role": "assistant", "content": "Hi there! How can I help you today?"},
                    {"role": "user", "content": "What's the weather like?"},
                    {"role": "assistant", "content": "I'm sorry, but I don't have access to real-time weather information."},
                ]
            },
            {
                "messages": [
                    {"role": "user", "content": "Tell me a joke."},
                    {"role": "assistant", "content": "Why don't scientists trust atoms? Because they make up everything!"},
                ]
            }
        ]

        batch = self.collator(examples)

        self.assertIn("input_ids", batch)
        self.assertIn("attention_mask", batch)
        self.assertIn("labels", batch)
        self.assertIn("prompts", batch)
        self.assertIn("prompt_attention_mask", batch)

        self.assertEqual(batch["input_ids"].shape[0], 2)
        self.assertEqual(batch["attention_mask"].shape[0], 2)
        self.assertEqual(batch["labels"].shape[0], 2)
        self.assertEqual(batch["prompts"].shape[0], 2)
        self.assertEqual(batch["prompt_attention_mask"].shape[0], 2)

        # Check if the shapes are consistent
        self.assertEqual(batch["input_ids"].shape, batch["attention_mask"].shape)
        self.assertEqual(batch["input_ids"].shape, batch["labels"].shape)
        self.assertEqual(batch["prompts"].shape, batch["prompt_attention_mask"].shape)

        # Check if the prompts are shorter than or equal to the full input
        self.assertLessEqual(batch["prompts"].shape[1], batch["input_ids"].shape[1])

so we can explicitly check for the incorrect data processing and the fix you so kindly provided

@kashif kashif added the 🏋 GKD Related to GKD label Oct 6, 2024
@ruijunfeng
Contributor Author

Hi there, I have run your test code, and I think it has a small mistake. You are using the tokenizer for GPT-2:

self.tokenizer = AutoTokenizer.from_pretrained("gpt2")

However, GPT-2 does not have a default chat_template, so this line of DataCollatorForChatML will raise an error:

self.tokenizer.apply_chat_template(messages, tokenize=False)

I believe the correct way to test this is by manually setting the chat_template for the tokenizer, like this in your setup function:

tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}{% endfor %}{{ eos_token }}"

Alternatively, you could use an instruction-tuned model, such as a Llama instruct variant, whose tokenizer has a default chat_template.
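To see what string the manual template above produces, it can be emulated in plain Python (the special-token strings here are toy stand-ins; the real values come from the tokenizer):

```python
# Emulate the Jinja chat template with plain Python (toy special tokens).
bos_token, eos_token = "<s>", "</s>"
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]
# Mirrors: {{ bos_token }}{% for message in messages %}
#          {{ message['role'] }}: {{ message['content'] }}{% endfor %}{{ eos_token }}
rendered = bos_token + "".join(f"{m['role']}: {m['content']}" for m in messages) + eos_token

assert rendered == "<s>user: Hello!assistant: Hi there!</s>"
```

Note that BOS and EOS appear exactly once each in the rendered string, which is what makes the double-BOS bug observable in the collator output.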

@kashif
Collaborator

kashif commented Oct 7, 2024

Sorry again for the misunderstanding. What I wanted to say is that you can use the above as a template to write the tests in your PR. Also, remember to run make precommit to fix the formatting, etc.

Member

@lewtun lewtun left a comment


Thanks a lot for the fix @ruijunfeng ! Would you mind adding a unit test which validates the fix works as expected? This will also help ensure future regressions don't leak into the codebase :)

@kashif kashif added the 🐛 bug Something isn't working label Oct 8, 2024
@kashif
Collaborator

kashif commented Oct 8, 2024

@lewtun i added a test that fails on main and passes here and @ruijunfeng I pushed it into your PR

@ruijunfeng
Contributor Author

@kashif and @lewtun, thank you both for adding the tests and comments. I’ve double-checked the tests and made updates to the assert statements and comments to improve consistency and clarity. Additionally, I noticed that the current check for the EOS token in input_ids only verifies its presence. I have modified it to ensure that the last token of input_ids is the EOS token for a more thorough check.
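The stronger check described here might look something like the following sketch (variable names and token ids are assumptions, not the exact test code):

```python
# Sketch of the stronger EOS check: presence vs. last-token position.
eos_token_id = 2
input_ids = [0, 0, 1, 5, 6, 2]  # toy left-padded sequence ending in EOS

weak_check = eos_token_id in input_ids        # only verifies EOS appears somewhere
strong_check = input_ids[-1] == eos_token_id  # verifies EOS is the final token

assert weak_check and strong_check
```

The weak check would also pass if EOS appeared mid-sequence (e.g. as a pad token) while the sequence ended on something else, which is exactly the failure mode the stronger assertion rules out.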

@kashif
Collaborator

kashif commented Oct 8, 2024

@qgallouedec fixed the test taking padding into account

kashif and others added 4 commits October 8, 2024 17:16
@kashif
Collaborator

kashif commented Oct 9, 2024

@qgallouedec I have:

input_ids[15:]  # the first 15 tokens are padding
[1, 518, 25580, 29962, 1724, 338, 2253, 1135, 22769, 29973, 518, 29914, 25580, 29962, 25685, 29889, 29871, 2]
(Pdb) self.tokenizer(self.tokenizer.apply_chat_template(self.examples[0]["messages"], tokenize=False), add_special_tokens=False)
{'input_ids': [1, 518, 25580, 29962, 1724, 338, 2253, 1135, 22769, 29973, 518, 29914, 25580, 29962, 25685, 29889, 29871, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
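The padding referred to here is left padding, which keeps the real tokens (and the trailing EOS) at the end of the sequence. A minimal sketch, assuming pad id 0 and toy token ids:

```python
# Minimal left-padding sketch (assumed pad id 0, toy token ids).
pad_id, eos_id = 0, 2
seq = [1, 518, 25580, 2]  # toy tokenized conversation ending in EOS
max_length = 7

n_pad = max_length - len(seq)
padded = [pad_id] * n_pad + seq
attention_mask = [0] * n_pad + [1] * len(seq)

assert padded[-len(seq):] == seq  # real tokens sit untouched at the end
assert padded[-1] == eos_id       # EOS remains the last token
```

This is why slicing off the leading pad tokens (as in the pdb session above) recovers exactly the tokenized chat-template output.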

@kashif kashif self-requested a review October 9, 2024 11:07
@kashif
Collaborator

kashif commented Oct 9, 2024

@ruijunfeng can you kindly check with the current refactoring of the datacollator, I have simplified it

@ruijunfeng
Contributor Author

ruijunfeng commented Oct 10, 2024

@kashif Hi there, I still found a small bug in the refactored code. I used the dataset in your unit test and printed out the results like this:

>>> tokenizer.decode(data["input_ids"][0])
'<s><s><s><s><s><s><s><s><s><s><s><s><s><s><s><s>user: What is better than ugly?assistant: Beautiful.</s>'
>>> tokenizer.decode(data["prompts"][0])
'<s><s><s><s><s><s><s>user: What is better than ugly?</s>'
>>> data["labels"][0, -6:]
tensor([ -100, 22137, 29901, 25685, 29889,     2])
>>> tokenizer.decode(data["labels"][0, -5:])
'istant: Beautiful.</s>'

It seems the labels mistakenly include part of "assistant: ", and the prompts are missing the "assistant: " prefix. Also, from my understanding, shouldn't the prompts exclude the EOS token?

@kashif
Collaborator

kashif commented Oct 10, 2024

Just trying to reproduce this on my end; I get the following output from the data collator:

self.tokenizer.decode(input_ids[0])
'</s></s></s></s></s></s></s></s></s></s></s></s></s></s></s><s> [INST] What is better than ugly? [/INST] Beautiful. </s>'
self.tokenizer.decode(prompts_input_ids[0])
'</s></s></s></s></s></s><s> [INST] What is better than ugly? [/INST]'

and the labels are only set for the completion:

labels[0]
tensor([ -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,
         -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100,  -100, 25685,
        29889, 29871,     2])
self.tokenizer.decode(labels[0][-4:])
'Beautiful. </s>'

at which point are you printing the data from?
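The completion-only labels shown in the output above follow the usual pattern of masking every prompt token with -100 so that the loss is computed only on the completion. A toy sketch (the prompt length and token ids are illustrative):

```python
# Toy sketch of completion-only label masking (illustrative prompt length).
IGNORE_INDEX = -100
input_ids = [1, 518, 25580, 29962, 1724, 25685, 29889, 2]
prompt_length = 5  # number of tokens that belong to the prompt

labels = [IGNORE_INDEX] * prompt_length + input_ids[prompt_length:]

assert labels[:prompt_length] == [IGNORE_INDEX] * prompt_length
assert labels[prompt_length:] == input_ids[prompt_length:]  # loss only on the completion
```

The bug being debated is where `prompt_length` ends: if it is computed a token too short, part of the "assistant:" header leaks into the labels, as seen in the earlier output.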

Member

@qgallouedec qgallouedec left a comment


LGTM, I've just added a minor suggestion

@ruijunfeng
Contributor Author

@kashif Sorry, I used the wrong version of the code. I have tried it again and the refactored code is all good.

@kashif kashif merged commit 3107a40 into huggingface:main Oct 10, 2024
9 checks passed
qgallouedec added a commit that referenced this pull request Oct 10, 2024
* Update incorrect data processing in DataCollatorForChatML

Fix the extra BOS token and the absence of an EOS token in the returned input_ids, and potentially the absence of a target string in the returned labels.

* Update trl/trainer/utils.py

Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>

* style

* move comment

* add test for DataCollatorForChatML

* update comment with more details

* update assert reports and comments, and adds verification that the last token of input_ids should be EOS token

* new line at the end of file for code quality

* Update tests/test_utils.py

* Update tests/test_utils.py

* Update tests/test_utils.py

* update tests

* fix test

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

* formatting

* fix typo

* simplify

* Revert "simplify"

This reverts commit 7e4006c.

* tokenize full messages

* dont add eos

* eos is in the last token

* simplify DataCollatorForChatML

* Update tests/test_utils.py

Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>

---------

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: lewtun <lewis.c.tunstall@gmail.com>
Co-authored-by: Quentin Gallouédec <quentin.gallouedec@huggingface.co>
Labels
🐛 bug Something isn't working 🏋 GKD Related to GKD
Linked issue: Incorrect data processing in DataCollatorForChatML
5 participants