
Fix instruction token masking #1185

Merged

Conversation

mgerstgrasser
Contributor

The current code in DataCollatorForCompletionOnlyLM assumes that the first detected occurrence of instruction_template comes before the first detected occurrence of response_template. This is reasonable, since in current applications conversations are initiated by the user, not the assistant. However, it can fail if the first instruction is marked differently from all the other instructions, which can happen when a context-sensitive tokenizer such as Llama-2's tokenizes instruction_template differently at the start of a string than in the middle.

This PR fixes the issue by checking whether the first detected instruction comes after the first detected response. If so, we can assume we missed the first instruction, and we insert an additional entry into human_token_ids_idxs starting at index 0.
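The check described above can be sketched as follows. This is an illustrative simplification, not the actual TRL code: the function name and the plain-list representation of the index arrays are assumptions for the example.

```python
# Illustrative sketch of the fix, not the actual TRL implementation.
# human_idxs / response_idxs stand in for the start indices of the
# detected instruction and response templates (human_token_ids_idxs /
# response_token_ids_idxs in DataCollatorForCompletionOnlyLM).

def fix_instruction_indices(human_idxs, response_idxs):
    """If the first detected instruction comes after the first detected
    response, assume the instruction at position 0 was missed (e.g. a
    context-sensitive tokenizer encoded the template differently at the
    start of the sequence) and prepend index 0.

    The emptiness checks cover the edge case where one of the templates
    is not found at all, so element [0] would not exist.
    """
    if human_idxs and response_idxs and human_idxs[0] > response_idxs[0]:
        return [0] + human_idxs
    return human_idxs
```

With this guard in place, the masking loop that pairs instruction and response spans sees a consistent ordering even when the tokenizer swallowed the first instruction marker.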

Fixes #1184.

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.
(in case either of the templates isn't found at all, ...idxs[0] might not exist)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@younesbelkada younesbelkada left a comment


amazing work @mgerstgrasser !
Would you be happy to add a small test here: https://github.com/huggingface/trl/blob/main/tests/test_data_collator_completion_only.py, similar to the one you shared in the description of the linked issue?

@mgerstgrasser
Contributor Author

> amazing work @mgerstgrasser ! Would you be happy to add a small test here: https://github.com/huggingface/trl/blob/main/tests/test_data_collator_completion_only.py, similar to the one you shared in the description of the linked issue?

I'll have a look.

@younesbelkada
Contributor

Thank you very much @mgerstgrasser !

@mgerstgrasser
Contributor Author

@younesbelkada

I've added a test! :) It checks that the unmasked text is exactly as expected. The other tests in that file don't do that; it seems they just check that nothing crashes, but I don't really see how that would be helpful in this case, since there wasn't a crash to begin with. Let me know if this isn't what you had in mind, though.

Also, I had to slightly change the instruction template for the test, since the previous version cut the ### off the instruction template, which I don't think should happen. (I think even the \n should be part of the template in this case, but that doesn't matter for the test.) I don't know whether that would still work as intended with a Llama-2 tokenizer, so I've left a note in the code to double-check in case the test is ever switched to that tokenizer.
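The idea behind checking the unmasked text exactly can be sketched like this. This is a hypothetical illustration, not the actual test in test_data_collator_completion_only.py: the helper name and the toy label values are assumptions; only the convention that masked labels are set to -100 (the ignore index used by PyTorch's cross-entropy loss) comes from how completion-only collators typically work.

```python
# Hypothetical sketch of the test idea, not the actual TRL test.
# After collation, labels are -100 everywhere except the response
# tokens, so filtering out -100 should leave exactly the response ids,
# which can then be decoded and compared against the expected text.

IGNORE_INDEX = -100  # ignore index used by PyTorch's cross-entropy loss

def unmasked_label_ids(labels, ignore_index=IGNORE_INDEX):
    """Return only the label ids that contribute to the loss."""
    return [tok for tok in labels if tok != ignore_index]

# Toy example: ids 5-7 and 8-9 represent two assistant responses;
# everything else (instructions) is masked out.
labels = [-100, -100, 5, 6, 7, -100, -100, 8, 9]
assert unmasked_label_ids(labels) == [5, 6, 7, 8, 9]
```

Asserting on the exact surviving tokens (rather than just running the collator) is what catches a silently missed first instruction: the bug produced wrong masking without any crash.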

Contributor

@younesbelkada younesbelkada left a comment


Thanks a lot for the detailed explanation and fix!

@younesbelkada younesbelkada merged commit 4ae35af into huggingface:main Jan 9, 2024
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* Fix instruction token masking

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.

* Bugfix for edge case

(in case either of the templates isn't found at all, ...idxs[0] might not exist)

* Add test for instruction masking fix