
Fix instruction token masking #1185

Merged

Conversation

mgerstgrasser
Contributor

The current code in DataCollatorForCompletionOnlyLM assumes that the first detected occurrence of instruction_template comes before the first detected occurrence of response_template. This is reasonable, since in current applications conversations are initiated by the user, not the assistant. However, it can fail if the first instruction is marked differently from all the other instructions, which can happen when a context-sensitive tokenizer such as Llama-2's tokenizes instruction_template differently at the start of a string than in the middle.

This PR fixes the issue by checking whether the first detected instruction comes after the first detected response. If so, we can assume we missed the first instruction, and we insert an additional entry into human_token_ids_idxs starting at index 0.
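The check described above can be sketched as follows. This is an illustrative simplification, not the actual TRL code: the function name and the plain-list representation of the index arrays are assumptions for the example.

```python
# Illustrative sketch of the fix, not the actual TRL implementation.
# human_idxs / response_idxs stand in for the start indices of the
# detected instruction and response templates (human_token_ids_idxs /
# response_token_ids_idxs in DataCollatorForCompletionOnlyLM).

def fix_instruction_indices(human_idxs, response_idxs):
    """If the first detected instruction comes after the first detected
    response, assume the instruction at position 0 was missed (e.g. a
    context-sensitive tokenizer encoded the template differently at the
    start of the sequence) and prepend index 0.

    The emptiness checks cover the edge case where one of the templates
    is not found at all, so element [0] would not exist.
    """
    if human_idxs and response_idxs and human_idxs[0] > response_idxs[0]:
        return [0] + human_idxs
    return human_idxs
```

With this guard in place, the masking loop that pairs instruction and response spans sees a consistent ordering even when the tokenizer swallowed the first instruction marker.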

Fixes #1184.

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.
(in case either of the templates isn't found at all, ...idxs[0] might not exist)
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Contributor

@younesbelkada younesbelkada left a comment


amazing work @mgerstgrasser !
Would you be happy to add a small test here: https://github.com/huggingface/trl/blob/main/tests/test_data_collator_completion_only.py, similar to the one you shared in the description of the linked issue?

@mgerstgrasser
Contributor Author

> amazing work @mgerstgrasser ! Would you be happy to add a small test here: https://github.com/huggingface/trl/blob/main/tests/test_data_collator_completion_only.py, similar to the one you shared in the description of the linked issue?

I'll have a look.

@younesbelkada
Contributor

Thank you very much @mgerstgrasser !

@mgerstgrasser
Contributor Author

@younesbelkada

I've added a test! :) It checks that the unmasked text is exactly as expected. The other tests in that file don't do that; it seems they just check that nothing crashes, but I don't really see how that would be helpful in this case, since there wasn't a crash to begin with. Let me know if this isn't what you had in mind, though.

Also, I had to slightly change the instruction template for the test, since the previous version cut the ### off the instruction template, which I don't think should happen. (I think even the \n should be part of the template in this case, but that doesn't matter for the test.) I don't know whether that would still work as intended with a Llama-2 tokenizer, so I've left a note in the code to double-check in case the test is ever switched to that tokenizer.
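The idea behind checking the unmasked text exactly can be sketched like this. This is a hypothetical illustration, not the actual test in test_data_collator_completion_only.py: the helper name and the toy label values are assumptions; only the convention that masked labels are set to -100 (the ignore index used by PyTorch's cross-entropy loss) comes from how completion-only collators typically work.

```python
# Hypothetical sketch of the test idea, not the actual TRL test.
# After collation, labels are -100 everywhere except the response
# tokens, so filtering out -100 should leave exactly the response ids,
# which can then be decoded and compared against the expected text.

IGNORE_INDEX = -100  # ignore index used by PyTorch's cross-entropy loss

def unmasked_label_ids(labels, ignore_index=IGNORE_INDEX):
    """Return only the label ids that contribute to the loss."""
    return [tok for tok in labels if tok != ignore_index]

# Toy example: ids 5-7 and 8-9 represent two assistant responses;
# everything else (instructions) is masked out.
labels = [-100, -100, 5, 6, 7, -100, -100, 8, 9]
assert unmasked_label_ids(labels) == [5, 6, 7, 8, 9]
```

Asserting on the exact surviving tokens (rather than just running the collator) is what catches a silently missed first instruction: the bug produced wrong masking without any crash.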

Contributor

@younesbelkada younesbelkada left a comment


Thanks a lot for the detailed explanation and fix!

@younesbelkada younesbelkada merged commit 4ae35af into huggingface:main Jan 9, 2024
9 checks passed
lapp0 pushed a commit to lapp0/trl that referenced this pull request May 10, 2024
* Fix instruction token masking

Fix instruction token masking if the first instruction is tokenized differently than the others, or in general if no instruction is detected before the first response.

* Bugfix for edge case

(in case either of the templates isn't found at all, ...idxs[0] might not exist)

* Add test for instruction masking fix