Fix instruction token masking #1185
Conversation
Fix instruction token masking if the first instruction is tokenized differently from the others, or, more generally, if no instruction is detected before the first response.
Also handles an edge case: in case either of the templates isn't found at all, ...idxs[0] might not exist.
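For context, a rough illustration of the kind of context-sensitive tokenization that can trigger this. The model and template string below are placeholders rather than the exact ones from the linked issue (the Llama-2 checkpoint is gated, so any SentencePiece-style tokenizer you have access to shows the same effect):

```python
from transformers import AutoTokenizer

# Placeholder model and template; any context-sensitive (SentencePiece-style)
# tokenizer exhibits the same behaviour.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
template = "### Human:"

# The same template string can encode to different token ids depending on
# whether it sits at the start of the text or in the middle of it.
print(tok.encode(template, add_special_tokens=False))
print(tok.encode("\n" + template, add_special_tokens=False))

# If the collator searches for only one of these encodings, it can miss the
# very first instruction, which then never gets masked out of the labels.
```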
Amazing work @mgerstgrasser!
Would you be happy to add a small test here: https://github.com/huggingface/trl/blob/main/tests/test_data_collator_completion_only.py, similar to the one you shared in the description of the linked issue?
I'll have a look.
Thank you very much @mgerstgrasser!
I've added a test! :) It checks that the unmasked text is exactly as expected. The other tests in there don't do that; it seems they just check that nothing crashes, but I don't really see how that would be helpful in this case, since there wasn't any crash to begin with. Let me know in case this isn't what you had in mind, though. Also, I had to slightly change the instruction template for the test, as the way it was done was cutting off the
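For reference, a rough sketch of what such a test could look like. This is not the exact test added in the PR; the tokenizer, the template strings, and the example conversation are assumptions made for illustration:

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Assumed templates and conversation, just for illustration.
instruction_template = "### Human:"
response_template = "### Assistant:"
collator = DataCollatorForCompletionOnlyLM(
    response_template,
    instruction_template=instruction_template,
    tokenizer=tokenizer,
    mlm=False,
)

text = (
    "### Human: Hello! How are you?"
    "### Assistant: I'm doing well, thanks."
    "### Human: Great, goodbye."
    "### Assistant: Goodbye!"
)
batch = collator([tokenizer(text)])
labels = batch["labels"][0]

# Only the assistant responses should survive masking (everything else is -100),
# so decoding the unmasked positions should give back just the completions.
unmasked_text = tokenizer.decode(labels[labels != -100])
print(unmasked_text)
```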
Thanks a lot for the detailed explanation and fix!
* Fix instruction token masking: fix cases where the first instruction is tokenized differently from the others, or, in general, where no instruction is detected before the first response.
* Bugfix for edge case (in case either of the templates isn't found at all, ...idxs[0] might not exist).
* Add test for instruction masking fix.
The current code in `DataCollatorForCompletionOnlyLM` assumes that the first detected occurrence of `instruction_template` comes before the first detected occurrence of `response_template`. This is reasonable, since in current applications conversations are initiated by the user, not the assistant. However, it can fail if the first instruction is tokenized differently from all the other instructions, which can happen if a context-sensitive tokenizer such as Llama-2's tokenizes the `instruction_template` differently at the start of a string than in the middle.

This PR fixes this by checking whether the first detected instruction comes after the first detected response. If that is the case, we can assume we missed the first instruction, and we insert a new `human_token_ids_idxs` entry starting at 0.

Fixes #1184.
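A minimal sketch of that check, assuming the collator has already collected the start indices of every detected response and instruction occurrence into `response_token_ids_idxs` and `human_token_ids_idxs` (the variable names follow the existing collator code, but the snippet is illustrative rather than the exact diff):

```python
# Inside the collator's label-masking loop, after scanning one example:
# response_token_ids_idxs / human_token_ids_idxs hold the start indices of
# every detected response / instruction template occurrence, in order.
if (
    len(human_token_ids_idxs) > 0
    and len(response_token_ids_idxs) > 0
    and human_token_ids_idxs[0] > response_token_ids_idxs[0]
):
    # The first detected instruction comes after the first detected response,
    # so the very first instruction was missed (e.g. tokenized differently at
    # the start of the sequence). Treat position 0 as an instruction boundary
    # so the leading prompt is masked as well.
    human_token_ids_idxs = [0] + human_token_ids_idxs
```

The explicit length checks also cover the edge case mentioned above, where one of the templates isn't found at all and the corresponding ...idxs[0] would not exist.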