Handle potentially long sequences with DataCollatorForCompletionOnlyLM #644

tannonk · 2023-08-14T14:44:50Z

This PR resolves #643 and provides a patch in DataCollatorForCompletionOnlyLM for handling long sequences which may or may not contain a valid response_template or a valid instruction_template.

For problematic instances where no response_template or instruction_template is found, we set all labels to the ignore_index. These instances are then ignored in the loss computation, and training can continue.
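To make the behaviour described above concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names `find_subsequence` and `mask_prompt_tokens` are hypothetical, token IDs are plain lists rather than tensors, and only the response_template case is covered. It assumes the usual ignore value of -100, which PyTorch's CrossEntropyLoss skips by default.

```python
import warnings

IGNORE_INDEX = -100  # assumed ignore value; CrossEntropyLoss skips it by default


def find_subsequence(haystack, needle):
    """Return the start index of the first occurrence of needle, or None."""
    n = len(needle)
    if n == 0:
        return None
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return None


def mask_prompt_tokens(input_ids, response_template_ids, ignore_index=IGNORE_INDEX):
    """Keep labels only for the tokens that follow the response template.

    If the template is missing (for example because the sequence was
    truncated), every label is set to ignore_index, so the example adds
    nothing to the loss and training can continue.
    """
    start = find_subsequence(input_ids, response_template_ids)
    if start is None:
        warnings.warn(
            "Could not find the response template; this instance will be "
            "ignored in the loss computation."
        )
        return [ignore_index] * len(input_ids)

    # Mask the prompt and the template itself; keep the completion tokens.
    response_start = start + len(response_template_ids)
    return [ignore_index] * response_start + list(input_ids[response_start:])
```

With a template of `[2, 3]`, the sequence `[1, 2, 3, 4, 5]` keeps labels only for `4` and `5`, while a truncated sequence such as `[1, 2]` is masked entirely and triggers the warning.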

lvwerra

Looks good to me, thanks for fixing! Just a small nit then we can merge.

tests/test_data_collator_completion_only.py

HuggingFaceDocBuilderDev · 2023-08-15T08:36:36Z

The documentation is not available anymore as the PR was closed or merged.

lvwerra · 2023-08-15T08:39:41Z

Also could you run the code quality tests with make precommit? That should fix the CI.

tannonk · 2023-08-15T11:12:58Z

Hey @lvwerra, thanks for the review. The requested changes have been made and make precommit passes on my end.

lvwerra · 2023-08-15T12:27:28Z

Looks like the tests are not passing yet.

younesbelkada

Hi @tannonk
Thanks a lot for the PR!
In my opinion, raising a RuntimeError as we currently do really helps users understand whether things worked correctly or not, and I am afraid a warning is not a strong enough indicator. Maybe it is better to keep the error, leave the previous behaviour untouched, and add a flag allow_ignore_not_matched for advanced users. What do you think? cc @lvwerra - if that's not a good idea, I am happy to merge the PR as it is as well.
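A rough sketch of what the suggested opt-in flag could look like follows. This alternative was only proposed in the comment above, so everything here is an assumption: the function `build_labels` and its signature are hypothetical, and `response_start` stands for the index of the first completion token, or `None` when no template was matched.

```python
import warnings


def build_labels(input_ids, response_start, allow_ignore_not_matched=False,
                 ignore_index=-100):
    """Mask prompt tokens; the flag decides what happens when nothing matched.

    Hypothetical helper: response_start is None when no template was found.
    """
    if response_start is None:
        if not allow_ignore_not_matched:
            # Strict default: fail loudly so the user notices the problem.
            raise RuntimeError("Could not find the response template.")
        # Opt-in lenient mode: warn and drop the example from the loss.
        warnings.warn("Response template not found; ignoring this instance.")
        return [ignore_index] * len(input_ids)
    return [ignore_index] * response_start + list(input_ids[response_start:])
```

The trade-off is the one raised in this thread: the strict default surfaces a misconfigured template immediately, but it can also abort a long training run on a single over-long sequence.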

younesbelkada

Discussed with @lvwerra: indeed, if a sequence is too long the error would completely break training, and this could happen in the middle of a run, which would be too odd. Let's merge it! Thanks again @tannonk for your work on this!

huggingface#644)
* avoid RuntimeError on long sequences
* add unittests and format
* remove dependency on external repo
* bug fix in DataCollatorForCompletionOnlyLM

tannonk and others added 2 commits August 14, 2023 15:47

avoid RuntimeError on long sequences

b9c047f

add unittests and format

fa20a4f

lvwerra approved these changes Aug 15, 2023

tests/test_data_collator_completion_only.py (outdated, resolved)

remove dependency on external repo

639d3de

bug fix in DataCollatorForCompletionOnlyLM

61548b4

younesbelkada reviewed Aug 17, 2023

younesbelkada approved these changes Aug 18, 2023

younesbelkada merged commit 029f961 into huggingface:main Aug 18, 2023
