[ConstantLengthDataset] Fix packed dataset issue #452

Conversation
docs/source/sft_trainer.mdx (outdated)

@@ -98,6 +98,8 @@ trainer = SFTTrainer(
trainer.train()
```

Note that if you use a packed dataset and if you pass `max_steps` in the training arguments you will probably train your models for more than few epochs.
That's not really true, right? It could also be shorter - depends on the dataset and max_steps you choose :)
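For context, here is a minimal sketch of the setup this doc note and review comment are about: a packed dataset trained with a step budget rather than an epoch count. The parameter names (`packing`, `dataset_text_field`, `max_seq_length`) follow the `SFTTrainer` API visible in the diff hunk above; the model id and dataset below are placeholders, not part of this PR.

```python
# Minimal sketch: packing=True combined with a max_steps budget. Whether 500
# steps covers more or fewer epochs than the raw sample count suggests depends
# on how many packed sequences the data actually yields (see the comment above).
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset

training_args = TrainingArguments(
    output_dir="./sft-packed",
    per_device_train_batch_size=4,
    max_steps=500,  # step-based budget instead of num_train_epochs
)

trainer = SFTTrainer(
    model="facebook/opt-350m",   # placeholder model id
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,                # sequences are concatenated on the fly up to max_seq_length
)
trainer.train()
```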
@@ -423,6 +423,65 @@ def test_data_collator_completion_lm(self):
        result_text = self.tokenizer.decode(batch["input_ids"][0, last_pad_idx + 1 :])
        self.assertTrue(result_text == "I have not been masked correctly.")

    def test_sft_trainer_infinite_with_model(self):
What do you think about double-checking here, for both cases, that we trained as many steps as we expect?
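Roughly, the suggested double-check could look like the sketch below. `trainer.state.global_step` and `trainer.state.epoch` come from transformers' `TrainerState`; the `_build_packed_trainer` helper and the concrete numbers are hypothetical test scaffolding, not code from this PR.

```python
# Hedged sketch of the reviewer's suggestion: assert the run stopped exactly
# where we expect under both training strategies.
def test_packed_dataset_stops_at_max_steps(self):
    expected_steps = 5
    trainer = self._build_packed_trainer(max_steps=expected_steps)  # hypothetical helper
    trainer.train()
    # infinite=True: the packed dataset cycles, so the step budget is the stop condition.
    self.assertEqual(trainer.state.global_step, expected_steps)

def test_packed_dataset_stops_after_epochs(self):
    trainer = self._build_packed_trainer(num_train_epochs=1)  # hypothetical helper
    trainer.train()
    # infinite=False: training must terminate once the packed data is exhausted.
    self.assertGreater(trainer.state.global_step, 0)
    self.assertEqual(int(trainer.state.epoch), 1)
```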
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
…nto fix-packed-dataset
What does this PR do?
Fixes #450
As reported in that issue, there is an inconsistency between using `packing=True` and the number of iterations shown in the training logs. It comes from the fact that in the packed case, sequences are packed together on the fly until `max_seq_length` is reached, which reduces the total number of expected samples.

The fix is to force-set the `infinite` argument to `True` if a user explicitly decides to train a model with the `max_steps` strategy, while properly warning them that the argument has been overridden and warning them again whenever the dataset wraps around to the next iteration. In the case of the epoch-based training strategy, we should force-set the `infinite` argument to `False`, otherwise the training will run forever.
Also added CI tests to make sure that this behaviour stays consistent in future commits, and added a line to the documentation.
cc @lvwerra @vwxyzjn