
[ConstantLengthDataset] Fix packed dataset issue #452

Merged (7 commits into main) Jun 22, 2023

Conversation

younesbelkada (Contributor)

What does this PR do?

Fixes #450

As reported in that issue, there is an inconsistency between using packing=True and the number of iterations shown in the training logs. The inconsistency comes from the fact that in the packed case, sequences are packed together on the fly until max_seq_length is reached, which reduces the total number of expected samples.
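To make the mechanics concrete, here is a deliberately simplified sketch of what on-the-fly packing does (illustrative pseudocode, not the actual `ConstantLengthDataset` implementation):

```python
def pack_examples(tokenized_sequences, max_seq_length):
    """Concatenate token sequences and emit fixed-length packed samples."""
    buffer = []
    for tokens in tokenized_sequences:
        buffer.extend(tokens)
        while len(buffer) >= max_seq_length:
            # One packed sample may contain several original sequences.
            yield buffer[:max_seq_length]
            buffer = buffer[max_seq_length:]

# 100 sequences of 128 tokens pack into only 25 samples of length 512,
# which is why the logged iteration count comes out lower than expected.
sequences = [[0] * 128 for _ in range(100)]
assert len(list(pack_examples(sequences, max_seq_length=512))) == 25
```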

The fix is to force-set the infinite argument to True if a user explicitly decides to train a model with the max_steps strategy, properly warn them that the argument has been overridden, and warn them again whenever the dataset moves to the next iteration.

In the case of the epochs training strategy, we force-set the infinite argument to False, otherwise training would run forever.

Also added CI tests to make sure this behavior stays consistent in future commits, and added a line to the documentation.
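For reference, the override logic amounts to something like the following sketch (names and warning text are illustrative, not the exact diff in this PR):

```python
import warnings

def resolve_infinite(user_infinite: bool, max_steps: int) -> bool:
    if max_steps > 0:
        # max_steps strategy: keep the packed dataset infinite so training
        # does not stop early once the packed samples are exhausted.
        if not user_infinite:
            warnings.warn(
                "You passed `infinite=False` together with `max_steps`; "
                "overriding to `infinite=True` so the dataset cycles "
                "until `max_steps` is reached."
            )
        return True
    # Epoch strategy: the dataset must be finite, otherwise an epoch
    # never ends and training runs forever.
    return False
```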

cc @lvwerra @vwxyzjn

@HuggingFaceDocBuilderDev commented Jun 21, 2023

The documentation is not available anymore as the PR was closed or merged.

@@ -98,6 +98,8 @@ trainer = SFTTrainer(
trainer.train()
```

Note that if you use a packed dataset and if you pass `max_steps` in the training arguments you will probably train your models for more than a few epochs.
Member:
That's not really true, right? It could also be shorter - depends on the dataset and max_steps you choose :)
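For readers following along, a hedged usage sketch of the documented scenario (model name, dataset, and hyperparameters are placeholders based on the surrounding docs example):

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")

trainer = SFTTrainer(
    "facebook/opt-350m",
    train_dataset=dataset,
    dataset_text_field="text",
    packing=True,
    max_seq_length=512,
    args=TrainingArguments(output_dir="out", max_steps=500),
)
# Runs exactly 500 optimizer steps; depending on how many samples packing
# produces, that can be more or fewer passes over the data than expected.
trainer.train()
```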

trl/trainer/sft_trainer.py (two review threads, outdated, resolved)
trl/trainer/utils.py (review thread, outdated, resolved)
@@ -423,6 +423,65 @@ def test_data_collator_completion_lm(self):
result_text = self.tokenizer.decode(batch["input_ids"][0, last_pad_idx + 1 :])
self.assertTrue(result_text == "I have not been masked correctly.")

def test_sft_trainer_infinite_with_model(self):
Member:

What do you think about double-checking here, for both cases, whether we trained as many steps as we expect?
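A minimal sketch of what such a check could look like (the setup mirrors the docs example; model and sizes are placeholders, and `trainer.state.global_step` is the step counter maintained by `transformers.Trainer`):

```python
from transformers import TrainingArguments
from trl import SFTTrainer

def check_trained_steps(dataset, output_dir, max_steps=2):
    trainer = SFTTrainer(
        "facebook/opt-350m",  # placeholder model
        train_dataset=dataset,
        dataset_text_field="text",
        packing=True,
        max_seq_length=16,
        args=TrainingArguments(
            output_dir=output_dir,
            max_steps=max_steps,
            per_device_train_batch_size=2,
        ),
    )
    trainer.train()
    # With `infinite` force-set to True, training should stop at exactly
    # `max_steps` optimizer steps rather than when the packed data runs out.
    assert trainer.state.global_step == max_steps
```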

younesbelkada and others added 4 commits June 21, 2023 12:13
@younesbelkada younesbelkada requested a review from lvwerra June 21, 2023 10:17
@lvwerra lvwerra mentioned this pull request Jun 21, 2023
@younesbelkada younesbelkada merged commit 33f88ea into main Jun 22, 2023
@younesbelkada younesbelkada deleted the fix-packed-dataset branch June 22, 2023 08:12
Development

Successfully merging this pull request may close these issues: SFTTrainer finishes before max_steps (#450)
3 participants