[ConstantLengthDataset] Fix packed dataset issue #452

Conversation
docs/source/sft_trainer.mdx (outdated)

@@ -98,6 +98,8 @@ trainer = SFTTrainer(
trainer.train()
```

Note that if you use a packed dataset and if you pass `max_steps` in the training arguments you will probably train your models for more than few epochs.
That's not really true, right? It could also be shorter - depends on the dataset and max_steps you choose :)
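For context, here is a minimal sketch of the setup this doc note and review comment are about: a packed dataset trained with a step budget rather than an epoch count. The parameter names (`packing`, `dataset_text_field`, `max_seq_length`) follow the `SFTTrainer` API visible in the diff hunk above; the model id and dataset below are placeholders, not part of this PR.

```python
# Minimal sketch: packing=True combined with a max_steps budget. Whether 500
# steps covers more or fewer epochs than the raw sample count suggests depends
# on how many packed sequences the data actually yields (see the comment above).
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer

dataset = load_dataset("imdb", split="train")  # placeholder dataset

training_args = TrainingArguments(
    output_dir="./sft-packed",
    per_device_train_batch_size=4,
    max_steps=500,  # step-based budget instead of num_train_epochs
)

trainer = SFTTrainer(
    model="facebook/opt-350m",   # placeholder model id
    args=training_args,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=512,
    packing=True,                # sequences are concatenated on the fly up to max_seq_length
)
trainer.train()
```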
@@ -423,6 +423,65 @@ def test_data_collator_completion_lm(self):
        result_text = self.tokenizer.decode(batch["input_ids"][0, last_pad_idx + 1 :])
        self.assertTrue(result_text == "I have not been masked correctly.")

    def test_sft_trainer_infinite_with_model(self):
What do you think about double-checking here, for both cases, that we trained as many steps as we expect?
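Roughly, the suggested double-check could look like the sketch below. `trainer.state.global_step` and `trainer.state.epoch` come from transformers' `TrainerState`; the `_build_packed_trainer` helper and the concrete numbers are hypothetical test scaffolding, not code from this PR.

```python
# Hedged sketch of the reviewer's suggestion: assert the run stopped exactly
# where we expect under both training strategies.
def test_packed_dataset_stops_at_max_steps(self):
    expected_steps = 5
    trainer = self._build_packed_trainer(max_steps=expected_steps)  # hypothetical helper
    trainer.train()
    # infinite=True: the packed dataset cycles, so the step budget is the stop condition.
    self.assertEqual(trainer.state.global_step, expected_steps)

def test_packed_dataset_stops_after_epochs(self):
    trainer = self._build_packed_trainer(num_train_epochs=1)  # hypothetical helper
    trainer.train()
    # infinite=False: training must terminate once the packed data is exhausted.
    self.assertGreater(trainer.state.global_step, 0)
    self.assertEqual(int(trainer.state.epoch), 1)
```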
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
…nto fix-packed-dataset
What does this PR do?
Fixes #450
As reported in that issue, there is an inconsistency between using `packing=True` and the number of iterations shown in the training logs. It comes from the fact that in the packed case, sequences are packed together on the fly until `max_seq_length` is reached, which reduces the total number of expected samples.

The fix is to force-set the `infinite` argument to `True` if a user explicitly decides to train a model with the `max_steps` strategy, while properly warning them that the argument has been overridden and warning them again whenever the dataset wraps around to the next iteration. In the case of the epoch-based training strategy, we should force-set the `infinite` argument to `False`, otherwise the training will run forever.
Also added CI tests to make sure that this behaviour stays consistent in future commits, and added a line to the documentation.
cc @lvwerra @vwxyzjn