-
Notifications
You must be signed in to change notification settings - Fork 27.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🚨🚨🚨 Replace DataLoader logic for Accelerate in Trainer, remove unneeded tests 🚨🚨🚨 #24028
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a ton, LGTM!
for _ in train_dataloader: | ||
break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this still necessary with the RNG reloading?
Edit: ah looks like we are not doing that yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep not yet, so for now it's needed :)
The documentation is not available anymore as the PR was closed or merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @muellerzr for all the cool work on using Accelerate for preparing dataloaders, using accelerator.gather_for_metrics
and accelerator.pad_across_processes
for further simplification. 🚀
Left a couple comments regarding few training arguments being ignored.
seed = self.args.data_seed | ||
generator.manual_seed(seed) | ||
|
||
seed = self.args.data_seed if self.args.data_seed is not None else self.args.seed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the user specifies self.args.data_seed
or self.args.seed
, it is ignored. Would this hinder reproducible experimentation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests haven't shown it to, because (from what Sylvain and I were able to find), our dataloaders don't play with the same random seed setting that these do/applying generators. Keeping this in as a generator also causes tests to fail, showing that it's detrimental in that sense now with our new DataLoader setup.
Also it's used when __iter__
through the dataset here: https://github.com/huggingface/transformers/blob/main/src/transformers/trainer_pt_utils.py#L665-L667. Because we don't use a torch DistributedSampler
, I don't believe it applies to us here.
If later down the road someone points to a reproducible issue to this, I'll review it again but in my deep dive last week I didn't find an issue.
@@ -3205,72 +3083,42 @@ def evaluation_loop( | |||
|
|||
# Update containers on host | |||
if loss is not None: | |||
losses = self._nested_gather(loss.repeat(batch_size)) | |||
losses_host = losses if losses_host is None else torch.cat((losses_host, losses), dim=0) | |||
losses = self.accelerator.gather_for_metrics((loss.repeat(batch_size))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your responses to the comments and for iterating on the PR. Super cool!
…ed tests 🚨🚨🚨 (huggingface#24028) * Working integration * Fix failing test * Revert label host logic * Bring it back!
Great PR, currently this is breaking my custom collate_fn in the dataloader, still trying to understand why that is. First assumption might be due to multiprocessing? |
@franz101 please open an issue with a reproducer of what you are trying to do so we can help :) |
What does this PR do?
This PR:
DataLoader
in all basic distributed fashions (replacingpl.Loader
for TPU coming in a follow-up PR) to replace it withaccelerator.prepare
tests/trainer/test_trainer.py::TrainerIntegrationTest::test_sampler_seed
, deemed to no longer be necessary to reset the seed, as Accelerate's dataloader setup doesn't need any extra help when iterating/loading back in the seed, regardless of the torch versiontests/trainer/test_trainer.py::TrainerIntegrationTest::test_training_finite_iterable_dataset
, as with Accelerate's new sampler forIterableDataset
we'll actually catch if it'sNone
and raise an error, a new test will be made + clear error message on theAccelerate
side, with a test added toTrainer
afterwards.DataLoaders
all havetotal_batch_size
rather thanbatch_size
test_train_and_eval_dataloaders
andtest_data_is_not_parallelized_when_model_is_parallel
Fixes # (issue)
Before submitting
Pull Request section?
to it if that's the case.
documentation guidelines, and
here are tips on formatting docstrings.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@sgugger @pacman100