Raise an error when lightning replaces an existing sampler #2020
Conversation
Currently, Trainer replaces the existing sampler with DistributedSampler when running distributed training and `replace_sampler_ddp=True` (the default behaviour). If a user has configured their own sampler, this can lead to wildly different results between distributed and non-distributed training. This PR fixes this by raising an error if the user has configured a sampler and uses `replace_sampler_ddp=True`. The recommended behaviour from now on is to either remove the sampler or set `replace_sampler_ddp=False`.
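For illustration, a minimal sketch of the two recommended options, assuming a Trainer API of roughly this era (`gpus`, `distributed_backend`, `replace_sampler_ddp`); the toy dataset and sampler are made up and not from this PR:

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler, TensorDataset
import pytorch_lightning as pl

# Toy data; any user-configured sampler triggers the same situation.
dataset = TensorDataset(torch.randn(100, 16), torch.randint(0, 2, (100,)))
sampler = SubsetRandomSampler(list(range(80)))  # user-configured sampler
train_loader = DataLoader(dataset, batch_size=32, sampler=sampler)

# Option 1: remove `sampler=...` above and let Lightning insert DistributedSampler.
# Option 2 (shown): keep the sampler and opt out of automatic replacement.
trainer = pl.Trainer(gpus=2, distributed_backend="ddp", replace_sampler_ddp=False)
# trainer.fit(model, train_loader)  # `model` would be the user's LightningModule
```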
@Borda I didn't realize that
Hello @devashishshankar! Thanks for updating this PR. There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2020-06-01 20:47:58 UTC
Codecov Report

| Coverage Diff | master | #2020 | +/- |
|---|---|---|---|
| Coverage | 88% | 88% | |
| Files | 74 | 74 | |
| Lines | 4654 | 4658 | +4 |
| Hits | 4076 | 4080 | +4 |
| Misses | 578 | 578 | |
```diff
@@ -113,39 +113,48 @@ def auto_add_sampler(self, dataloader: DataLoader, train: bool) -> DataLoader:
    need_dist_sampler = (self.use_ddp or self.use_ddp2 or self.use_horovod or self.use_tpu)

    if self.replace_sampler_ddp and need_dist_sampler:
        if not isinstance(dataloader.sampler, (SequentialSampler, RandomSampler)):
```
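For context, a minimal sketch of the kind of check this hunk introduces, under the assumption that only the DataLoader's default `SequentialSampler`/`RandomSampler` are safe to swap out; the helper name `_check_sampler` and the exact message are hypothetical, not the PR's final implementation:

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _check_sampler(dataloader: DataLoader) -> None:
    """Refuse to silently replace a sampler the user configured themselves."""
    # SequentialSampler / RandomSampler are what DataLoader installs by default,
    # so anything else was set deliberately by the user.
    if not isinstance(dataloader.sampler, (SequentialSampler, RandomSampler)):
        raise MisconfigurationException(
            "You seem to have configured a sampler in your DataLoader. This will be "
            "replaced by DistributedSampler since replace_sampler_ddp is True. Either "
            "remove the sampler from your DataLoader or set replace_sampler_ddp=False."
        )
```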
This way you also have to include `SubsetRandomSampler` and `WeightedRandomSampler`, I guess.
Hmm, shouldn't a warning really be raised if one of these is getting replaced by DistributedSampler? The original motivation of this issue was that I had a WeightedRandomSampler, and Lightning replaced it, leading to a drop in validation accuracy. It took me a long time to debug this, hence the idea that it may be a good idea to raise an error when an existing sampler is replaced.
Also, if someone uses one of these samplers, won't they get different results when using distributed vs non-distributed training?
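For illustration, a small sketch of the kind of setup described above (toy imbalanced data, sample weights from inverse class frequency); if Lightning silently swaps this sampler for DistributedSampler, the class balancing disappears:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy imbalanced dataset: 90% class 0, 10% class 1.
targets = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])
dataset = TensorDataset(torch.randn(100, 16), targets)

# Weight each sample by the inverse frequency of its class so batches are balanced.
class_counts = torch.bincount(targets).float()
sample_weights = (1.0 / class_counts)[targets]
sampler = WeightedRandomSampler(sample_weights, num_samples=len(dataset), replacement=True)

loader = DataLoader(dataset, batch_size=16, sampler=sampler)
# Replacing `sampler` with DistributedSampler drops the weighting entirely,
# which is the silent behaviour change the comment above describes.
```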
Maybe you're right about that :)
LGTM 🚀
* Raise an error when lightning replaces an existing sampler

  Currently, Trainer replaces the existing sampler with DistributedSampler if running distributed training and `replace_sampler_ddp=True` (default behaviour). If a user has configured an existing sampler, this would lead to wildly different results if running distributed vs non-distributed training. This PR fixes this by raising an Error if the user has configured a sampler and uses `replace_sampler_ddp=True`. The recommended behavior from now on is to either remove the sampler or set `replace_sampler_ddp=False`.
* Fix tests
* Simpler fix
* Fix tests
* Make inner method protected
* Apply suggestions from code review

Co-authored-by: Jirka Borovec <Borda@users.noreply.github.com>
Before submitting
What does this PR do?
Fixes #1995
Currently, Trainer replaces the existing sampler with DistributedSampler if running distributed training and `replace_sampler_ddp=True` (default behaviour). If a user has configured an existing sampler, this would lead to wildly different results when running distributed vs non-distributed training.
This PR fixes this by raising an error if the user has configured a sampler and uses `replace_sampler_ddp=True`. The recommended behaviour from now on is to either remove the sampler or set `replace_sampler_ddp=False`.
PR review
Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.
Did you have fun?
Make sure you had fun coding 🙃