Handle invalid training_step losses in DDP #5359
Conversation
Hello @tchaton! Thanks for updating this PR.
Comment last updated at 2021-01-11 15:00:28 UTC
Codecov Report
@@            Coverage Diff            @@
##           master    #5359    +/-   ##
========================================
- Coverage      93%      93%     -0%
========================================
  Files         134      134
  Lines        9976    10018    +42
========================================
+ Hits         9294     9326    +32
- Misses        682      692    +10
(Outdated, resolved review thread on pytorch_lightning/core: callbacks = ControlFlow(callbacks=…))
@@ -297,6 +297,11 @@ def distributed_sampler_kwargs(self):
        return kwargs

    def print_colored_rank(self, msg):
Is this for debugging the draft, or do you want to merge it? If so, I'd do it separately.
Yes, I will remove it when the PR is finished and submit a better version in another PR.
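The body of print_colored_rank is not shown in the hunk above. A rough sketch of what such a DDP debugging helper might look like; the rank lookup and the ANSI colors are assumptions, not the PR's implementation:

import os


def print_colored_rank(self, msg):
    # Hypothetical sketch: prefix the message with this process's rank and wrap it
    # in a per-rank ANSI color so interleaved DDP logs are easier to read.
    rank = int(getattr(self, "global_rank", os.environ.get("LOCAL_RANK", 0)))
    colors = ("\033[91m", "\033[92m", "\033[93m", "\033[94m")  # red, green, yellow, blue
    print(f"{colors[rank % len(colors)]}[rank {rank}] {msg}\033[0m")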
"decision_on_invalid_result ``never_skip`` doesn't support returning None for training_step. " | ||
"Hint: Either switch to ``skip_on_at_least_one`` or return a Nan or Inf loss directly" |
I would do this for the user instead of raising an Exception
I would prefer to leave a MisconfigurationException here. This would be triggered only when people change the property invalid_loss_strategy to return never_skip. I think they should be aware of what is happening and change their code accordingly. I don't think Lightning should choose between the 2 modes.
I'm talking about "return a Nan or Inf loss directly". It might be cumbersome to have to do this in training_step:

if loss is None:
    return torch.tensor(float('nan'))
I am OK with keeping skip_if_any as the default, but when users change the strategy to never_skip, they are already choosing what they want.
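For reference, a minimal sketch of the check being debated, using a hypothetical helper and reading the strategy from a LightningModule property; this is an illustration of the proposal, not the PR's actual code:

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def check_training_step_output(model, output):
    # Hypothetical validation: a None output is only rejected when the user has
    # opted into "never_skip", mirroring the error message quoted above.
    strategy = getattr(model, "invalid_loss_strategy", "skip_if_any")
    if output is None and strategy == "never_skip":
        raise MisconfigurationException(
            "decision_on_invalid_result ``never_skip`` doesn't support returning None "
            "for training_step. Hint: Either switch to ``skip_on_at_least_one`` or "
            "return a Nan or Inf loss directly"
        )
    return output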
@tchaton is there anything left TODO here or can you close?
We need to address all the TODOs mentioned at the top. So many things are left.
@tchaton any update here?
This PR seems quite old, several hundred commits behind master; consider finishing it or closing it, as the conflicts will most likely make it challenging to finish... 🐰
We can close this for now and use a new PR later.
Hi, is this issue being tracked elsewhere?
What does this PR do?
Fixes #4956, #5243 (DDP)
Fixes #4524 (AMP)

This PR allows choosing how to skip a step when training_step returns NaN, None, or inf in DDP. We introduce a LightningModule attribute to choose the strategy (via a str/Enum); a sketch of the proposed usage follows.
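A minimal sketch of how the proposed attribute might be used. The property name invalid_loss_strategy and the strategy names skip_if_any / skip_on_at_least_one / never_skip are taken from the review discussion above; the exact API was still in flux, so treat this as an illustration rather than the final interface.

import torch
import pytorch_lightning as pl


class MyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    @property
    def invalid_loss_strategy(self) -> str:
        # Strategies discussed in this PR: "skip_if_any" (suggested default),
        # "skip_on_at_least_one", "never_skip".
        return "skip_if_any"

    def training_step(self, batch, batch_idx):
        x, y = batch
        if not torch.isfinite(x).all():
            # Instead of returning None for a bad batch, return a NaN loss so the
            # chosen strategy can coordinate skipping the step across DDP ranks.
            return torch.tensor(float("nan"), device=self.device)
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)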
TODO:
- self.log(filter_infinite=True)
- _register_comm_hook
- +/-Inf (can't use torch.isfinite, requires torch 1.6+; a version-agnostic check is sketched at the end of this description)
- manual_optimization setting

Before submitting
PR review
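On the +/-Inf item in the TODO list above: if torch.isfinite cannot be relied on for the targeted torch versions, the check can be expressed with plain comparisons instead. A sketch under that assumption, with a hypothetical helper name that is not part of the PR:

import torch


def is_invalid_loss(loss: torch.Tensor) -> bool:
    # NaN is the only value not equal to itself; +/-Inf has infinite magnitude.
    # This avoids torch.isfinite and works on older torch releases.
    has_nan = bool((loss != loss).any())
    has_inf = bool((loss.abs() == float("inf")).any())
    return has_nan or has_inf

A rank that detects an invalid loss this way could then broadcast a flag so that every DDP process skips the optimizer step together, which is the coordination problem this PR is about.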