
Handle invalid training_step losses in DDP #5359

Closed
wants to merge 39 commits

Conversation

tchaton (Contributor) commented Jan 5, 2021

What does this PR do?

Fixes #4956, #5243 (DDP)
Fixes #4524 (AMP)

This PR allows choosing how to skip a step when training_step returns NaN, None, or inf in DDP.
We introduce a LightningModule attribute to choose the strategy (via a str/Enum).

This requires torch>=1.7 to work. We chose not to support earlier versions because the amount of work needed to support them is probably not worth it, given the number of users interested in this feature.
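As a rough illustration only (not the final API: the attribute name `invalid_loss_strategy` and the accepted values are taken from the current draft and may still change), opting into a strategy would look roughly like this:

```python
import torch
from torch.nn import functional as F
import pytorch_lightning as pl


class MyModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    @property
    def invalid_loss_strategy(self) -> str:
        # "skip_if_any" (default): skip the optimizer update on every process
        # if any process produced an invalid loss.
        # "never_skip": keep the valid gradients and reduce them manually across processes.
        return "never_skip"

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)  # may come out as NaN/Inf on bad batches
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```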

Strategy 1: "skip_if_any"
process1 -> invalid loss (None or NaN/inf tensor)
process2 -> tensor(0.7)

If any process produces an invalid loss, the optimizer update is skipped on all processes (this is the default).
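Conceptually (this is only a sketch of the decision, not the exact hook used in the PR), each process contributes a flag and the step is skipped everywhere as soon as one flag is set:

```python
import torch
import torch.distributed as dist


def should_skip_step(loss) -> bool:
    """Hypothetical helper: True on every rank if any rank produced an invalid loss."""
    invalid = loss is None or not bool(torch.isfinite(loss).all())
    flag = torch.tensor(float(invalid))
    if dist.is_available() and dist.is_initialized():
        if torch.cuda.is_available():
            flag = flag.cuda()  # NCCL requires the tensor to live on the GPU
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # becomes 1.0 if any rank flagged its loss
    return bool(flag.item())
```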
Strategy 2: "never_skip"
process1 -> tensor(...) -> grad_1.1
process2 -> tensor(...) -> grad_2.1

process1 -> invalid loss
process2 -> tensor(...) -> grad_2.2

Decision: run the backward pass without sync on process 2, then all_reduce the gradients afterwards: (grad_1.1 + grad_2.1 + grad_2.2) / 3
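A simplified sketch of that reduction, assuming the valid gradients were accumulated under DDP's `no_sync()` and that the process group is initialized (the PR itself hooks into DDP via `_register_comm_hook`, see the TODO list below, which is why torch>=1.7 is required):

```python
import torch
import torch.distributed as dist


def reduce_valid_grads(module: torch.nn.Module, num_valid_local: int) -> None:
    """Hypothetical "never_skip" reduction: weight by valid losses instead of world size."""
    device = next(module.parameters()).device
    num_valid = torch.tensor(float(num_valid_local), device=device)
    dist.all_reduce(num_valid, op=dist.ReduceOp.SUM)  # 3 in the example above (1 + 2)
    for p in module.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(num_valid)
```

Dividing by the total count of valid losses (3 above) rather than by the world size is presumably what the "Weighted Reduction" TODO items below refer to.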

TODO:

  • Weighted Reduction with accumulate_grad_batches = 1
  • Weighted Reduction with accumulate_grad_batches > 1
  • Test with ddp_spawn
  • Test with ddp_cpu
  • Test with ddp_sharded
  • Resolve logging being NaN in the progress bar (e.g. `self.log(filter_infinite=True)`)
  • Support AMP
  • Support APEX?
  • Test with Horovod (doesn't support `_register_comm_hook`)
  • Support ±Inf (can't use `torch.isfinite`, which requires torch 1.6+; see the sketch after this list)
  • Add parity test to make sure gradient reduction is done properly (accumulate_grad_batches 1, 2)
  • Separate test for both strategies
  • Train a real model to check that training works fine.
  • Check behaviour in manual_optimization setting.
  • Move to 1.2-dev as it is more of a new feature.
  • Test misconfigurations
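On the ±Inf item above: it only matters if versions older than the torch>=1.7 floor were ever targeted. A version-agnostic check could look like the hypothetical helper below (assuming `torch.isnan`/`torch.isinf` are available in the targeted versions):

```python
from typing import Optional

import torch


def is_invalid_loss(loss: Optional[torch.Tensor]) -> bool:
    """Hypothetical check: treat None, NaN and +/-Inf losses as invalid without torch.isfinite."""
    if loss is None:
        return True
    return bool((torch.isnan(loss) | torch.isinf(loss)).any())
```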

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes [if needed]?
  • Did you write any new necessary tests [no need for typos, docs]?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone are aligned!


pep8speaks commented Jan 5, 2021

Hello @tchaton! Thanks for updating this PR.

Line 198:121: E501 line too long (125 > 120 characters)

Line 26:121: E501 line too long (140 > 120 characters)
Line 54:121: E501 line too long (154 > 120 characters)

Line 94:111: E231 missing whitespace after ':'
Line 94:121: E501 line too long (125 > 120 characters)

Line 89:121: E501 line too long (138 > 120 characters)
Line 117:121: E501 line too long (125 > 120 characters)

Line 111:121: E501 line too long (135 > 120 characters)
Line 129:121: E501 line too long (135 > 120 characters)

Line 116:121: E501 line too long (140 > 120 characters)
Line 123:121: E501 line too long (128 > 120 characters)
Line 130:121: E501 line too long (124 > 120 characters)
Line 139:1: E302 expected 2 blank lines, found 0
Line 143:121: E501 line too long (129 > 120 characters)
Line 153:121: E501 line too long (127 > 120 characters)

Comment last updated at 2021-01-11 15:00:28 UTC

tchaton changed the title from "wip" to "Enable NaN loss on training_step + DDP" on Jan 5, 2021

codecov bot commented Jan 5, 2021

Codecov Report

Merging #5359 (cae0437) into master (a40e3a3) will decrease coverage by 0%.
The diff coverage is 82%.

@@          Coverage Diff           @@
##           master   #5359   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         134     134           
  Lines        9976   10018   +42     
======================================
+ Hits         9294    9326   +32     
- Misses        682     692   +10     

carmocca changed the title from "Enable NaN loss on training_step + DDP" to "Handle invalid training_step losses in DDP" on Jan 5, 2021
@@ -297,6 +297,11 @@ def distributed_sampler_kwargs(self):

return kwargs

def print_colored_rank(self, msg):
A reviewer (Contributor) commented:

is this for debugging the draft or do you want to merge it?

if so, i'd do it separately

tchaton (Contributor, Author) replied:

Yes, I will remove it when the PR is finished and submit a better version in another PR.

Comment on lines 874 to 875
"decision_on_invalid_result ``never_skip`` doesn't support returning None for training_step. "
"Hint: Either switch to ``skip_on_at_least_one`` or return a Nan or Inf loss directly"
A reviewer (Contributor) commented:

I would do this for the user instead of raising an Exception

tchaton (Contributor, Author) replied:

I would prefer to leave a MisconfigurationException here.

This would be triggered only when people change the property invalid_loss_strategy to return never_skip. I think they should be aware of what is happening and change their code accordingly.
I don't think Lightning should choose between the 2 modes.

carmocca (Contributor) replied on Jan 6, 2021:

I'm talking about "return a Nan or Inf loss directly".

it might be cumbersome to have to do (in training_step):

if loss is None:
    return torch.tensor(float('nan'))

I am OK with keeping skip_if_any as the default, but when users change the strategy to never_skip, they are already choosing what they want.
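In code, the check being debated boils down to something like the sketch below (names are taken loosely from the draft hunk above, which uses both `decision_on_invalid_result` and `invalid_loss_strategy`; this is not the exact implementation in the PR):

```python
from typing import Optional

import torch
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def validate_training_step_output(output: Optional[torch.Tensor], strategy: str) -> None:
    """Raise if the configured strategy cannot handle a None return from training_step."""
    if output is None and strategy == "never_skip":
        raise MisconfigurationException(
            "`never_skip` doesn't support returning None from training_step. "
            "Hint: either switch to `skip_if_any` or return a NaN/Inf loss directly."
        )
```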

@tchaton tchaton self-assigned this Jan 6, 2021
@tchaton tchaton added this to the 1.1.x milestone Jan 6, 2021
@tchaton tchaton removed this from the 1.1.x milestone Jan 6, 2021
@carmocca carmocca self-assigned this Jan 25, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.2.x Feb 8, 2021
Base automatically changed from master to release/1.1.x February 11, 2021 14:30
@carmocca carmocca modified the milestones: 1.2.x, 1.3 Feb 22, 2021
edenlightning (Contributor) commented:

@tchaton is there anything left TODO here or can you close?

carmocca (Contributor) commented Mar 3, 2021

> @tchaton is there anything left TODO here or can you close?

We need to address all the TODOs mentioned at the top. So many things left

Borda (Member) commented May 11, 2021

@tchaton any update here?

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@edenlightning edenlightning added the priority: 1 Medium priority task label Jul 6, 2021
@edenlightning edenlightning modified the milestones: v1.3.x, V1.4.X Jul 6, 2021
Borda (Member) commented Sep 23, 2021

This PR seems quite old and is several hundred commits behind master; consider finishing it or closing it, as the conflicts will most likely make it challenging to finish... 🐰

carmocca (Contributor) commented:

We can close this for now and use a new PR later.

sailordiary (Contributor) commented:

Hi, is this issue being tracked elsewhere?

carmocca (Contributor) commented Jan 4, 2022

The issue links are at the top:

Fixes #4956, #5243 (DDP), Fixes #4524 (AMP)

Labels
  • bug (Something isn't working)
  • design (Includes a design discussion)
  • distributed (Generic distributed-related topic)
  • feature (Is an improvement or enhancement)
  • has conflicts
  • priority: 1 (Medium priority task)