
Handle invalid training_step losses in DDP #5359

Closed
wants to merge 39 commits

Conversation

tchaton (Contributor) commented Jan 5, 2021

What does this PR do?

Fixes #4956, #5243 (DDP)
Fixes #4524 (AMP)

This PR allows choosing how to skip a step when training_step returns NaN, None, or inf in DDP.
We introduce a LightningModule attribute to choose the strategy (via a str/Enum).

This requires torch>=1.7 to work. We chose not to support earlier versions because the amount of work needed to support them is probably not worth it, given the number of users interested in this feature.
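As a rough illustration only (not the final API: the attribute name `invalid_loss_strategy` and the accepted values are taken from the current draft and may still change), opting into a strategy would look roughly like this:

```python
import torch
from torch.nn import functional as F
import pytorch_lightning as pl


class MyModel(pl.LightningModule):

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    @property
    def invalid_loss_strategy(self) -> str:
        # "skip_if_any" (default): skip the optimizer update on every process
        # if any process produced an invalid loss.
        # "never_skip": keep the valid gradients and reduce them manually across processes.
        return "never_skip"

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)  # may come out as NaN/Inf on bad batches
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```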

Strategy 1: "skip_if_any"
process1 -> invalid loss (None or NaN/inf tensor)
process2 -> tensor(0.7)

If any process produces an invalid loss, the optimizer update is skipped on all processes (this is the default).
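Conceptually (this is only a sketch of the decision, not the exact hook used in the PR), each process contributes a flag and the step is skipped everywhere as soon as one flag is set:

```python
import torch
import torch.distributed as dist


def should_skip_step(loss) -> bool:
    """Hypothetical helper: True on every rank if any rank produced an invalid loss."""
    invalid = loss is None or not bool(torch.isfinite(loss).all())
    flag = torch.tensor(float(invalid))
    if dist.is_available() and dist.is_initialized():
        if torch.cuda.is_available():
            flag = flag.cuda()  # NCCL requires the tensor to live on the GPU
        dist.all_reduce(flag, op=dist.ReduceOp.MAX)  # becomes 1.0 if any rank flagged its loss
    return bool(flag.item())
```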
Strategy 2: "never_skip"
process1 -> tensor(...) -> grad_1.1
process2 -> tensor(...) -> grad_2.1

process1 -> invalid loss
process2 -> tensor(...) -> grad_2.2

Decision: run the backward pass without sync on process 2, then all_reduce the gradients afterwards: (grad_1.1 + grad_2.1 + grad_2.2) / 3
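A simplified sketch of that reduction, assuming the valid gradients were accumulated under DDP's `no_sync()` and that the process group is initialized (the PR itself hooks into DDP via `_register_comm_hook`, see the TODO list below, which is why torch>=1.7 is required):

```python
import torch
import torch.distributed as dist


def reduce_valid_grads(module: torch.nn.Module, num_valid_local: int) -> None:
    """Hypothetical "never_skip" reduction: weight by valid losses instead of world size."""
    device = next(module.parameters()).device
    num_valid = torch.tensor(float(num_valid_local), device=device)
    dist.all_reduce(num_valid, op=dist.ReduceOp.SUM)  # 3 in the example above (1 + 2)
    for p in module.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad.div_(num_valid)
```

Dividing by the total count of valid losses (3 above) rather than by the world size is presumably what the "Weighted Reduction" TODO items below refer to.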

TODO:

  • Weighted Reduction with accumulate_grad_batches = 1
  • Weighted Reduction with accumulate_grad_batches > 1
  • Test with ddp_spawn
  • Test with ddp_cpu
  • Test with ddp_sharded
  • Resolve logging being NaN in the progress bar (e.g. `self.log(filter_infinite=True)`)
  • Support AMP
  • Support APEX?
  • Test with Horovod (doesn't support `_register_comm_hook`)
  • Support ±Inf (can't use `torch.isfinite`, which requires torch 1.6+; see the sketch after this list)
  • Add parity test to make sure gradient reduction is done properly (accumulate_grad_batches 1, 2)
  • Separate test for both strategies
  • Train a real model to check that training works fine.
  • Check behaviour in manual_optimization setting.
  • Move to 1.2-dev as it is more of a new feature.
  • Test misconfigurations
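On the ±Inf item above: it only matters if versions older than the torch>=1.7 floor were ever targeted. A version-agnostic check could look like the hypothetical helper below (assuming `torch.isnan`/`torch.isinf` are available in the targeted versions):

```python
from typing import Optional

import torch


def is_invalid_loss(loss: Optional[torch.Tensor]) -> bool:
    """Hypothetical check: treat None, NaN and +/-Inf losses as invalid without torch.isfinite."""
    if loss is None:
        return True
    return bool((torch.isnan(loss) | torch.isinf(loss)).any())
```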

Before submitting

  • Was this discussed/approved via a Github issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes [if needed]?
  • Did you write any new necessary tests [no need for typos, docs]?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone are aligned!


pep8speaks commented Jan 5, 2021

Hello @tchaton! Thanks for updating this PR.

Line 198:121: E501 line too long (125 > 120 characters)

Line 26:121: E501 line too long (140 > 120 characters)
Line 54:121: E501 line too long (154 > 120 characters)

Line 94:111: E231 missing whitespace after ':'
Line 94:121: E501 line too long (125 > 120 characters)

Line 89:121: E501 line too long (138 > 120 characters)
Line 117:121: E501 line too long (125 > 120 characters)

Line 111:121: E501 line too long (135 > 120 characters)
Line 129:121: E501 line too long (135 > 120 characters)

Line 116:121: E501 line too long (140 > 120 characters)
Line 123:121: E501 line too long (128 > 120 characters)
Line 130:121: E501 line too long (124 > 120 characters)
Line 139:1: E302 expected 2 blank lines, found 0
Line 143:121: E501 line too long (129 > 120 characters)
Line 153:121: E501 line too long (127 > 120 characters)

Comment last updated at 2021-01-11 15:00:28 UTC

tchaton changed the title from "wip" to "Enable NaN loss on training_step + DDP" on Jan 5, 2021

codecov bot commented Jan 5, 2021

Codecov Report

Merging #5359 (cae0437) into master (a40e3a3) will decrease coverage by 0%.
The diff coverage is 82%.

@@          Coverage Diff           @@
##           master   #5359   +/-   ##
======================================
- Coverage      93%     93%   -0%     
======================================
  Files         134     134           
  Lines        9976   10018   +42     
======================================
+ Hits         9294    9326   +32     
- Misses        682     692   +10     

carmocca changed the title from "Enable NaN loss on training_step + DDP" to "Handle invalid training_step losses in DDP" on Jan 5, 2021
@@ -297,6 +297,11 @@ def distributed_sampler_kwargs(self):

return kwargs

def print_colored_rank(self, msg):
A reviewer (Contributor) commented:

is this for debugging the draft or do you want to merge it?

if so, i'd do it separately

tchaton (Contributor, Author) replied:

Yes, I will remove it when the PR is finished and submit a better version in another PR.

Comment on lines 874 to 875
"decision_on_invalid_result ``never_skip`` doesn't support returning None for training_step. "
"Hint: Either switch to ``skip_on_at_least_one`` or return a Nan or Inf loss directly"
A reviewer (Contributor) commented:

I would do this for the user instead of raising an Exception

tchaton (Contributor, Author) replied:

I would prefer to leave a MisconfigurationException here.

This would be triggered only when people change the property invalid_loss_strategy to return never_skip. I think they should be aware of what is happening and change their code accordingly.
I don't think Lightning should choose between the 2 modes.

carmocca (Contributor) replied on Jan 6, 2021:

I'm talking about "return a Nan or Inf loss directly".

it might be cumbersome to have to do (in training_step):

if loss is None:
    return torch.tensor(float('nan'))

I am OK with keeping skip_if_any as the default, but when users change the strategy to never_skip, they are already choosing what they want.
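In code, the check being debated boils down to something like the sketch below (names are taken loosely from the draft hunk above, which uses both `decision_on_invalid_result` and `invalid_loss_strategy`; this is not the exact implementation in the PR):

```python
from typing import Optional

import torch
from pytorch_lightning.utilities.exceptions import MisconfigurationException


def validate_training_step_output(output: Optional[torch.Tensor], strategy: str) -> None:
    """Raise if the configured strategy cannot handle a None return from training_step."""
    if output is None and strategy == "never_skip":
        raise MisconfigurationException(
            "`never_skip` doesn't support returning None from training_step. "
            "Hint: either switch to `skip_if_any` or return a NaN/Inf loss directly."
        )
```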

@tchaton tchaton self-assigned this Jan 6, 2021
@tchaton tchaton added this to the 1.1.x milestone Jan 6, 2021
@tchaton tchaton removed this from the 1.1.x milestone Jan 6, 2021
@carmocca carmocca self-assigned this Jan 25, 2021
@edenlightning edenlightning modified the milestones: 1.2, 1.2.x Feb 8, 2021
Base automatically changed from master to release/1.1.x February 11, 2021 14:30
@carmocca carmocca modified the milestones: 1.2.x, 1.3 Feb 22, 2021
edenlightning (Contributor) commented:

@tchaton is there anything left TODO here or can you close?

carmocca (Contributor) commented Mar 3, 2021

> @tchaton is there anything left TODO here or can you close?

We need to address all the TODOs mentioned at the top. So many things left

Borda (Member) commented May 11, 2021

@tchaton any update here?

@edenlightning edenlightning added this to the v1.3.x milestone Jul 1, 2021
@edenlightning edenlightning added the priority: 1 Medium priority task label Jul 6, 2021
@edenlightning edenlightning modified the milestones: v1.3.x, V1.4.X Jul 6, 2021
Borda (Member) commented Sep 23, 2021

This PR seems quite old and is several hundred commits behind master; consider finishing it or closing it, as the conflicts will most likely make it challenging to finish... 🐰

carmocca (Contributor) commented:

We can close this for now and use a new PR later.

sailordiary (Contributor) commented:

Hi, is this issue being tracked elsewhere?

carmocca (Contributor) commented Jan 4, 2022

The issue links are at the top:

Fixes #4956, #5243 (DDP), Fixes #4524 (AMP)

Labels
  • bug (Something isn't working)
  • design (Includes a design discussion)
  • distributed (Generic distributed-related topic)
  • feature (Is an improvement or enhancement)
  • has conflicts
  • priority: 1 (Medium priority task)