
log/save_interval based on global step #3667

Merged
14 commits merged into master from feature/globalstep-logging-interval, Sep 30, 2020

Conversation

awaelchli
Contributor

@awaelchli awaelchli commented Sep 26, 2020

What does this PR do?

This issue came up frequently: "nothing is getting logged, what's the problem?"
If the number of training batches per epoch is smaller than row_log_interval, nothing gets logged at all.
This PR fixes that by basing row_log_interval on global_step.

The current logging behaviour does not change, except in the edge case where epochs are shorter than the logging interval.
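For illustration, a minimal sketch of the idea behind the change; the function names below are hypothetical and the actual Trainer internals differ:

# Hedged sketch, not the actual Trainer code.

# Before: the interval check used the batch index within the current epoch,
# so an epoch with fewer batches than row_log_interval never triggered a log.
def should_log_old(batch_idx: int, row_log_interval: int) -> bool:
    return (batch_idx + 1) % row_log_interval == 0

# After: the interval check uses global_step, which keeps increasing across
# epochs, so logging still happens even when every epoch is very short.
def should_log_new(global_step: int, row_log_interval: int) -> bool:
    return (global_step + 1) % row_log_interval == 0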

Fixes #3466
Fixes #3506

Before submitting

  • Was this discussed/approved via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together? Otherwise, we ask you to create a separate PR for every change.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?
  • Did you verify new and existing tests pass locally with your changes?
  • If you made a notable change (that affects users), did you update the CHANGELOG?

PR review

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues, there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@pep8speaks

pep8speaks commented Sep 26, 2020

Hello @awaelchli! Thanks for updating this PR.

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-09-29 14:19:21 UTC

@awaelchli awaelchli added the bug (Something isn't working) and feature (Is an improvement or enhancement) labels and removed the bug label Sep 26, 2020
@awaelchli awaelchli marked this pull request as ready for review September 26, 2020 22:07
@mergify mergify bot requested a review from a team September 26, 2020 22:08
@awaelchli awaelchli removed the request for review from a team September 26, 2020 22:08
@mergify mergify bot requested a review from a team September 26, 2020 22:09
@rohitgr7
Contributor

rohitgr7 commented Sep 26, 2020

Global step updates only on the training step, right? I tried logging it in training_epoch_end with batches=50 per epoch and got these results:

Train epoch end: Epoch: 0 Global Step: 49
Train epoch end: Epoch: 1 Global Step: 100
Train epoch end: Epoch: 2 Global Step: 151
Train epoch end: Epoch: 3 Global Step: 202

It is incrementing by an extra +1 after every epoch. Is this expected behavior or a bug?
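A quick arithmetic check of the numbers above (assuming 50 batches per epoch; this snippet is illustrative and not from the thread):

# With 50 batches per epoch, global_step read in training_epoch_end would be
# 49, 99, 149, 199 if nothing else incremented it; an extra increment at each
# earlier epoch end gives the observed 49, 100, 151, 202.
batches_per_epoch = 50
for epoch in range(4):
    expected = (epoch + 1) * batches_per_epoch - 1   # 49, 99, 149, 199
    observed = expected + epoch                      # 49, 100, 151, 202
    print(f"epoch {epoch}: expected {expected}, observed {observed}")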

@awaelchli
Contributor Author

@rohitgr7 Yes, I remember discussing this before with someone, but I can't find the issue or PR, unfortunately.
I don't remember whether the conclusion was that it is a bug or not. The +1 comes from training_epoch_end: if we log there, it creates an extra step. As far as I know, it has always been like this. Ideally, the logs in epoch end should be combined with the last log call of the training loop.

@rohitgr7
Contributor

Ideally, the logs in epoch end should be combined with the last log call of the training loop.

Agreed.

@awaelchli Ideally, IMO, training_epoch_end does not do any learning step, and since we are now logging w.r.t. global_step, it should not increment at training_epoch_end.

@awaelchli
Contributor Author

awaelchli commented Sep 26, 2020

So you're saying we should first fix this global step problem before we can switch to logging on global step? OK, that makes sense.
I'm not sure yet how to approach it.

@awaelchli awaelchli marked this pull request as draft September 26, 2020 23:01
@rohitgr7
Contributor

rohitgr7 commented Sep 26, 2020

Maybe just removing this line will do the job, but only if global_step is meant not to update in training_epoch_end.
https://github.com/PyTorchLightning/pytorch-lightning/blob/a94728c99b68085b861f7384c8ecc163b2d401b0/pytorch_lightning/trainer/connectors/logger_connector.py#L237

@awaelchli awaelchli changed the title from "log/save_interval based on global step" to "[blocked by #3673] log/save_interval based on global step" Sep 27, 2020
@awaelchli
Contributor Author

@rohitgr7 I'm trying your suggestion in #3673 to see if any tests fail.

@Borda
Member

Borda commented Sep 27, 2020

@rohitgr7 I'm trying your suggestion in #3673 to see if any tests fail.

What do you mean by all? We may already have this bad behaviour in some other places in the tests...

@awaelchli awaelchli changed the title from "[blocked by #3673] log/save_interval based on global step" back to "log/save_interval based on global step" Sep 28, 2020
@awaelchli awaelchli added the bug (Something isn't working) label Sep 28, 2020
@awaelchli awaelchli marked this pull request as ready for review September 28, 2020 04:13
@awaelchli awaelchli requested a review from rohitgr7 September 28, 2020 04:14
@awaelchli
Contributor Author

It is incrementing by an extra +1 after every epoch. Is this expected behavior or a bug?

@rohitgr7 William fixed it; it should be good now, and we can trust global_step to be aligned across all epochs.

@mergify mergify bot requested a review from a team September 28, 2020 06:29
@mergify mergify bot requested a review from a team September 28, 2020 11:22
@mergify
Contributor

mergify bot commented Sep 28, 2020

This pull request is now in conflict... :(

Contributor

@rohitgr7 rohitgr7 left a comment


LGTM ✌️

@mergify mergify bot requested a review from a team September 28, 2020 19:32
@Borda Borda requested a review from SkafteNicki September 28, 2020 22:28
Member

@Borda Borda left a comment


LGTM

@williamFalcon
Contributor

awesome fix!

@codecov

codecov bot commented Sep 30, 2020

Codecov Report

Merging #3667 into master will increase coverage by 5%.
The diff coverage is 100%.

@@           Coverage Diff           @@
##           master   #3667    +/-   ##
=======================================
+ Coverage      85%     90%    +5%     
=======================================
  Files         110     110            
  Lines        8355    8849   +494     
=======================================
+ Hits         7141    7982   +841     
+ Misses       1214     867   -347     

@awaelchli awaelchli merged commit 9405c88 into master Sep 30, 2020
@awaelchli awaelchli deleted the feature/globalstep-logging-interval branch September 30, 2020 10:26
@Borda Borda added this to the 0.10.0 milestone Oct 7, 2020
Labels
bug (Something isn't working), feature (Is an improvement or enhancement)

Projects
None yet

Development
Successfully merging this pull request may close these issues:

Tensorboard display nothing
Logging Intervals and Evaluation Intervals

7 participants