
[bugfix] Logging only on not should_accumulate() during training #5417

Merged: 6 commits into master on Jan 9, 2021

Conversation

@tchaton (Contributor) commented on Jan 8, 2021

What does this PR do?

This PR fixes logging when training with accumulate_grad_batches > 1.

Solution: we can't assume the logged metrics can be averaged across an accumulation window, so we log only when optimizer_step is about to be called, i.e. when should_accumulate() returns False.
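A minimal sketch of the gating idea (hypothetical class and method names; in Lightning itself the change lives in the training loop / LoggerConnector internals):

    # Sketch only: metrics are produced on every batch, but flushed to the
    # logger only when the accumulation window closes and optimizer_step runs.
    class TrainingLoopSketch:
        def __init__(self, accumulate_grad_batches):
            self.accumulate_grad_batches = accumulate_grad_batches
            self.batch_idx = 0

        def should_accumulate(self):
            # True while we are still inside an accumulation window
            return (self.batch_idx + 1) % self.accumulate_grad_batches != 0

        def run_training_batch(self, compute_metrics, flush_to_logger):
            metrics = compute_metrics()  # computed on every batch
            if not self.should_accumulate():
                # optimizer_step is about to run -> safe point to log
                flush_to_logger(metrics)
            self.batch_idx += 1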

Previously: [screenshot, 2021-01-08 13:16]

Now: [screenshot, 2021-01-08 12:55]

Code used to generate the visualization

import torch
import torch.nn.functional as F
import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


def test_logging_with_accumulate_grad_batches(tmpdir):
    class LitClassifier(pl.LightningModule):
        def __init__(self, hidden_dim=128, learning_rate=1e-3):
            super().__init__()
            self.save_hyperparameters()

            self.l1 = torch.nn.Linear(28 * 28, self.hparams.hidden_dim)
            self.l2 = torch.nn.Linear(self.hparams.hidden_dim, 10)

            self.train_acc = pl.metrics.Accuracy()

        def forward(self, x):
            x = x.view(x.size(0), -1)
            x = torch.relu(self.l1(x))
            return self.l2(x)

        def training_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self(x)
            loss = F.cross_entropy(y_hat, y)
            self.log('train_acc', self.train_acc(y_hat, y), on_step=True, on_epoch=True, prog_bar=True)
            self.log('train_loss', loss)
            return loss

        def validation_step(self, batch, batch_idx):
            x, y = batch
            y_hat = self(x)
            loss = F.cross_entropy(y_hat, y)
            self.log('val_loss', loss)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)

    def run_test(model, max_epochs, accumulate_grad_batches, batch_size, num_workers=4):
        dataset = MNIST('', train=True, download=True, transform=transforms.ToTensor())
        mnist_train, mnist_val = random_split(dataset, [55000, 5000])
        train_loader = DataLoader(mnist_train, batch_size=batch_size, num_workers=num_workers)
        val_loader = DataLoader(mnist_val, batch_size=batch_size, num_workers=num_workers)

        trainer = pl.Trainer(
            logger=WandbLogger(name="bug", project='.....', save_dir=".", log_model=False),
            accumulate_grad_batches=accumulate_grad_batches,
            limit_train_batches=100,
            log_every_n_steps=5,
            max_epochs=max_epochs,
        )
        trainer.fit(model, train_loader, val_loader)

    model = LitClassifier()

    run_test(model, 3, 1, 32)
    run_test(model, 3, 8, 32)

Fixes #5405

Before submitting

  • Was this discussed/approved via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or internal minor changes/refactorings)

PR review

Anyone in the community is free to review the PR once the tests have passed.
Before you start reviewing, make sure you have read the Review guidelines. In short, see the following bullet list:

  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified
  • Check that target branch and milestone match!

Did you have fun?

Make sure you had fun coding 🙃

@tchaton tchaton self-assigned this Jan 8, 2021
@tchaton tchaton added labels: logger (Related to the Loggers), priority: 0 (High priority task) on Jan 8, 2021
@tchaton tchaton added this to the 1.1.x milestone Jan 8, 2021
@codecov bot commented on Jan 8, 2021

Codecov Report

Merging #5417 (20b1abf) into master (f2e99d6) will not change coverage.
The diff coverage is 100%.

@@          Coverage Diff           @@
##           master   #5417   +/-   ##
======================================
  Coverage      93%     93%           
======================================
  Files         134     134           
  Lines        9996    9996           
======================================
  Hits         9313    9313           
  Misses        683     683           

@tchaton tchaton requested review from awaelchli, Borda and SeanNaren and removed request for awaelchli and Borda January 8, 2021 13:55
@carmocca (Contributor) commented on Jan 8, 2021

Not familiar with W&B. Can you see the raw data for the epoch graph? Doesn't look right, does it?

@tchaton tchaton marked this pull request as ready for review January 8, 2021 16:05
@tchaton (Contributor, Author) commented on Jan 8, 2021

> Not familiar with W&B. Can you see the raw data for the epoch graph? Doesn't look right, does it?

The graph is contracted because we now log on the number of optimizer_step calls rather than on every batch.

Here are the other options, which I didn't prefer:

[screenshot, 2021-01-08 12:58]
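As a rough illustration of why the curve contracts, assuming the step axis advances once per optimizer step (numbers taken from the repro script above):

    # Back-of-the-envelope: expected logged points per epoch.
    limit_train_batches = 100
    accumulate_grad_batches = 8
    log_every_n_steps = 5

    optimizer_steps_per_epoch = limit_train_batches // accumulate_grad_batches
    logged_points_per_epoch = optimizer_steps_per_epoch // log_every_n_steps

    print(optimizer_steps_per_epoch)  # 12 optimizer steps per epoch
    print(logged_points_per_epoch)    # ~2 logged points per epoch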

@teddykoker (Contributor) left a comment:

Great fix :)

@Borda (Member) left a comment:

lgtm, just check back-compatibility and chlog

@@ -158,7 +158,7 @@ def cache_training_step_metrics(self, opt_closure_result):
         self.logged_metrics.update(logged_metrics_tmp)
         self.cached_results.legacy_batch_log_metrics.update(logged_metrics_tmp)

-    def log_metrics(self, metrics, grad_norm_dic, step=None, log_train_step_metrics=False):
+    def log_metrics(self, metrics, grad_norm_dic, step=None):
@Borda (Member) commented:

It is used by a user, right? Then let's keep backward compatibility with the API and add a warning...

@tchaton (Contributor, Author) commented:

It is internal: the LoggerConnector class.

@tchaton (Contributor, Author) commented:

No, it is not.
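For reference, had the method been part of the public API, a backward-compatibility shim along the lines Borda suggested might look like this (hypothetical sketch, not what the PR does):

    import warnings

    class LoggerConnector:  # illustration only, not the real class body
        def log_metrics(self, metrics, grad_norm_dic, step=None, log_train_step_metrics=None):
            # Accept the removed keyword, warn, and ignore it.
            if log_train_step_metrics is not None:
                warnings.warn(
                    "`log_train_step_metrics` is deprecated and has no effect; "
                    "metrics are now logged only when optimizer_step runs.",
                    DeprecationWarning,
                )
            # ... continue with the new logging path ...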

@tchaton tchaton enabled auto-merge (squash) January 8, 2021 21:54
@tchaton tchaton merged commit a053d75 into master Jan 9, 2021
@tchaton tchaton deleted the bugfix/5405_logging_with_accumulated_gradient branch January 9, 2021 00:35
SeanNaren pushed a commit that referenced this pull request on Jan 12, 2021

…5417)

* resolve bug
* resolve tests
* update
* Update tests/loggers/test_tensorboard.py

Co-authored-by: Carlos Mocholí <carlossmocholi@gmail.com>

(cherry picked from commit a053d75)
SeanNaren and Borda pushed further commits referencing this pull request between Jan 12 and Jan 26, 2021, each cherry picked from commit a053d75.
Labels: logger (Related to the Loggers), priority: 0 (High priority task)

Projects: None yet

Development: Successfully merging this pull request may close these issues:

W&B logger not working as expected with accumulate_grad_batches > 1

6 participants