Loss divided by accumulate_grad_batches number #5680

Closed
duyduc1110 opened this issue on Jan 27, 2021 · 2 comments
Labels: bug, help wanted, logging, priority: 0, waiting on author

Comments


duyduc1110 commented Jan 27, 2021

🐛 Bug

After 1.1.4 with the fix from #5417, logging was fixed, but my loss is now divided by accumulate_grad_batches.

Please reproduce using the BoringModel

Sorry, there is no BoringModel; I am pasting my code here:

To Reproduce

    def training_step(self, batch, batch_idx, optimizer_idx):
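        # `autocast` here is presumably torch.cuda.amp.autocast (mixed precision handled manually)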
        with autocast():
            outs = self.model(**batch)
            gen_loss = outs['gen_loss']
            dis_loss = outs['dis_loss']

        (gen_opt, dis_opt) = self.optimizers()

        # Manual backward Generator loss
        self.manual_backward(gen_loss, gen_opt)
        gen_opt.step()

        # Manual backward Discriminator loss
        self.manual_backward(dis_loss, dis_opt)
        dis_opt.step()

        # Accumulate losses
        self.loss_accumulation(gen_loss.cpu().item(), dis_loss.cpu().item())

        # Log all losses
        self.log_metrics()

        # Logging total loss to the progress bar
        total_loss = (gen_loss + dis_loss).cpu().item()
        self.log('total_loss', total_loss, prog_bar=True, logger=False, on_step=True, on_epoch=False) # THIS IS THE BUG

I use DDP with manual backward.
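
For context, my Trainer is set up roughly along these lines (a minimal sketch with illustrative values, not my actual script, using PyTorch Lightning 1.1.x-era arguments):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        gpus=1,
        accelerator="ddp",              # DDP, as mentioned above
        accumulate_grad_batches=3,      # the setting under which the logged loss looks divided
        automatic_optimization=False,   # manual optimization, matching training_step above
    )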

If I use my environment with pl==1.1.0, or set accumulate_grad_batches=1 on version 1.1.6, the loss is around 11 at first:

[screenshot: training logs showing the loss around 11]

If I use accumulate_grad_batches=3, the loss is divided by 3:

[screenshot: training logs showing the loss divided by 3]

Expected behavior

The loss should not be divided.
I guess that in 1.1.3 and earlier, the train loop summed all the losses and then averaged them; now it divides each loss by accumulate_grad_batches and then sums.
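
A rough, plain-Python illustration of the difference I mean (the numbers and the accumulation factor are made up; this is not Lightning's actual code):

    accumulate_grad_batches = 3
    step_losses = [11.0, 11.2, 10.8]   # raw per-step losses, roughly the scale I see with pl==1.1.0

    # What I believe 1.1.3 and earlier did: sum the raw step losses, then average once per
    # accumulation window, so the reported value stays at the raw scale (~11).
    windowed_average = sum(step_losses) / accumulate_grad_batches
    print(windowed_average)            # ~11.0

    # What seems to happen now: each step loss is divided by accumulate_grad_batches first,
    # so the per-step value that gets displayed/logged is ~loss / 3 (~3.7).
    per_step_scaled = [loss / accumulate_grad_batches for loss in step_losses]
    print(per_step_scaled)             # ~[3.67, 3.73, 3.60]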

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: Quadro RTX8000
  • Any other relevant information:
duyduc1110 added the bug and help wanted labels on Jan 27, 2021
Borda added the logging label on Jan 29, 2021
edenlightning added the priority: 0 label on Feb 9, 2021

kaushikb11 commented Feb 11, 2021

Hi @duyduc1110, I am not able to reproduce the bug. Could you share the code in a Colab by any chance? That would be helpful. Also, could you please try with the latest version?

edenlightning added the waiting on author label on Feb 16, 2021
edenlightning commented

Please feel free to reopen with a reproducible example!
