Loss divided by accumulate_grad_batches number #5680

Closed
duyduc1110 opened this issue on Jan 27, 2021 · 2 comments
Labels: bug, help wanted, logging, priority: 0, waiting on author

Comments


duyduc1110 commented Jan 27, 2021

🐛 Bug

After 1.1.4 with the fix from #5417, logging was fixed, but my loss is now divided by accumulate_grad_batches.

Please reproduce using the BoringModel

Sorry, there is no BoringModel; I am pasting my code here:

To Reproduce

    def training_step(self, batch, batch_idx, optimizer_idx):
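        # `autocast` here is presumably torch.cuda.amp.autocast (mixed precision handled manually)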
        with autocast():
            outs = self.model(**batch)
            gen_loss = outs['gen_loss']
            dis_loss = outs['dis_loss']

        (gen_opt, dis_opt) = self.optimizers()

        # Manual backward Generator loss
        self.manual_backward(gen_loss, gen_opt)
        gen_opt.step()

        # Manual backward Discriminator loss
        self.manual_backward(dis_loss, dis_opt)
        dis_opt.step()

        # Accumulate losses
        self.loss_accumulation(gen_loss.cpu().item(), dis_loss.cpu().item())

        # Log all losses
        self.log_metrics()

        # Logging total loss to the progress bar
        total_loss = (gen_loss + dis_loss).cpu().item()
        self.log('total_loss', total_loss, prog_bar=True, logger=False, on_step=True, on_epoch=False) # THIS IS THE BUG

I use DDP with manual backward.
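
For context, my Trainer is set up roughly along these lines (a minimal sketch with illustrative values, not my actual script, using PyTorch Lightning 1.1.x-era arguments):

    from pytorch_lightning import Trainer

    trainer = Trainer(
        gpus=1,
        accelerator="ddp",              # DDP, as mentioned above
        accumulate_grad_batches=3,      # the setting under which the logged loss looks divided
        automatic_optimization=False,   # manual optimization, matching training_step above
    )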

If I use my environment with pl==1.1.0, or set accumulate_grad_batches=1 on version 1.1.6, the loss is around 11 at first:

[screenshot: training logs showing the loss around 11]

If I use accumulate_grad_batches=3, the loss is divided by 3:

[screenshot: training logs showing the loss divided by 3]

Expected behavior

The loss should not be divided.
I guess that in 1.1.3 and earlier, the train loop summed all the losses and then averaged them; now it divides each loss by accumulate_grad_batches and then sums.
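
A rough, plain-Python illustration of the difference I mean (the numbers and the accumulation factor are made up; this is not Lightning's actual code):

    accumulate_grad_batches = 3
    step_losses = [11.0, 11.2, 10.8]   # raw per-step losses, roughly the scale I see with pl==1.1.0

    # What I believe 1.1.3 and earlier did: sum the raw step losses, then average once per
    # accumulation window, so the reported value stays at the raw scale (~11).
    windowed_average = sum(step_losses) / accumulate_grad_batches
    print(windowed_average)            # ~11.0

    # What seems to happen now: each step loss is divided by accumulate_grad_batches first,
    # so the per-step value that gets displayed/logged is ~loss / 3 (~3.7).
    per_step_scaled = [loss / accumulate_grad_batches for loss in step_losses]
    print(per_step_scaled)             # ~[3.67, 3.73, 3.60]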

Environment

  • PyTorch Version (e.g., 1.0): 1.7.1
  • OS (e.g., Linux): Ubuntu 18.04
  • How you installed PyTorch (conda, pip, source): conda
  • Build command you used (if compiling from source):
  • Python version: 3.8
  • CUDA/cuDNN version: 10.1
  • GPU models and configuration: Quadro RTX8000
  • Any other relevant information:
duyduc1110 added the bug and help wanted labels on Jan 27, 2021
Borda added the logging label on Jan 29, 2021
edenlightning added the priority: 0 label on Feb 9, 2021

kaushikb11 commented Feb 11, 2021

Hi @duyduc1110, I am not able to reproduce the bug. Could you share the code in a Colab by any chance? That would be helpful. Also, could you please try with the latest version?

edenlightning added the waiting on author label on Feb 16, 2021
edenlightning commented

Please feel free to reopen with a reproducible example!
