
When returning a value from the "training_step" function, one must specify "loss", otherwise an error results #7750

Closed
Dehde opened this issue May 28, 2021 · 6 comments · Fixed by #7779
Assignees: justusschock
Labels: bug (Something isn't working), docs (Documentation related), help wanted (Open to be worked on), refactor

Comments

@Dehde

Dehde commented May 28, 2021

🐛 Bug

I want the training step to also return the probabilities predicted in that step, so that I can calculate the F1 score for the entire epoch. To do so, I have the training step return the predictions and the ground-truth values, following the instructions here: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#train-epoch-level-operations.

The code sample there includes "loss", yet nowhere did I read that you have to specify the loss. But you do have to specify it, otherwise an exception is thrown:

Traceback (most recent call last):
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 163, in <module>
    model.fit(train_ds, valid_ds)
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 74, in fit
    self.trainer.fit(self.model, train_loader, val_loader)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 731, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 432, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 726, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 814, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
    closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
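
For context on why this particular TypeError appears: when training_step returns a dict, Lightning reads the "loss" key to drive the backward pass; with the key missing, the internal loss reference is None, and the division by accumulate_grad_batches in the last frame above fails. A minimal sketch of that logic (illustrative only, not Lightning's actual code; the function name is made up):

    def process_training_step_output(output, accumulate_grad_batches):
        """Rough sketch of how the training loop treats the training_step output."""
        if isinstance(output, dict):
            loss = output.get("loss")  # None when the key is missing
        else:
            loss = output  # a bare tensor is treated as the loss
        # With no "loss" key, loss is None here, so this raises:
        # TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
        return loss / accumulate_grad_batches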

The code that produces this bug:

    class TransferLearner(pl.LightningModule):
        ...

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            logits = self.forward(x)
            loss = self.cross_entropy_loss(logits, y)
            self.log('train_loss', loss)
            preds = F.softmax(logits, dim=1)
            # note: no "loss" key in the returned dict -- this triggers the error
            return {"preds": preds, "gt": y}

        def training_epoch_end(self, outputs):
            preds = torch.cat([output["preds"] for output in outputs])
            gt = torch.cat([output["gt"] for output in outputs])
            f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
            self.log("train/f1_score", f1_score)

The only change I needed to make was to add the loss to the dictionary being returned:

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        self.log('train_loss', loss)
        preds = F.softmax(logits, dim=1)
        # the "loss" key lets Lightning run the backward pass and optimizer step
        return {"loss": loss, "preds": preds, "gt": y}

    def training_epoch_end(self, outputs):
        preds = torch.cat([output["preds"] for output in outputs])
        gt = torch.cat([output["gt"] for output in outputs])
        f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
        self.log("train/f1_score", f1_score)

But I only stumbled upon this fix by chance. Neither the docs nor the error message made it clear to me what the problem was.
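
As an aside, the predictions don't even need to be returned when the only goal is an epoch-level F1 score: torchmetrics' class-based metrics accumulate state across batches. A sketch assuming the class-based torchmetrics.F1 API (the counterpart of torchmetrics.functional.f1; adjust to your torchmetrics version):

    class TransferLearner(pl.LightningModule):
        def __init__(self, num_classes):
            super().__init__()
            self.num_classes = num_classes
            # stateful metric: accumulates across all batches of the epoch
            self.train_f1 = torchmetrics.F1(num_classes=num_classes)

        ...

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            logits = self.forward(x)
            loss = self.cross_entropy_loss(logits, y)
            self.log('train_loss', loss)
            self.train_f1(F.softmax(logits, dim=1), y)
            # on_epoch=True logs the value computed over the whole epoch
            self.log("train/f1_score", self.train_f1, on_epoch=True)
            return loss  # a bare loss tensor is also a valid return value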

The versions I am using:

    pytorch_lightning.__version__  # '1.3.1'
    torch.__version__  # '1.7.1'

I hope this information suffices; otherwise, please let me know what else I should provide.
Thanks for this great repository, by the way!

Dehde added the bug (Something isn't working) and help wanted (Open to be worked on) labels May 28, 2021
@justusschock
Member

Hi @Dehde,

This is shown here. Do you think this is not obvious enough?

@Dehde
Author

Dehde commented May 28, 2021

Hi @justusschock , thanks for the quick reply!
I mean, it's in the code snippet, but I don't see it say "if you don't return the loss, there will be an exception".
Ideally, two changes would be nice:

  1. Throw a more readable exception when the loss is missing from what training_step returns (the kind of check is sketched below).
  2. In any definition of the training step, it would be nice to see the loss being returned.

Regarding point 2: for my first working version, I just copied some demo code in which training_step did not return anything. So I was quite surprised that, once I added a return statement, the code crashed with what seemed to me an unintuitive exception.

I know it now, and it is definitely a beginner's mistake. But two other people I know recently started trying out PyTorch Lightning and ran into the same issue, which is also why I wanted to raise it here.
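
For illustration, the kind of guard point 1 asks for might look roughly like this (a hypothetical sketch; the actual fix landed via #7779 and may differ):

    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    def check_training_step_output(output):
        """Hypothetical validation of the value returned by training_step."""
        if isinstance(output, dict) and "loss" not in output:
            raise MisconfigurationException(
                "In automatic optimization, `training_step` must return a "
                "Tensor, a dict with the key 'loss', or None (to skip the batch)."
            )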

@justusschock
Member

@Dehde thanks for the feedback, we really appreciate it :)

I think it should be part of every example (either as a tensor or as a key in the dict). I will make sure it gets a proper warning/error message.

When you encounter an example that doesn't have this, please either notify me (or one of the team) or open a pull request to adjust the example yourself :)

justusschock self-assigned this May 28, 2021
justusschock added the docs (Documentation related) and refactor labels May 28, 2021
@Dehde
Author

Dehde commented May 28, 2021

Cool! Will do!

@Borda
Member

Borda commented May 31, 2021

@Dehde have you succeeded in solving this issue? 🐰

@justusschock
Member

@Borda I'll leave this open until I've added a PR with a proper error message/warning.
