
When returning a value from the "training_step" function, one must specify "loss", otherwise an error results #7750

Closed
Dehde opened this issue May 28, 2021 · 6 comments · Fixed by #7779
Assignees: justusschock
Labels: bug (Something isn't working), docs (Documentation related), help wanted (Open to be worked on), refactor

Comments

@Dehde

Dehde commented May 28, 2021

🐛 Bug

I want the training step to also return the probabilities predicted in that step, so that I can calculate the F1 score for the entire epoch. To do so, I have the training step return the predictions and the ground-truth values, following the instructions here: https://pytorch-lightning.readthedocs.io/en/latest/common/lightning_module.html#train-epoch-level-operations.

The code sample there includes "loss", yet nowhere did I read that you have to specify the loss. But you do have to specify it, otherwise an exception is thrown:

Traceback (most recent call last):
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 163, in <module>
    model.fit(train_ds, valid_ds)
  File "/Users/rob/PycharmProjects/bauteilerkennung/DeepLearningPart/bte/models/image.py", line 74, in fit
    self.trainer.fit(self.model, train_loader, val_loader)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 869, in run_train
    self.train_loop.run_training_epoch()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 490, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 731, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 432, in optimizer_step
    using_lbfgs=is_lbfgs,
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 329, in optimizer_step
    self.run_optimizer_step(optimizer, opt_idx, lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in run_optimizer_step
    self.training_type_plugin.optimizer_step(optimizer, lambda_closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 193, in optimizer_step
    optimizer.step(closure=lambda_closure, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/torch/optim/adam.py", line 66, in step
    loss = closure()
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 726, in train_step_and_backward_closure
    split_batch, batch_idx, opt_idx, optimizer, self.trainer.hiddens
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 814, in training_step_and_backward
    result = self.training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/Users/rob/anaconda3/envs/bte/lib/python3.6/site-packages/pytorch_lightning/trainer/training_loop.py", line 301, in training_step
    closure_loss = training_step_output.minimize / self.trainer.accumulate_grad_batches
TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
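
For context on why this particular TypeError appears: when training_step returns a dict, Lightning reads the "loss" key to drive the backward pass; with the key missing, the internal loss reference is None, and the division by accumulate_grad_batches in the last frame above fails. A minimal sketch of that logic (illustrative only, not Lightning's actual code; the function name is made up):

    def process_training_step_output(output, accumulate_grad_batches):
        """Rough sketch of how the training loop treats the training_step output."""
        if isinstance(output, dict):
            loss = output.get("loss")  # None when the key is missing
        else:
            loss = output  # a bare tensor is treated as the loss
        # With no "loss" key, loss is None here, so this raises:
        # TypeError: unsupported operand type(s) for /: 'NoneType' and 'int'
        return loss / accumulate_grad_batches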

The code that produces this bug:

    class TransferLearner(pl.LightningModule):
        ...

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            logits = self.forward(x)
            loss = self.cross_entropy_loss(logits, y)
            self.log('train_loss', loss)
            preds = F.softmax(logits, dim=1)
            # note: no "loss" key in the returned dict -- this triggers the error
            return {"preds": preds, "gt": y}

        def training_epoch_end(self, outputs):
            preds = torch.cat([output["preds"] for output in outputs])
            gt = torch.cat([output["gt"] for output in outputs])
            f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
            self.log("train/f1_score", f1_score)

The only change I needed to make was to add the loss to the dictionary being returned:

    def training_step(self, train_batch, batch_idx):
        x, y = train_batch
        logits = self.forward(x)
        loss = self.cross_entropy_loss(logits, y)
        self.log('train_loss', loss)
        preds = F.softmax(logits, dim=1)
        # the "loss" key lets Lightning run the backward pass and optimizer step
        return {"loss": loss, "preds": preds, "gt": y}

    def training_epoch_end(self, outputs):
        preds = torch.cat([output["preds"] for output in outputs])
        gt = torch.cat([output["gt"] for output in outputs])
        f1_score = torchmetrics.functional.f1(preds, gt, num_classes=self.num_classes)
        self.log("train/f1_score", f1_score)

But I only stumbled upon this fix by chance. Neither the docs nor the error message made it clear to me what the problem was.
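
As an aside, the predictions don't even need to be returned when the only goal is an epoch-level F1 score: torchmetrics' class-based metrics accumulate state across batches. A sketch assuming the class-based torchmetrics.F1 API (the counterpart of torchmetrics.functional.f1; adjust to your torchmetrics version):

    class TransferLearner(pl.LightningModule):
        def __init__(self, num_classes):
            super().__init__()
            self.num_classes = num_classes
            # stateful metric: accumulates across all batches of the epoch
            self.train_f1 = torchmetrics.F1(num_classes=num_classes)

        ...

        def training_step(self, train_batch, batch_idx):
            x, y = train_batch
            logits = self.forward(x)
            loss = self.cross_entropy_loss(logits, y)
            self.log('train_loss', loss)
            self.train_f1(F.softmax(logits, dim=1), y)
            # on_epoch=True logs the value computed over the whole epoch
            self.log("train/f1_score", self.train_f1, on_epoch=True)
            return loss  # a bare loss tensor is also a valid return value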

The versions I am using:

    pytorch_lightning.__version__  # '1.3.1'
    torch.__version__  # '1.7.1'

I hope this information suffices; otherwise, please let me know what else I should provide.
Thanks for this great repository, by the way!

Dehde added the bug (Something isn't working) and help wanted (Open to be worked on) labels May 28, 2021
@justusschock
Member

Hi @Dehde,

This is shown here. Do you think this is not obvious enough?

@Dehde
Author

Dehde commented May 28, 2021

Hi @justusschock , thanks for the quick reply!
I mean, it's in the code snippet, but I don't see it say "if you don't return the loss, there will be an exception".
Ideally, two changes would be nice:

  1. Throw a more readable exception when the loss is missing from what training_step returns (the kind of check is sketched below).
  2. In any definition of the training step, it would be nice to see the loss being returned.

Regarding point 2: for my first working version, I just copied some demo code in which training_step did not return anything. So I was quite surprised that, once I added a return statement, the code crashed with what seemed to me an unintuitive exception.

I know it now, and it is definitely a beginner's mistake. But two other people I know recently started trying out PyTorch Lightning and ran into the same issue, which is also why I wanted to raise it here.
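
For illustration, the kind of guard point 1 asks for might look roughly like this (a hypothetical sketch; the actual fix landed via #7779 and may differ):

    from pytorch_lightning.utilities.exceptions import MisconfigurationException

    def check_training_step_output(output):
        """Hypothetical validation of the value returned by training_step."""
        if isinstance(output, dict) and "loss" not in output:
            raise MisconfigurationException(
                "In automatic optimization, `training_step` must return a "
                "Tensor, a dict with the key 'loss', or None (to skip the batch)."
            )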

@justusschock
Member

@Dehde thanks for the feedback, we really appreciate it :)

I think it should be part of every example (either as a tensor or as a key in the dict). I will make sure it gets a proper warning/error message.

When you encounter an example that doesn't have this, please either notify me (or one of the team) or open a pull request to adjust the example yourself :)

justusschock self-assigned this May 28, 2021
justusschock added the docs (Documentation related) and refactor labels May 28, 2021
@Dehde
Author

Dehde commented May 28, 2021

Cool! Will do!

@Borda
Member

Borda commented May 31, 2021

@Dehde have you succeeded in solving this issue? 🐰

@justusschock
Member

@Borda I'll leave this open until I've added a PR with a proper error message/warning.
