
loss=None and no logs when automatic_optimization=False #4204

Closed
denadai2 opened this issue Oct 17, 2020 · 13 comments · Fixed by #4476
Labels
bug (Something isn't working), docs (Documentation related), logger (Related to the Loggers)

Comments

@denadai2

🐛 Bug

I think there is a bug when automatic_optimization=False. The loss is None (https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L336), which means none of the checkpoint_callbacks can work. There is no way to set the loss.

I also note that in the documentation (https://pytorch-lightning.readthedocs.io/en/latest/optimizers.html#manual-optimization) the training_step does not return anything. However, if it does not return anything, none of the logs work because of: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681.

Expected behavior

There should be a way to set the loss, and the behaviour when training_step returns nothing should be clearly documented.
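
For reference, a minimal sketch of the pattern I would expect to work, assuming the 1.0.x manual-optimization API (self.optimizers(), self.manual_backward()) and the Trainer(automatic_optimization=False) flag; the model, loss and learning rate below are illustrative only:

import torch
import torch.nn.functional as F
import pytorch_lightning as pl

class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        # sketch only: assumes self.optimizers() returns the single configured optimizer
        opt = self.optimizers()
        x, y = batch
        loss = F.cross_entropy(self.layer(x), y)
        self.manual_backward(loss, opt)
        opt.step()
        opt.zero_grad()
        self.log('train_loss', loss, prog_bar=True)
        # returning the loss is the crux of this issue: without it the training
        # loop sees loss=None and the logging/checkpointing path is skipped
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

trainer = pl.Trainer(automatic_optimization=False, max_epochs=1)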

Environment

* CUDA:
        - GPU:
                - GeForce RTX 2080 Ti
                - GeForce RTX 2080 Ti
        - available:         True
        - version:           10.2
* Packages:
        - numpy:             1.19.1
        - pyTorch_debug:     False
        - pyTorch_version:   1.6.0
        - pytorch-lightning: 1.0.2
        - tqdm:              4.48.2
* System:
        - OS:                Linux
        - architecture:
                - 64bit
                - ELF
        - processor:         x86_64
        - python:            3.6.9
        - version:           #26-Ubuntu SMP Mon Jun 24 09:32:08 UTC 2019
@denadai2 denadai2 added bug Something isn't working help wanted Open to be worked on labels Oct 17, 2020
@github-actions
Contributor

Hi! Thanks for your contribution, great first issue!

@edenlightning edenlightning added this to the 1.0.3 milestone Oct 19, 2020
@edenlightning edenlightning added priority: 0 High priority task checkpointing Related to checkpointing labels Oct 20, 2020
@edenlightning
Contributor

also #4295

@SeanNaren SeanNaren added the docs Documentation related label Nov 1, 2020
@SeanNaren
Contributor

Thanks @denadai2! I'll modify the doc example to report the loss value, in case you care about logging your loss values (which in most cases you do!)

@denadai2
Author

denadai2 commented Nov 2, 2020

@SeanNaren actually, it's not only about the doc. It's also that if the loss is nan, PyTorch Lightning skips writing ALL the logged variables because of: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681

@SeanNaren
Contributor

SeanNaren commented Nov 2, 2020

Thanks @denadai2, just to confirm the logic is as below:

You've overridden the training step and set automatic_optimization to False.
Your training_step function now logs metrics using self.log, but never returns a loss (you may have multiple losses, for example).

You would like metrics other than the loss to be logged during training, but because of this line:

https://github.com/PyTorchLightning/pytorch-lightning/blob/f40d08679d31ef6e705f1e0e5a66473c817325e1/pytorch_lightning/trainer/training_loop.py#L721

We never get to this part of the code:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f40d08679d31ef6e705f1e0e5a66473c817325e1/pytorch_lightning/trainer/training_loop.py#L725-L730

I think custom metrics logged in callbacks are the only thing that will not be logged for now (there is a major refactor coming which should fix this: #4439), and for now you'll need to log using self.log within the LightningModule functions. Let me know if this solves your issue!
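
As a rough schematic of the control flow described above (not the actual Lightning source), the early return is what drops the logged metrics:

# Schematic only, not the real Lightning code: a None return from training_step
# short-circuits batch handling before the metric-collection branch, so metrics
# logged for that batch never reach the loggers/callbacks.
def process_training_step_output(step_output):
    if step_output is None:
        # corresponds to the early return at the first permalink above
        return None
    # corresponds to the metric handling at the second permalink above
    return {'loss': step_output}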

@SeanNaren SeanNaren reopened this Nov 2, 2020
@edenlightning edenlightning added logger Related to the Loggers and removed checkpointing Related to checkpointing labels Nov 3, 2020
@edenlightning
Contributor

Currently blocked on #4495

@edenlightning edenlightning added the bug Something isn't working label Nov 3, 2020
@asalimih

asalimih commented Nov 4, 2020

Hi,
I'm not sure whether my problem is related to this bug, but it seems so. I wanted to call backward and step in the middle of training_step, so I set automatic_optimization to False. I'm logging in training_step and validation_step as follows:

def training_step(self, batch, batch_idx):
    ...
    self.log('train_loss', loss.view(1,).item(), prog_bar=True)
    self.log('train_ci', train_cindex, prog_bar=True)
    return loss

def validation_step(self, batch, batch_idx):
    ...
    self.log('val_loss', eval_loss.view(1,).item(), prog_bar=True)
    self.log('val_ci', eval_cindex, prog_bar=True)
    return eval_loss

It works quite well until it reaches a specific epoch and throws this error (the batch size and the dataset size are the same):

Epoch 49:  50%|██████████████████                  | 1/2 [00:00<00:00,  2.34it/s, loss=nan, v_num=54, train_loss=3.51, train_ci=0.938, val_loss=3.58, val_ci=0.689]Traceback (most recent call last):
  File "trainer_main.py", line 45, in <module>
    trainer.fit(model, tcga_dm)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 440, in fit
    results = self.accelerator_backend.train()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu_accelerator.py", line 54, in train
    results = self.train_or_test()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 68, in train_or_test
    results = self.trainer.train()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 485, in train
    self.train_loop.run_training_epoch()
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 565, in run_training_epoch
    self.trainer.logger_connector.log_train_step_metrics(batch_output)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 536, in log_train_step_metrics
    self.log_metrics(metrics, grad_norm_dic)
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector.py", line 79, in log_metrics
    metrics.update(grad_norm_dic)
TypeError: 'NoneType' object is not iterable
Exception ignored in: <function tqdm.__del__ at 0x7f34fe50eca0>
Traceback (most recent call last):
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1122, in __del__
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1335, in close
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1514, in display
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1125, in __repr__
  File "/home/asalimi/.conda/envs/mytorch/lib/python3.8/site-packages/tqdm/std.py", line 1475, in format_dict
TypeError: cannot unpack non-iterable NoneType object

I checked that loss and eval_loss are not None, and they aren't. It seems the error happens right after the return loss.
Also, I think documentation is needed on how to log and return values in training_step and validation_step; this was a bit confusing for me.

@asalimih

asalimih commented Nov 4, 2020

After some debugging I got rid of the error by changing training_step to this:

def training_step(self, batch, batch_idx):
    ...
    self.log('loss', loss.view(1,).item(), prog_bar=True, logger=True)
    self.log('train_ci', train_cindex, prog_bar=True, logger=True)

However, even though I set logger=True, loss and train_ci aren't logged to TensorBoard, and I also cannot see them in the progress bar.
@SeanNaren #4439 didn't solve this problem.

@denadai2
Author

denadai2 commented Nov 7, 2020

@asalimih try returning the loss at the end of training_step. This temporarily works around the bug I pointed out.
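
A sketch of that workaround applied to the earlier snippet (the loss/train_cindex computation stays elided, as in the original):

def training_step(self, batch, batch_idx):
    ...  # compute loss and train_cindex as before
    self.log('train_loss', loss.view(1,).item(), prog_bar=True, logger=True)
    self.log('train_ci', train_cindex, prog_bar=True, logger=True)
    return loss  # returning the loss keeps the logging path alive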

@asalimih

asalimih commented Nov 7, 2020

@asalimih try returning the loss at the end of training_step. This temporarily works around the bug I pointed out.

@denadai2 the first error I described here occurred with the return loss line present. After I removed that line it no longer threw an error, but now the metrics aren't logged.

@SeanNaren
Contributor

We're currently doing a deep dive into automatic_optimization=False behaviour after a number of bugs appeared in various edge cases. Please have a look at #4485

The logging changes will hopefully make logs a little clearer, but in terms of actual functionality for automatic_optimization we're in the process of debugging.

@asalimih could you reproduce the bug with this? https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py

@denadai2 just finishing the final logging refactor here: #4552

Once we figure out some of the functionality issues with automatic_optimization, I'll circle back here.

@edenlightning
Contributor

@tchaton

@tchaton tchaton self-assigned this Nov 10, 2020
@edenlightning edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 10, 2020
@Borda Borda modified the milestones: 1.0.7, 1.0.x Nov 11, 2020
@edenlightning edenlightning modified the milestones: 1.0.x, 1.0.7 Nov 13, 2020
@tchaton tchaton closed this as completed Nov 13, 2020
@Borda Borda modified the milestones: 1.0.7, 1.0.x Nov 13, 2020