loss=None and no logs when automatic_optimization=False #4204
Comments
Hi! Thanks for your contribution! Great first issue!
also #4295
Thanks @denadai2! I'll modify the doc example to report the loss value, in case you care about logging your loss values (which in most cases is yes!).
@SeanNaren actually, it's not only about the doc. It's also that when the loss is None, PyTorch Lightning skips writing ALL the logged variables because of: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681
To be clearer, this step is skipped: https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L687
Thanks @denadai2, just to confirm the logic is as below: you've overridden the training step and set automatic_optimization to false, and you would like metrics other than the loss to be logged in training, but because of the training_loop.py check linked above we never get to the part of the code that writes the logs. I think custom metrics logged in callbacks are the only thing that will not be logged for now (there is a major refactor coming which should fix this: #4439), and for now you'll need to log using
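For reference, a minimal sketch of the workaround being described, logging explicitly from training_step and returning the loss, is shown below. It assumes the PyTorch Lightning 1.0-era manual-optimization API (Trainer(automatic_optimization=False), self.optimizers(), self.manual_backward) and uses self.log; the model, data, and metric names are illustrative placeholders rather than code from this issue.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class ManualOptModel(pl.LightningModule):
    """Toy module used only to illustrate the workaround."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()                  # optimizer from configure_optimizers
        loss = F.cross_entropy(self.layer(x), y)

        self.manual_backward(loss, opt)          # replaces loss.backward()
        opt.step()
        opt.zero_grad()

        # Log explicitly from the step instead of relying on callbacks.
        self.log("train_loss", loss, on_step=True, prog_bar=True)

        # Returning the loss keeps the batch output from being None,
        # so the logging/checkpoint path linked above is not skipped.
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```

In that era of the API the switch lived on the Trainer, e.g. trainer = pl.Trainer(automatic_optimization=False); later releases moved automatic_optimization onto the LightningModule.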
Currently blocked on #4495
Hi
It works quite well until it reaches a specific epoch and throws this error (the batch size and the dataset size are the same):
I checked the
After some debugging I solved the error by changing the training_step to this:
However, now although I set the logger to True, the
@asalimih try returning the loss at the end of training_step. This temporarily solves the bug I pointed out.
We're currently in a deep dive into automatic_optimization=False behaviour after a lot of different bugs have appeared in different edge-case situations. Please have a look at #4485. The logging changes will hopefully make the logs a little clearer, but in terms of actual functionality for automatic_optimization we're in the process of debugging. @asalimih could you reproduce the bug with this? https://github.com/PyTorchLightning/pytorch-lightning/blob/master/pl_examples/bug_report_model.py @denadai2 I'm just finishing the final logging refactor here: #4552. Once we figure out some of the functionality issues with automatic_optimization, I'll circle back here.
🐛 Bug
I think there is a bug when automatic_optimization=False. The loss is None (https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L336), and this means that none of the checkpoint callbacks can work. There is no way to set the loss.
I would also add that in the documentation (https://pytorch-lightning.readthedocs.io/en/latest/optimizers.html#manual-optimization) the training_step does not return anything. However, if it does not return anything, none of the logs work because of https://github.com/PyTorchLightning/pytorch-lightning/blob/72f19768c828b734d8565ffef7b78fb9a57ba847/pytorch_lightning/trainer/training_loop.py#L681.
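For contrast, here is a sketch of the failure mode being reported (same 1.0-era API assumptions as the earlier sketch, with placeholder module details): a training_step that mirrors the documented manual-optimization example performs the backward and optimizer step but returns nothing, so the batch output seen by the trainer is None.

```python
import torch
import torch.nn.functional as F
import pytorch_lightning as pl


class DocStyleModel(pl.LightningModule):
    """Mirrors the documented manual-optimization example: no return, no logging."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()
        loss = F.cross_entropy(self.layer(x), y)
        self.manual_backward(loss, opt)
        opt.step()
        opt.zero_grad()
        # Nothing is returned: the trainer records None for this batch, so
        # loss stays None, checkpoint callbacks that monitor it cannot work,
        # and the logging step linked above is skipped.

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)
```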
Expected behavior
There should be a way to set the loss, and the behaviour when nothing is returned in training_step should be clear.
Environment