After update from 0.5.x to 0.7.3, merge_dicts (#1278) sometimes breaks training #1507
Comments
Did you pass any 'agg_key_funcs' to the logger class? If I understand the code correctly, np.mean is used by default to aggregate the dict values returned during training. Maybe numpy's mean tries to add (+) values that can't be summed? Could you post the code snippets where you return the metrics to log in the lightning module, and the initialization of the logger if you use one? If you don't use a logger, you can disable it by passing logger=False to the trainer (I don't know if your previous version had the logger on by default). Hope I can help :)
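For example, a quick way to rule the logger out (just a sketch; np.mean as the default aggregation is my reading of the code):

```python
import numpy as np
from pytorch_lightning import Trainer

# np.mean works for plain numbers ...
np.mean([0.5, 0.7])                              # ~0.6

# ... but not for values that cannot be added together, e.g. nested dicts:
# np.mean([{"train": 0.5}, {"train": 0.7}])      # -> TypeError

# Disable logging entirely to check whether the logger is the culprit:
trainer = Trainer(logger=False)
```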
Thanks for the quick reply!
This only happens when two things are logged at the same step, right? So my guess is that at some point two logs have to be "unified", and this fails because I'm using dicts inside dicts. I need this though, because I want to have e.g. train and val loss in the same graph. I'm using the TestTubeLogger. The metric logging to lightning is a bit scattered:
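The logger setup is roughly this (paths and names are placeholders, not my exact config):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TestTubeLogger

# Placeholder save_dir and experiment name
logger = TestTubeLogger(save_dir="logs", name="my_experiment")
trainer = Trainer(logger=logger)
```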
Do you return it from training_step or training_epoch_end? I think lightning collects your logs and tries to aggregate them into one value. I can't test it now, maybe tomorrow. But when I quickly type this into the python interpreter:
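Something along these lines (assuming the aggregation boils down to np.mean over the returned dict values):

```python
import numpy as np

# If the logged values are themselves dicts, numpy cannot sum them:
np.mean([{"loss": 0.5}, {"loss": 0.7}])
# TypeError: unsupported operand type(s) for +: 'dict' and 'dict'
```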
I seem to get your error. Maybe print what exactly you return and where it crashes. When I have time tomorrow, I will also run some tests.
After training_step. I don't have a training_epoch_end or training_end method defined.
Yes, I think so as well. OK, I return something like this:
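Roughly like this (the keys and the loss/metric calls are placeholders for my real ones):

```python
def training_step(self, batch, batch_idx):
    loss = self.compute_loss(batch)      # placeholder for the real loss
    acc = self.compute_accuracy(batch)   # placeholder metric
    # Nested dicts so that train and val curves end up in the same graph,
    # similar to SummaryWriter.add_scalars:
    return {
        "loss": loss,
        "log": {
            "loss": {"train": loss},
            "accuracy": {"train": acc},
        },
    }
```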
What do you mean by "where it crashes exactly"? I think when it crashes it is always the training step right after a validation step (keep in mind I'm validating several times during one epoch). If I change the val_check_interval, the error either disappears or happens at a different batch number.
Hello. Is it possible for you to flatten the dictionary?
@alexeykarnachev Hey! Ah yes, that's what I thought. Do you know why the metrics dict is enforced to be of this type? In 0.5.x this was not an issue as far as I know. I mean, yes, I can flatten it, but I want to have e.g. val/loss and train/loss in the same graph. It's basically this: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalars I know that in #1144 (comment) it was said that this should not be done, but for me this is essential. Is there a way that I can override the merge_dicts function? If so, how would I do that?
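Something like this is what I have in mind (just a sketch, assuming merge_dicts is the module-level function in pytorch_lightning.loggers.base that #1278 added; please correct me if it lives elsewhere):

```python
import pytorch_lightning.loggers.base as pl_logger_base

_original_merge_dicts = pl_logger_base.merge_dicts

def patched_merge_dicts(*args, **kwargs):
    # Fall back to the stock implementation for flat metrics and
    # at least print the offending input when it fails on nested dicts.
    try:
        return _original_merge_dicts(*args, **kwargs)
    except TypeError:
        print("merge_dicts failed on:", args)
        raise

pl_logger_base.merge_dicts = patched_merge_dicts
```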
@fellnerse Okay, I got your point, let's ask for Borda's advice)
I guess it can be used, we just need to take care of the depth, and the aggregation will be a bit more complicated...
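For example (just a sketch, not what will necessarily land in the package):

```python
import numpy as np

def merge_nested_dicts(dicts, default_func=np.mean):
    """Recursively aggregate a list of (possibly nested) metric dicts."""
    keys = {k for d in dicts for k in d}
    merged = {}
    for key in keys:
        values = [d[key] for d in dicts if key in d]
        if all(isinstance(v, dict) for v in values):
            # go one level deeper instead of handing dicts to np.mean
            merged[key] = merge_nested_dicts(values, default_func)
        else:
            merged[key] = default_func(values)
    return merged

# merge_nested_dicts([{"loss": {"train": 0.5}}, {"loss": {"train": 0.7}}])
# -> {"loss": {"train": 0.6}} (up to float rounding)
```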
Cool, thanks for implementing this so fast! |
🐛 Bug
After I updated from a quite old lightning version to the newest one, I sometimes get a TypeError from merge_dicts. I guess it's related to PR #1278. This TypeError is deterministic, meaning it always occurs at the same global step during training. It also seems to be related to val_check_interval: for some data, changing this value makes the error go away, but for other datasets it does not. It also only happens during the training step, and I suspect it's the training step right after validation.
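For reference, the validation frequency is set via the Trainer (the value here is only an example, not my exact one):

```python
from pytorch_lightning import Trainer

# A float means "validate every fraction of an epoch",
# an int means "validate every N training batches".
trainer = Trainer(val_check_interval=0.25)
```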
To Reproduce
Steps to reproduce the behavior:
I have no idea.
Sometimes the operand types in the error are 'dict' and 'int' instead.
Expected behavior
At the very least this should not break training, and a more verbose message about what is wrong would help. It's quite hard for me to debug, since the structure of the logs I'm returning to lightning does not change.
Environment
Additional context
Also for some reason some runs have an issue with multiprocessing, but it does not break the training: