After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

fellnerse · 2020-04-16T08:01:01Z

🐛 Bug

After I updated from a quite old lightning version to the newest one, I sometimes get a TypeError from merge_dicts. I guess it's related to this MR #1278 . This Type error is deterministic, meaning it always occurs at the same global step during training. It somehow seems to be related to val_check_interval as well. For some data changing this value leads to no Error. But for other datasets this does not work. Also this only happens during training step, I suspect the training step after validating.

To Reproduce

Steps to reproduce the behavior:

I have no Idea.

File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 363, in train
    self.run_training_epoch()
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/training_loop.py", line 470, in run_training_epoch
    self.log_metrics(batch_step_metrics, grad_norm_dic)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/trainer/logging.py", line 74, in log_metrics
    self.logger.agg_and_log_metrics(scalar_metrics, step=step)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 128, in agg_and_log_metrics
    agg_step, metrics_to_log = self._aggregate_metrics(metrics=metrics, step=step)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 101, in _aggregate_metrics
    agg_step, agg_mets = self._finalize_agg_metrics()
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 116, in _finalize_agg_metrics
    agg_mets = merge_dicts(self._metrics_to_agg, self._agg_key_funcs, self._agg_default_func)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/pytorch_lightning/loggers/base.py", line 347, in merge_dicts
    agg_val = fn([v for v in [d_in.get(k) for d_in in dicts] if v is not None])
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 3118, in mean
    out=out, **kwargs)
  File "/home/sebastian/.cache/pypoetry/virtualenvs/forgerydetection-iC5ox0X1-py3.7/lib/python3.7/site-packages/numpy/core/_methods.py", line 75, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Sometimes its also 'dict' and 'int'

Expected behavior

At least should not break training, but maybe a more verbose message what is wrong. Its quite hard for me to debug, as the structure of the logs I'm returning to lightning does not change.

Environment

cuda:
        GPU:
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
                GeForce RTX 2080 Ti
        available:           True
        version:             10.1.243
packages:
        numpy:               1.16.4
        pyTorch_debug:       False
        pyTorch_version:     1.3.0
        pytorch-lightning:   0.7.3
        tensorboard:         2.2.0
        tqdm:                4.45.0
system:
        OS:                  Linux
        architecture:
                64bit
                ELF
        processor:           x86_64
        python:              3.7.7
        version:             #97~16.04.1-Ubuntu SMP Wed Apr 1 03:03:31 UTC 2020

Additional context

Also for some reason some runs have an issue with multiprocessing, but it does not break the training:

Traceback (most recent call last):████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:00<00:00,  8.76it/s]
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 277, in _run_finalizers
    finalizer()
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 201, in __call__
    res = self._callback(*self._args, **self._kwargs)
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/multiprocessing/util.py", line 110, in _remove_temp_dir
    rmtree(tempdir)
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/shutil.py", line 498, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/sebastian/.pyenv/versions/3.7.7/lib/python3.7/shutil.py", line 496, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/tmp/pymp-jcqai2xr'

The text was updated successfully, but these errors were encountered:

olineumann · 2020-04-16T15:45:31Z

Did you passed any 'agg_key_funcs' to the logger class? If I understand the code correctly, by default np.mean is used to aggregate the dict values returned during training. Maybe numpy tries in the mean function to add (+ func) values which can't be summed up?

Can you maybe post the code snippets where you return the metrics to log in the lightning module and the initialization of the logger if you use one? If you don't use a logger, you can disable it by passing logger=False to the trainer (don't know if your previous version had logger on by default).

Hope I can help :)

fellnerse · 2020-04-16T16:43:42Z

Thanks for the quick reply!
No I'm not using any 'agg_key_funcs' that I know of.

If I understand the code correctly, by default np.mean is used to aggregate the dict values returned during training.

This only happens when there is a step in time where two times stuff is logged, right? So my guess is that at some point that is the case that two logs have to be "unified" but this fails, because I'm using "dict in dicts". I need this tho, because I want to have i.e. loss train and val in the same graph.

I'm using the TestTubeLogger:
logger = TestTubeLogger(save_dir=log_dir, name=name, description=description)
and just pass this to the Trainer.

The metric logging to lightning is a bit scattered:

train_step in model:

       x, target = batch
       pred = self.forward(x)
       loss = self.loss(pred, target)
       lightning_log = {"loss": loss}

       with torch.no_grad():
           train_acc = self.calculate_accuracy(pred, target)
           tensorboard_log = {"loss": loss, "acc": train_acc}

       return tensorboard_log, lightning_log

this is passed to a function that lets me add train and val to same graph:

    def _construct_lightning_log(
        self,
        tensorboard_log: dict,
        lightning_log: dict = None,
        suffix: str = "train",
        prefix: str = "metrics",
    ):
        lightning_log = lightning_log or {}
        fixed_log = {}

        for metric, value in tensorboard_log.items():
            if isinstance(value, dict):
                fixed_log[f"{prefix}/{metric}"] = value
            else:
                fixed_log[f"{prefix}/{metric}"] = {suffix: value}
        return {"log": fixed_log, **lightning_log}

olineumann · 2020-04-16T20:39:29Z

Do you pass it after training_step or training_epoch_end? I think lightning collects your logs and tries to aggregate it to one value. I can't test it now. Maybe tomorrow.

But when I quickly type this into python interpreter:

>>> d={}
>>> np.mean([d,d])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<__array_function__ internals>", line 5, in mean
  File "/usr/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 3334, in mean
    return _methods._mean(a, axis=axis, dtype=dtype,
  File "/usr/lib/python3.8/site-packages/numpy/core/_methods.py", line 151, in _mean
    ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: unsupported operand type(s) for +: 'dict' and 'dict'

Seems like getting your error.

Maybe print what you exactly return and when it crashes. When I have time tomorrow, I will also make some tests.

fellnerse · 2020-04-17T11:01:44Z

After training_step. I not have a training_epoch_end or training_end method defined.

I think lightning collects your logs and tries to aggregate it to one value.

Yes I think so as well.

Ok I return something like this:

{'metrics/aud_std': {'test': tensor(1.6337, device='cuda:0')}, 'metrics/class_loss_diff': {'test': tensor(nan)}, 'metrics/class_loss_val': {'0': tensor(nan), '1': tensor(91.5485)}, 'metrics/loss': {'test': tensor(45.7742, device='cuda:0')}, 'metrics/vid_std': {'test': tensor(1.6506, device='cuda:0')}}

What do you mean by when it crashes exactly? I think when it crashes it's always the train step after an validation step (keep in mind I'm validation several times during one epoch). If I change the val_check_interval the error either disappears or happens at a different batch number.

alexeykarnachev · 2020-04-17T19:48:25Z

Hello.
I think the problem is in your metrics type. Metrics must have the Dict[str, float] type. But in your case, the metrics is a nested dict. So, that's why values are failed to be aggregated.

Is it possible for you to flatten the dictionary?

fellnerse · 2020-04-20T06:56:20Z

@alexeykarnachev Hey! Ah yes that's what I thought. Do you know why the metrics dict is enforced to be of this type? In 0.5.x this was not an issue as far as I know.

I mean, yes I can flatten it but I want to have i.e. val/loss and train/loss in the same graph. It's basically this: https://pytorch.org/docs/stable/tensorboard.html#torch.utils.tensorboard.writer.SummaryWriter.add_scalars

I know that here #1144 (comment) It was said that this should not be done, but for me this is essential.

Is there a way that I can overwrite the merge_dicts function? If so how would I do that?

alexeykarnachev · 2020-04-20T12:12:03Z

@fellnerse Okay, I got your point, let's ask Borda's advice)
@Borda, what do you think? Is it possible to combine nested metrics dictionaries with metrics aggregation logic? At first sight, it doesn't look like a big problem. Maybe you can see any side effects of tracking aggregated metrics with nested dictionaries? If no, I can try to fix this issue

Borda · 2020-04-20T14:37:49Z

I ques it can be used, just need to care about the depth and the aggregation will be a bit complicated...

fellnerse · 2020-04-24T15:26:45Z

Cool, thanks for implementing this so fast!

fellnerse added bug Something isn't working help wanted Open to be worked on labels Apr 16, 2020

Borda added this to the 0.7.4 milestone Apr 16, 2020

alexeykarnachev mentioned this issue Apr 23, 2020

Nested metrics dictionaries now can be passed to the loggers #1582

Merged

5 tasks

williamFalcon closed this as completed in #1582 Apr 23, 2020

Borda modified the milestones: 0.7.4, v0.7.x Apr 18, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

fellnerse commented Apr 16, 2020 •

edited by Borda

Loading

olineumann commented Apr 16, 2020

fellnerse commented Apr 16, 2020 •

edited

Loading

olineumann commented Apr 16, 2020

fellnerse commented Apr 17, 2020

alexeykarnachev commented Apr 17, 2020 •

edited

Loading

fellnerse commented Apr 20, 2020

alexeykarnachev commented Apr 20, 2020 •

edited

Loading

Borda commented Apr 20, 2020

fellnerse commented Apr 24, 2020

After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

After update from 0.5.x to 0.7.3 merge_dicts #1278 sometimes breaks training #1507

Comments

fellnerse commented Apr 16, 2020 • edited by Borda Loading

🐛 Bug

To Reproduce

Expected behavior

Environment

Additional context

olineumann commented Apr 16, 2020

fellnerse commented Apr 16, 2020 • edited Loading

olineumann commented Apr 16, 2020

fellnerse commented Apr 17, 2020

alexeykarnachev commented Apr 17, 2020 • edited Loading

fellnerse commented Apr 20, 2020

alexeykarnachev commented Apr 20, 2020 • edited Loading

Borda commented Apr 20, 2020

fellnerse commented Apr 24, 2020

fellnerse commented Apr 16, 2020 •

edited by Borda

Loading

fellnerse commented Apr 16, 2020 •

edited

Loading

alexeykarnachev commented Apr 17, 2020 •

edited

Loading

alexeykarnachev commented Apr 20, 2020 •

edited

Loading