Metrics fail on DP and multiple GPU #4353
Comments
If you do metrics update in

@marrrcin could you provide a bit more detail? Like how and where do you explicitly call the metric? (example)

So according to a conversation with him on Slack, if `metric.forward()` is called in the `*_step_end` hooks, it works. This probably has to do with the fact that the output values from the different GPUs are already gathered onto one device by the time the `*_step_end` hooks run. Maybe @marrrcin can clarify when online, but that is what I understood.
@SkafteNicki

```python
METRICS_SUFFIX = "_metrics"
TRAINING = "train"
TESTING = "test"
VALIDATION = "val"

metrics_factory = lambda: nn.ModuleList(
    [
        pl.metrics.Accuracy(),
        pl.metrics.Precision(hparams.num_classes),
        pl.metrics.Recall(hparams.num_classes),
        pl.metrics.Fbeta(hparams.num_classes),
    ]
)
self.metrics: nn.ModuleDict = nn.ModuleDict(
    {
        # you cannot use the name `train` in ModuleDict, because nn.Module has a method called `train`
        TRAINING + METRICS_SUFFIX: metrics_factory(),
        VALIDATION + METRICS_SUFFIX: metrics_factory(),
        TESTING + METRICS_SUFFIX: metrics_factory(),
    }
)
```

Then, some utility functions:

```python
@staticmethod
def _get_metric_name(m: pl.metrics.Metric):
    return m.__class__.__name__.lower()

def update_metrics(
    self,
    step_name,
    pred_y: torch.Tensor,
    true_y: torch.Tensor,
    log_metrics=True,
):
    for metric in self.metrics[step_name + METRICS_SUFFIX]:
        m = metric(pred_y, true_y)
        if log_metrics:
            self.log(
                f"{self._get_metric_name(metric)}/{step_name}",
                m,
                on_step=False,
                on_epoch=True,
            )
```

Then, from `training_step_end`:

```python
def training_step_end(self, outputs: dict) -> torch.Tensor:
    loss = outputs["loss"].mean()
    self.update_metrics(TRAINING, outputs["pred_y"], outputs["true_y"])
    # etc...
```
@EspenHa I agree that the error is the same, but this happens long before trying to log the metrics. Actually, it has nothing to do with the rest of lightning; it is a general incompatibility between lightning metrics and DataParallel:
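The kind of minimal example being referred to might look like the sketch below — a metric held as a plain submodule of a model that is wrapped directly in `nn.DataParallel` on two GPUs. The model, tensor shapes, and device ids are illustrative assumptions, not the exact snippet from the thread:

```python
import torch
from torch import nn
import pytorch_lightning as pl


class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(32, 2)
        # the metric's accumulated state (e.g. `correct`, `total`) lives with this module
        self.acc = pl.metrics.Accuracy()

    def forward(self, x, y):
        logits = self.layer(x)
        # each DP replica runs this on its own GPU; the metric's accumulated
        # state does not follow along cleanly, which is what triggers the
        # cross-device error reported in this issue
        return self.acc(logits.argmax(dim=-1), y)


model = nn.DataParallel(ToyModel().to("cuda:0"), device_ids=[0, 1])
x = torch.randn(8, 32, device="cuda:0")
y = torch.randint(0, 2, (8,), device="cuda:0")
out = model(x, y)  # fails with a cross-device error in the metric's update
```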
Even this small example fails with the same error. The core of the problem seems to have something to do with how we register the state as a buffer. As far as I understand, if we use
Just a small update: That said, another problem with metrics in dp mode has come to my attention. Since dp creates and destroys replicas of the model on each forward call, the internal state of the metrics is destroyed before we have a chance to accumulate it over the different devices. Therefore, until we implement some kind of state maintenance in dp (PR: #1895), the only way forward right now (thanks @marrrcin for the workaround) is to update the metrics in the `*_step_end` hooks, where the outputs from all devices have already been gathered.
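For reference, a sketch (an assumption, not code from this thread) of a `training_step` that would pair with @marrrcin's `training_step_end` above; under dp the dict returned here is gathered onto the root device before `training_step_end` runs, which is why updating the metrics there sidesteps the cross-device problem:

```python
import torch.nn.functional as F


def training_step(self, batch, batch_idx):
    x, true_y = batch
    logits = self(x)  # runs independently on each GPU replica under dp
    loss = F.cross_entropy(logits, true_y)
    # everything returned here is gathered by Lightning before *_step_end,
    # so update_metrics() in training_step_end sees tensors on one device
    return {
        "loss": loss,
        "pred_y": logits.argmax(dim=-1),
        "true_y": true_y,
    }
```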
@teddykoker @ananyahjha93 please take a look as well

This is currently not supported until we have stateful DP, see #1895

@ananyahjha93 what should the next steps be? Should we change the docs?

@edenlightning we need to get the state maintenance for DP in before we tackle this. Even if it doesn't solve this completely, that PR has been lying around for a few months.

Really love PyTorch Lightning, but I'm still having this issue. @ananyahjha93 do you know what the fix is? Thanks!
I think the problem is initializing a
Instead, if I initialize the metrics explicitly, I don't get any error:
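A guess at what "initialize the metrics explicitly" could look like here (class and attribute names are assumptions): each metric created directly as an attribute of the LightningModule instead of going through a factory or `ModuleDict`:

```python
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # one metric object per phase, assigned as a direct attribute in __init__
        self.train_acc = pl.metrics.Accuracy()
        self.val_acc = pl.metrics.Accuracy()
        self.test_acc = pl.metrics.Accuracy()
```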
🐛 Bug
When using a metric such as Accuracy from `pytorch_lightning.metrics` on a machine with 4 GPUs in 'dp' mode, there is an error caused by the metric being accumulated across different devices. In the case of Accuracy, at this line: https://github.com/PyTorchLightning/pytorch-lightning/blob/c8ccec7a02c53ed38af6ef7193232426384eee4a/pytorch_lightning/metrics/classification/accuracy.py#L108 the arguments to torch.sum are on the device the metric is being called from, but `self.correct` is on a different one. The traceback is as follows:

Please reproduce using the BoringModel and post here
https://colab.research.google.com/drive/1zcU1ADuHZj82clrBysv-EGfgqG7SxUhN#scrollTo=V7ELesz1kVQo
To Reproduce
The shared colab is not going to be able to replicate the bug since it needs 'dp' on multiple GPUs, but it should give an idea of when the error occurs. So setting dp mode with multiple GPUs in the Trainer and then using a metric should bring up the issue. I have tested it with Accuracy, but other users in the Slack channel reported it for other metrics such as Precision or Recall.
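A sketch of the Trainer setting being referred to (the exact flag spelling depends on the Lightning version in use — `accelerator="dp"` on 1.0.x, `distributed_backend="dp"` on older releases; the model name and values here are assumptions):

```python
import pytorch_lightning as pl

model = BoringModelWithMetric()  # hypothetical module that updates a pl.metrics metric

# 'dp' with more than one GPU is what triggers the cross-device error
trainer = pl.Trainer(gpus=4, accelerator="dp", max_epochs=1)
trainer.fit(model)
```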
Expected behavior
The devices should be the same when the values are added together. I am not sure which would be the correct approach; I have "brutely" worked around it in the case of Accuracy, but that is quite an ugly way of dealing with it.
Update: Although this doesn't produce the error, the accuracy is not properly computed, as values get reset to 0 for some reason between steps.
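One plausible shape of that "brute" workaround (a guess, not necessarily the exact change made) is to coerce the freshly computed sum onto the device of the accumulated state before the in-place add; as the update above notes, this silences the device error but does not make the accumulated value correct:

```python
import torch

# standalone illustration of the coercion; in the real code this would happen
# inside Accuracy.update (see the line linked above)
correct = torch.tensor(0, device="cuda:0")          # accumulated metric state
preds = torch.tensor([1, 0, 1], device="cuda:1")     # batch living on another GPU
target = torch.tensor([1, 1, 1], device="cuda:1")

# correct += torch.sum(preds == target)              # -> cross-device RuntimeError
correct += torch.sum(preds == target).to(correct.device)  # error goes away
```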
Environment
- CUDA:
  - GPU:
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
    - GeForce GTX 1080 Ti
  - available: True
  - version: 10.2
- Packages:
  - numpy: 1.19.2
  - pyTorch_debug: False
  - pyTorch_version: 1.6.0
  - pytorch-lightning: 1.0.3
  - tqdm: 4.50.2
- System:
  - OS: Linux
  - architecture:
    - 64bit
    - ELF
  - processor:
  - python: 3.8.5
  - version: #1 SMP Debian 4.19.152-1 (2020-10-18)