Metric for robustness in real-world scenarios #199

Closed
alexriedel1 opened this issue Apr 5, 2022 · 9 comments
Labels: Enhancement (new feature or request), Metrics (metric component)

Comments

@alexriedel1
Contributor

alexriedel1 commented Apr 5, 2022

Hey all,
in real-world scenarios, a useful metric for a classifier would be the anomaly-score distance between all normal and anomalous predictions in a test set. This would help to choose classifiers that are more confident in their predictions.

Do you know if a metric exists that implements this for two-class classification problems?
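
For illustration, a minimal sketch of that score-distance idea (the helper name and the choice of the difference of means are assumptions for this sketch, not an existing torchmetrics API):

import torch

def score_separation(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: distance between the mean anomaly score of the
    anomalous (target == 1) and the normal (target == 0) test samples."""
    return preds[target == 1].mean() - preds[target == 0].mean()

preds = torch.tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
target = torch.tensor([0, 0, 0, 1, 1, 1, 1])
print(score_separation(preds, target))  # tensor(0.7000)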

@alexriedel1
Contributor Author

That's actually the mean squared error of the predicted probabilities against the binary labels, also referred to as the Brier score, in this scenario 🙄

@samet-akcay
Contributor

@alexriedel1, do you think this is something we should be adding to the repo? If yes, would you be able to provide more detail regarding the use-case?

@samet-akcay added the Metrics and Enhancement labels on Apr 8, 2022
@alexriedel1
Contributor Author

Let me elaborate a bit more. Neither AUROC nor F1 captures prediction confidence. Now that we have reached near-total recall on datasets like MVTec, it is more important to be confident about predictions than merely to assign the right category. Especially in real-world scenarios, where an anomaly detection pipeline is set up for daily use, it's important that minor variances don't lead to wrong predictions (which can happen if a predictor is not very confident about its predictions).

The Brier score is essentially the MSE of the predicted probabilities with respect to the binary labels.

The following code illustrates the issue.

import torch
from torch import Tensor
from torchmetrics import Metric, F1, ROC
from torchmetrics.functional import auc

class AUROC(ROC):
    """Area under the ROC curve."""

    def compute(self) -> Tensor:
        """First compute ROC curve, then compute area under the curve.
        Returns:
            Value of the AUROC metric
        """
        tpr: Tensor
        fpr: Tensor

        fpr, tpr, _thresholds = super().compute()
        # TODO: use stable sort after upgrading to pytorch 1.9.x (https://github.com/openvinotoolkit/anomalib/issues/92)
        if not (torch.all(fpr.diff() <= 0) or torch.all(fpr.diff() >= 0)):
            return auc(fpr, tpr, reorder=True)  # only reorder if fpr is not increasing or decreasing
        return auc(fpr, tpr)

class Brier(Metric):
    """Brier Score."""

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)

        self.add_state("preds", default=[])
        self.add_state("target", default=[])

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        """Pushes all predicitions to list."""
        assert preds.shape == target.shape
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        """Compute the Brier Score (MSE).
        Returns:
            Value of the Brier Metric
        """
        preds = torch.cat(self.preds, dim=0).cpu()
        target = torch.cat(self.target, dim=0).cpu()
        y_true = (target == 1).long()
        return torch.mean((y_true - preds) ** 2)

brier = Brier()
f1 = F1()
auroc = AUROC()

preds = torch.Tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
true = torch.Tensor([0, 0, 0, 1, 1, 1, 1]).int()

brier.update(preds, true)
f1.update(preds, true)
auroc.update(preds, true)

print("GT", true)
print("High Confidence in prediction")
print("Preds", preds)
print(f"Brier Loss: {brier.compute():.3f}")
print(f"F1: {f1.compute()}")
print(f"AUROC: {auroc.compute()}")

preds = torch.Tensor([0.49, 0.49, 0.49, 0.51, 0.51, 0.51, 0.51])

brier.reset()
f1.reset()
auroc.reset()

brier.update(preds, true)
f1.update(preds, true)
auroc.update(preds, true)

print("Low Confidence in prediction")
print("Preds", preds)
print(f"Brier Loss: {brier.compute():.3f}")
print(f"F1: {f1.compute()}")
print(f"AUROC: {auroc.compute()}")

Output:

GT tensor([0, 0, 0, 1, 1, 1, 1], dtype=torch.int32)
High Confidence in prediction
Preds tensor([0.2000, 0.2000, 0.2000, 0.9000, 0.9000, 0.9000, 0.9000])
Brier Loss: 0.023
F1: 1.0
AUROC: 1.0
Low Confidence in prediction
Preds tensor([0.4900, 0.4900, 0.4900, 0.5100, 0.5100, 0.5100, 0.5100])
Brier Loss: 0.240
F1: 1.0
AUROC: 1.0
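
Both sets of predictions score a perfect F1 and AUROC, but only the Brier score separates the confident predictions (0.023) from the borderline ones (0.240).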

@samet-akcay
Contributor

@alexriedel1, thanks for clarifying. This would indeed be useful in certain use cases. We could definitely add this metric to the repo.

Our priority at the moment, however, is to make the metrics optional for the user. Metrics are currently hardcoded and computed for every run. Ideally, the user should be able to choose which metrics they want to see. They could of course choose to see all of them, as is the case now, but that is not very efficient in terms of memory use.

@alexriedel1
Contributor Author

> Our priority at the moment, however, is to make the metrics optional for the user. Metrics are currently hardcoded and computed for every run. Ideally, the user should be able to choose which metrics they want to see. They could of course choose to see all of them, as is the case now, but that is not very efficient in terms of memory use.

Should I write a PR addressing this?

@samet-akcay
Contributor

Thanks @alexriedel1. @djdameln is already working on a design for this. Meanwhile you could maybe add the Brier score implementation if that's alright with you?

@alexriedel1
Contributor Author

alexriedel1 commented Apr 12, 2022

@samet-akcay @djdameln
it would be great if that implementation worked in a way where torchmetrics metrics can be defined in config.yaml and added to the pipeline accordingly. This would be an easy way to add custom metrics that are already part of torchmetrics (like MSE).

In pseudocode it would look like this:

config.yaml

model:
  metrics:
    - F1
    - MeanSquaredError

anomaly_module.py

import torchmetrics
from torchmetrics import MetricCollection

# Instantiate each metric listed in the config by its torchmetrics class name
img_metrics = [getattr(torchmetrics, metric_name)() for metric_name in self.hparams.model.metrics]
self.image_metrics = MetricCollection(img_metrics, prefix="image_").cpu()

I hope this makes my intention a bit clearer.
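
A self-contained sketch of the same idea outside the anomaly module, assuming the metric class names match the installed torchmetrics version (F1 existed under that name in early-2022 releases; newer releases use F1Score):

import torch
import torchmetrics
from torchmetrics import MetricCollection

def metrics_from_names(metric_names):
    """Instantiate torchmetrics classes by name and bundle them into a collection."""
    return MetricCollection([getattr(torchmetrics, name)() for name in metric_names], prefix="image_")

# Names as they would come from config.yaml
image_metrics = metrics_from_names(["F1", "MeanSquaredError"])

preds = torch.tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
target = torch.tensor([0, 0, 0, 1, 1, 1, 1])
print(image_metrics(preds, target))  # e.g. {'image_F1': ..., 'image_MeanSquaredError': ...}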

@djdameln mentioned this issue on Apr 13, 2022
@samet-akcay
Contributor

@alexriedel1, thanks, that's clear. You could perhaps have a look at the WIP PR #230.

@djdameln
Contributor

Solved with #230. The desired metric can be included by adding MeanSquaredError to the list of metrics.
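
For reference, following the config layout sketched earlier in this thread (the exact key names in the merged config may differ), that would look something like:

model:
  metrics:
    - F1
    - MeanSquaredError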
