Metric for robustness in real-world scenarios #199

Closed
alexriedel1 opened this issue Apr 5, 2022 · 9 comments
Labels: Enhancement (new feature or request), Metrics (metric component)

Comments

@alexriedel1
Contributor

alexriedel1 commented Apr 5, 2022

Hey all,
in real-world scenarios, a useful metric for a classifier would be the anomaly-score distance between all normal and anomalous predictions in a test set. This would help to choose classifiers that are more confident in their predictions.

Do you know if a metric exists that implements this for two-class classification problems?
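
For illustration, a minimal sketch of that score-distance idea (the helper name and the choice of the difference of means are assumptions for this sketch, not an existing torchmetrics API):

import torch

def score_separation(preds: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: distance between the mean anomaly score of the
    anomalous (target == 1) and the normal (target == 0) test samples."""
    return preds[target == 1].mean() - preds[target == 0].mean()

preds = torch.tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
target = torch.tensor([0, 0, 0, 1, 1, 1, 1])
print(score_separation(preds, target))  # tensor(0.7000)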

@alexriedel1
Contributor Author

That's actually the mean squared error of the predicted probabilities against the binary labels, also referred to as the Brier score, in this scenario 🙄

@samet-akcay
Contributor

@alexriedel1, do you think this is something we should be adding to the repo? If yes, would you be able to provide more detail regarding the use-case?

@samet-akcay added the Metrics and Enhancement labels on Apr 8, 2022
@alexriedel1
Contributor Author

Let me elaborate a bit more. Neither AUROC nor F1 captures prediction confidence. Now that we have reached near-total recall on datasets like MVTec, it is more important to be confident about predictions than merely to assign the right category. Especially in real-world scenarios, where an anomaly detection pipeline is set up for daily use, it's important that minor variances don't lead to wrong predictions (which can happen if a predictor is not very confident about its predictions).

The Brier score is essentially the MSE of the predicted probabilities with respect to the binary labels.

The following code illustrates the issue.

import torch
from torch import Tensor
from torchmetrics import Metric, F1, ROC
from torchmetrics.functional import auc

class AUROC(ROC):
    """Area under the ROC curve."""

    def compute(self) -> Tensor:
        """First compute ROC curve, then compute area under the curve.
        Returns:
            Value of the AUROC metric
        """
        tpr: Tensor
        fpr: Tensor

        fpr, tpr, _thresholds = super().compute()
        # TODO: use stable sort after upgrading to pytorch 1.9.x (https://github.com/openvinotoolkit/anomalib/issues/92)
        if not (torch.all(fpr.diff() <= 0) or torch.all(fpr.diff() >= 0)):
            return auc(fpr, tpr, reorder=True)  # only reorder if fpr is not increasing or decreasing
        return auc(fpr, tpr)

class Brier(Metric):
    """Brier Score."""

    def __init__(self, dist_sync_on_step=False):
        super().__init__(dist_sync_on_step=dist_sync_on_step)

        self.add_state("preds", default=[])
        self.add_state("target", default=[])

    def update(self, preds: torch.Tensor, target: torch.Tensor):
        """Pushes all predicitions to list."""
        assert preds.shape == target.shape
        self.preds.append(preds)
        self.target.append(target)

    def compute(self):
        """Compute the Brier Score (MSE).
        Returns:
            Value of the Brier Metric
        """
        preds = torch.cat(self.preds, dim=0).cpu()
        target = torch.cat(self.target, dim=0).cpu()
        y_true = (target == 1).long()
        return torch.mean((y_true - preds) ** 2)

brier = Brier()
f1 = F1()
auroc = AUROC()

preds = torch.Tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
true = torch.Tensor([0, 0, 0, 1, 1, 1, 1]).int()

brier.update(preds, true)
f1.update(preds, true)
auroc.update(preds, true)

print("GT", true)
print("High Confidence in prediction")
print("Preds", preds)
print(f"Brier Loss: {brier.compute():.3f}")
print(f"F1: {f1.compute()}")
print(f"AUROC: {auroc.compute()}")

preds = torch.Tensor([0.49, 0.49, 0.49, 0.51, 0.51, 0.51, 0.51])

brier.reset()
f1.reset()
auroc.reset()

brier.update(preds, true)
f1.update(preds, true)
auroc.update(preds, true)

print("Low Confidence in prediction")
print("Preds", preds)
print(f"Brier Loss: {brier.compute():.3f}")
print(f"F1: {f1.compute()}")
print(f"AUROC: {auroc.compute()}")

Output:

GT tensor([0, 0, 0, 1, 1, 1, 1], dtype=torch.int32)
High Confidence in prediction
Preds tensor([0.2000, 0.2000, 0.2000, 0.9000, 0.9000, 0.9000, 0.9000])
Brier Loss: 0.023
F1: 1.0
AUROC: 1.0
Low Confidence in prediction
Preds tensor([0.4900, 0.4900, 0.4900, 0.5100, 0.5100, 0.5100, 0.5100])
Brier Loss: 0.240
F1: 1.0
AUROC: 1.0
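
Both sets of predictions score a perfect F1 and AUROC, but only the Brier score separates the confident predictions (0.023) from the borderline ones (0.240).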

@samet-akcay
Contributor

@alexriedel1, thanks for clarifying. This would indeed be useful in certain use cases. We could definitely add this metric to the repo.

Our priority at the moment, however, is to make the metrics optional for the user. Metrics are currently hardcoded and computed for every run. Ideally, the user should be able to choose which metrics they want to see. They could of course choose to see all of them, as is the case now, but that is not very efficient in terms of memory use.

@alexriedel1
Contributor Author

> Our priority at the moment, however, is to make the metrics optional for the user. Metrics are currently hardcoded and computed for every run. Ideally, the user should be able to choose which metrics they want to see. They could of course choose to see all of them, as is the case now, but that is not very efficient in terms of memory use.

Should I write a PR addressing this?

@samet-akcay
Contributor

Thanks @alexriedel1. @djdameln is already working on a design for this. Meanwhile you could maybe add the Brier score implementation if that's alright with you?

@alexriedel1
Contributor Author

alexriedel1 commented Apr 12, 2022

@samet-akcay @djdameln
it would be great if that implementation worked in a way where torchmetrics metrics can be defined in config.yaml and added to the pipeline accordingly. This would be an easy way to add custom metrics that are already part of torchmetrics (like MSE).

In pseudocode it would look like this:

config.yaml

model:
  metrics:
    - F1
    - MeanSquaredError

anomaly_module.py

import torchmetrics
from torchmetrics import MetricCollection

# Instantiate each metric listed in the config by its torchmetrics class name
img_metrics = [getattr(torchmetrics, metric_name)() for metric_name in self.hparams.model.metrics]
self.image_metrics = MetricCollection(img_metrics, prefix="image_").cpu()

I hope this makes my intention a bit clearer.
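
A self-contained sketch of the same idea outside the anomaly module, assuming the metric class names match the installed torchmetrics version (F1 existed under that name in early-2022 releases; newer releases use F1Score):

import torch
import torchmetrics
from torchmetrics import MetricCollection

def metrics_from_names(metric_names):
    """Instantiate torchmetrics classes by name and bundle them into a collection."""
    return MetricCollection([getattr(torchmetrics, name)() for name in metric_names], prefix="image_")

# Names as they would come from config.yaml
image_metrics = metrics_from_names(["F1", "MeanSquaredError"])

preds = torch.tensor([0.2, 0.2, 0.2, 0.9, 0.9, 0.9, 0.9])
target = torch.tensor([0, 0, 0, 1, 1, 1, 1])
print(image_metrics(preds, target))  # e.g. {'image_F1': ..., 'image_MeanSquaredError': ...}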

@djdameln mentioned this issue on Apr 13, 2022
@samet-akcay
Contributor

@alexriedel1, thanks, that's clear. You could perhaps have a look at the WIP PR #230.

@djdameln
Contributor

Solved with #230. The desired metric can be included by adding MeanSquaredError to the list of metrics.
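
For reference, following the config layout sketched earlier in this thread (the exact key names in the merged config may differ), that would look something like:

model:
  metrics:
    - F1
    - MeanSquaredError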
