Classification metrics consistency #655

Closed
mayeroa opened this issue Dec 5, 2021 · 5 comments · Fixed by #1195

@mayeroa
mayeroa commented Dec 5, 2021

🚀 Feature

First of all, thank you very much for this awesome project. It helps a lot when evaluating deep learning models with standard, out-of-the-box metrics across several application domains.

My request(s) will be more focused on classification metrics (also applied to semantic segmentation). Maybe, we will need to split that into several issues. Do not hesitate.

Motivation

I ran some tests to check whether all metrics that take the same inputs work as expected. I have multiple scenarios in mind, since semantic segmentation is a pixel-wise classification task:

  • Binary classification
  • Multi-class classification
  • Binary semantic segmentation
  • Multi-class semantic segmentation

While running those tests (shown below), I noticed some inconsistencies between the different metrics (shape of inputs, handling of the binary scenario, parameter names, parameter default values, etc.). The idea would be to further standardize the interface of the classification metrics.

Pitch

Let us go through the different scenarios:

Binary classification

import torch
from torchmetrics import *


# Inputs
num_classes = 2
logits = torch.randn((10, num_classes))
targets = torch.randint(0, num_classes, (10, ))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes),
    'f2': FBeta(num_classes=num_classes, beta=2),
    'hamming_distance': HammingDistance(),
    'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence()
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes),
    'stat_scores': StatScores(num_classes=num_classes)
})
metrics.update(probabilities, targets)
metrics.compute()

Binary classification behaves as expected, except for the KLDivergence metric, which fails with the following exception:

RuntimeError: Predictions and targets are expected to have the same shape

I have several questions regarding the binary scenario:

  • What is considered best practice? Setting num_classes to 2? Or setting it to 1, even though some metrics (ConfusionMatrix, IoU) then complain?
  • Is setting num_classes to 2 considered binary or multi-class mode? I imagine it depends on the metric.
  • Some metrics accept num_classes=None for binary mode, others do not. It is not really clear to me how to properly handle binary classification with all those metrics.

Multi-class classification

This mode seems to work as expected and to be consistent, since num_classes is clearly defined here. This raises the same question: is computing the metrics with 2 classes equivalent to binary mode? If not, how should that mode be handled properly?

The following code works properly (except for KLDivergence, which has the same issue as mentioned above).

import torch
from torchmetrics import *


# Inputs
num_classes = 5
logits = torch.randn((10, num_classes))
targets = torch.randint(0, num_classes, (10, ))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes),
    'f2': FBeta(num_classes=num_classes, beta=2),
    'hamming_distance': HammingDistance(),
    'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence(),
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes),
    'stat_scores': StatScores(num_classes=num_classes)
})
metrics.update(probabilities, targets)
metrics.compute()

Binary semantic segmentation

Since TorchMetrics supports extra dimensions for logits and targets, these classification metrics may also be used for semantic segmentation tasks. For that type of task, however, I ran into some issues.

Let us take an example of 10 images of 32x32 pixels.

import torch
from torchmetrics import *


# Inputs
num_classes = 2
logits = torch.randn((10, num_classes, 32, 32))
targets = torch.randint(0, num_classes, (10, 32, 32))
probabilities = torch.softmax(logits, dim=1)

# Compute classification metrics
metrics = MetricCollection({
    'acc': Accuracy(num_classes=num_classes),
    'average_precision': AveragePrecision(num_classes=num_classes),
    'auroc': AUROC(num_classes=num_classes),
    # 'binned_average_precision': BinnedAveragePrecision(num_classes=num_classes, thresholds=5),
    # 'binned_precision_recall_curve': BinnedPrecisionRecallCurve(num_classes=num_classes, thresholds=5),
    # 'binned_recall_at_fixed_precision': BinnedRecallAtFixedPrecision(num_classes=num_classes, min_precision=0.5, thresholds=5),
    'calibration_error': CalibrationError(),
    'cohen_kappa': CohenKappa(num_classes=num_classes),
    'confusion_matrix': ConfusionMatrix(num_classes=num_classes),
    'f1': F1(num_classes=num_classes, mdmc_average='global'),
    'f2': FBeta(num_classes=num_classes, beta=2, mdmc_average='global'),
    'hamming_distance': HammingDistance(),
    # 'hinge': Hinge(),
    'iou': IoU(num_classes=num_classes),
    # 'kl_divergence': KLDivergence(),
    'matthews_correlation_coef': MatthewsCorrcoef(num_classes=num_classes),
    'precision': Precision(num_classes=num_classes, mdmc_average='global'),
    'precision_recall_curve': PrecisionRecallCurve(num_classes=num_classes),
    'recall': Recall(num_classes=num_classes, mdmc_average='global'),
    'roc': ROC(num_classes=num_classes),
    'specificity': Specificity(num_classes=num_classes, mdmc_average='global'),
    'stat_scores': StatScores(num_classes=num_classes, mdmc_reduce='global')
})
metrics.update(probabilities, targets)
metrics.compute()

  1. First of all, BinnedAveragePrecision, BinnedPrecisionRecallCurve and BinnedRecallAtFixedPrecision fail with the following exception:

RuntimeError: The size of tensor a (2) must match the size of tensor b (32) at non-singleton dimension 2

     This happens even though the documentation (https://torchmetrics.readthedocs.io/en/latest/references/modules.html#binnedrecallatfixedprecision) states that forward should accept this format (logits of shape (N, C, ...) and targets of shape (N, ...)).

  2. Accuracy has mdmc_average set to global by default, while other metrics (Precision, FBeta, Specificity, etc.) default to None. This needs to be consistent: either everything defaults to None or everything defaults to global. I would argue that defaulting to global would be ideal, since users would then be able to use those metrics for classification or semantic segmentation tasks seamlessly.

  3. I noticed that StatScores uses an mdmc_reduce parameter, while the other metrics call it mdmc_average. I think the name of this parameter should be consistent. The easiest change would be to adopt mdmc_average everywhere (since only StatScores would need to be updated).

  4. Once again, how should we properly deal with binary data? For the example, I used a 2-class workaround, but is there a better way of handling binary semantic segmentation tasks?
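To make the mdmc_average point concrete, here is a minimal pure-Python sketch of the difference as I understand it: global pools the extra dimensions of all samples before computing the metric, while samplewise computes the metric per sample and then averages. The helper names (count_tp_fp, precision) and the toy data are made up for illustration and are not part of TorchMetrics.

```python
def count_tp_fp(preds, target, positive=1):
    """Count true positives and false positives for one flattened sample."""
    tp = sum(1 for p, t in zip(preds, target) if p == positive and t == positive)
    fp = sum(1 for p, t in zip(preds, target) if p == positive and t != positive)
    return tp, fp


def precision(tp, fp):
    """Precision, defined as 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Two tiny "images", each flattened to 4 pixels (hypothetical data).
preds = [[1, 1, 0, 0], [1, 0, 0, 0]]
target = [[1, 0, 0, 0], [1, 1, 1, 1]]

per_sample = [count_tp_fp(p, t) for p, t in zip(preds, target)]

# "samplewise" style: one precision per image, then the mean over images.
samplewise_precision = sum(precision(tp, fp) for tp, fp in per_sample) / len(per_sample)

# "global" style: pool the counts of all images, then one precision.
total_tp = sum(tp for tp, _ in per_sample)
total_fp = sum(fp for _, fp in per_sample)
global_precision = precision(total_tp, total_fp)

print(samplewise_precision, global_precision)
```

The two reductions give different numbers on the same data, which is why a silent difference in defaults between metrics can be surprising.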

Multi-class semantic segmentation

The scenario is the same as the previous one while setting num_classes to a higher value than 2.

Conclusion

This issue gathers several topics that may need to be treated separately but that share a common theme: API consistency.
We can start a discussion about it, but my main question, after this long text, is how to properly deal with binary classification and semantic segmentation.

  • Is using 2 classes a suitable workaround?
  • Or should we be compliant with the BCEWithLogitsLoss interface, for example, meaning that the probabilities and targets have the exact same shape (for instance (10, 32, 32) in my example)?
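For what it is worth, the two options describe the same model output. The softmax probability of the positive class in a 2-class setup equals the sigmoid of the difference between the two logits, so a 2-class prediction can always be reduced to a single probability map with the same shape as the targets. A minimal sketch in plain Python, with made-up logit values:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical logits for one pixel: [background, foreground].
two_class_logits = [0.3, 1.7]

# 2-class workaround: take the softmax probability of class 1.
p_positive_softmax = softmax(two_class_logits)[1]

# Single-output (BCE-style) view: sigmoid of the logit difference.
p_positive_sigmoid = sigmoid(two_class_logits[1] - two_class_logits[0])

assert math.isclose(p_positive_softmax, p_positive_sigmoid)
```

So the question is mostly about which input layout the metrics should expect, not about the information carried by the predictions.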

I would like to have your thoughts about that. :)
And sorry for the long issue, but I wanted to be as clear as possible by providing meaningful examples.

Alternatives

Document more clearly how each metric should be used for binary tasks.

Additional context

The main idea of all those requests is to ease the use of those metrics and standardize interfaces (shape of inputs, num_classes parameters, etc.). I understand that is a tough topic but it matters.

@mayeroa mayeroa added the enhancement New feature or request label Dec 5, 2021
@github-actions

github-actions bot commented Dec 5, 2021

Hi! thanks for your contribution!, great first issue!

@mayeroa
Author

mayeroa commented Dec 6, 2021

Additional question: if I am correct, TorchMetrics currently distinguishes those 4 scenarios based on this: https://github.com/PyTorchLightning/metrics/blob/6bcc8b0f7f52b370583aff4a9192d7a399f3da75/torchmetrics/utilities/checks.py#L73

# Get the case
if preds.ndim == 1 and preds_float:
    case = DataType.BINARY
elif preds.ndim == 1 and not preds_float:
    case = DataType.MULTICLASS
elif preds.ndim > 1 and preds_float:
    case = DataType.MULTILABEL
else:
    case = DataType.MULTIDIM_MULTICLASS

Would a MULTIDIM_BINARY case make sense to properly handle binary semantic segmentation?
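To illustrate why I am asking, here is a standalone sketch of the quoted branch only, written without torch. The DataType enum below is a local stand-in, and infer_case is a hypothetical helper that takes the number of dimensions and a float flag instead of real tensors; it assumes preds and targets have the same number of dimensions and ignores every other check the library may perform.

```python
from enum import Enum


class DataType(Enum):
    # Local stand-in for the library's enum, for illustration only.
    BINARY = "binary"
    MULTICLASS = "multi-class"
    MULTILABEL = "multi-label"
    MULTIDIM_MULTICLASS = "multi-dim multi-class"


def infer_case(preds_ndim: int, preds_float: bool) -> DataType:
    """Mirror the quoted if/elif chain using plain Python values."""
    if preds_ndim == 1 and preds_float:
        return DataType.BINARY
    if preds_ndim == 1 and not preds_float:
        return DataType.MULTICLASS
    if preds_ndim > 1 and preds_float:
        return DataType.MULTILABEL
    return DataType.MULTIDIM_MULTICLASS


# Probabilities of shape (10,): plain binary classification.
print(infer_case(1, True))

# A binary segmentation probability map of shape (10, 32, 32) is float
# with more than one dimension, so this branch labels it multi-label.
print(infer_case(3, True))
```

If this reading is right, binary segmentation maps cannot be told apart from multi-label inputs by shape and dtype alone.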

@SkafteNicki
Member

Hi @mayeroa, thanks for raising this issue.
You are completely right that there are some problems with the consistency of our classification metrics, which are becoming more and more clear as we gain more users.

The optimal case would be that all metrics support binary, multi-label and multi-class classification and that the user can have as many additional dimensions as they want for each metric. However, this is definitely not the case at the moment.

I think it is pretty clear that the whole classification package needs to be refactored, and IMO we need to add an argument called task (or something similar), which can be either binary, multi-label or multi-class and which the user needs to provide. However, this is a huge undertaking and I am not sure when we will get the bandwidth to do it.
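As a rough idea of what an explicit task argument could look like, here is a toy accuracy function in plain Python. The function name, the accepted task strings, the threshold default and the list-based inputs are all assumptions made for illustration, not an existing or planned API.

```python
VALID_TASKS = ("binary", "multiclass", "multilabel")


def accuracy(preds, target, task, threshold=0.5):
    """Toy accuracy where the caller states the task instead of it being guessed."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {VALID_TASKS}, got {task!r}")

    if task == "binary":
        # preds: one probability per sample, target: one 0/1 label per sample.
        hard = [int(p > threshold) for p in preds]
        return sum(h == t for h, t in zip(hard, target)) / len(target)

    if task == "multiclass":
        # preds: one list of class probabilities per sample, target: class index.
        hard = [max(range(len(row)), key=row.__getitem__) for row in preds]
        return sum(h == t for h, t in zip(hard, target)) / len(target)

    # multilabel: preds and target are per-sample lists with one entry per label.
    hard = [[int(p > threshold) for p in row] for row in preds]
    correct = sum(h == t for hr, tr in zip(hard, target) for h, t in zip(hr, tr))
    total = sum(len(row) for row in target)
    return correct / total


binary_acc = accuracy([0.9, 0.2, 0.6], [1, 0, 0], task="binary")
multiclass_acc = accuracy([[0.1, 0.9], [0.8, 0.2]], [1, 0], task="multiclass")
multilabel_acc = accuracy([[0.9, 0.1], [0.4, 0.7]], [[1, 0], [1, 1]], task="multilabel")
```

With the task stated up front, inputs of the same shape and dtype no longer need to be disambiguated by heuristics.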

@Borda
Member

Borda commented Jan 6, 2022

cc: @aribornstein @ethanwharris

@Borda Borda added the Important milestonish label Jan 6, 2022
@Borda Borda added this to the v0.8 milestone Jan 6, 2022
@Borda Borda changed the title [Feature Request] Classification metrics consistency Classification metrics consistency Jan 26, 2022
@Borda Borda modified the milestones: v0.8, v0.9 Mar 22, 2022
@SkafteNicki SkafteNicki modified the milestones: v0.9, v0.10 May 12, 2022
@SkafteNicki
Member

This issue will be fixed by the classification refactor: see issue #1001 and PR #1195 for all changes.

Small recap: This issue describes a number of inconsistencies in the classification package, especially regarding how binary tasks are dealt with, both for single-dimensional input and for multi-dimensional input (as in segmentation). In the refactor, all metrics have been split into binary_*, multiclass_* and multilabel_* versions that all follow the same interface, which makes the metrics consistent. To be precise regarding this issue:

  • Binary classification -> use binary_* metrics
  • Multi-class classification -> use multiclass_* metrics
  • Binary semantic segmentation -> use binary_* metrics
  • Multi-class semantic segmentation -> use multiclass_* metrics

Issue will be closed when #1195 is merged.
