
Classification refactor #1001

Closed
SkafteNicki opened this issue May 3, 2022 · 11 comments · Fixed by #1195

@SkafteNicki
Member

The classification package is long overdue for a refactor, as we are seeing a rising number of issues that either request new features that are hard to implement in the current codebase or stem from a disagreement between what users expect the metrics to do and what they actually do.

A full list of issues that should be taken care of by the refactor can be found here

The refactor hopes to address the following problems:

  • Maintainability: The core of the classification metrics was written by a contributor back when torchmetrics was still pl.metrics. We should perhaps have been more thorough in the review phase, because the code has been hard to maintain. This refactor should help address that by lowering code complexity.
  • Consistency: The classification package has a number of consistency issues. This issue gives an overview of some of them, but essentially num_classes=2 sometimes means binary classification and sometimes means multiclass classification (which differ in their definition), depending on which metric you are using.
  • Expectations: There are a number of cases where the current default arguments do not match user expectations, because we differ slightly from how sklearn handles some cases. This refactor will address these differences.
  • Performance: We have received feedback that our implementations are very slow. While these comparisons are not always fair (comparing a one-line implementation of accuracy against our much more general modular implementation), it is fair to say that some improvements can be made.

Proposed solution

The proposed solution is to split each metric into three separate metric classes:

  • BinaryMetricName
  • MultiClassMetricName
  • MultiLabelMetricName

For example, Accuracy will be split into BinaryAccuracy, MultiClassAccuracy, and MultiLabelAccuracy (a usage sketch follows the list below). This solution directly solves a number of problems:

  1. While most metrics indeed support all three tasks, some support only one or two. Renaming and splitting all metrics will make it much clearer which metrics support which tasks.
  2. Some performance is currently lost to extensive input validation. By requiring the user to specify the task beforehand, we can reduce the amount of input validation needed.
  3. Code complexity: instead of every metric essentially containing if-else statements based on the task being solved, each metric will have a much clearer computational path.
  4. Input arguments: take threshold and topk as examples, which are current arguments to the Accuracy, F1, etc. metrics. threshold should only be set for binary and multilabel, and topk should only be specified for multiclass. Dividing into separate metrics helps communicate which arguments influence the computation.
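
To make point 4 concrete, here is a minimal functional sketch of the intended semantics (simplified stand-ins written for this issue, not the actual torchmetrics implementation):

    import torch

    def binary_accuracy(preds: torch.Tensor, target: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        # threshold only makes sense for binary (and multilabel) inputs
        return ((preds >= threshold).long() == target).float().mean()

    def multiclass_accuracy(preds: torch.Tensor, target: torch.Tensor, top_k: int = 1) -> torch.Tensor:
        # top_k only makes sense for multiclass inputs
        topk_idx = preds.topk(top_k, dim=-1).indices
        return (topk_idx == target.unsqueeze(-1)).any(dim=-1).float().mean()

    # binary: one probability per sample
    print(binary_accuracy(torch.tensor([0.2, 0.8, 0.6, 0.4]), torch.tensor([0, 1, 1, 0])))  # tensor(1.)
    # multiclass: per-class probabilities per sample
    print(multiclass_accuracy(torch.tensor([[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]), torch.tensor([1, 0])))  # tensor(1.)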

Alternatives

  1. We keep everything in one class but introduce a new (required) argument:

    from typing import Literal

    class Accuracy(Metric):
        def __init__(
            self,
            mode: Literal["binary", "multiclass", "multilabel"],
            ...
        ):
            ...

    This alternative is not directly in opposition to the main proposed solution. If requested by users, we could still provide a single class that wraps the three individual metric classes into one (see the dispatch sketch after this list).

  2. We keep the outer API exactly the same and try to clean up the internals. This would most likely only address some of the current problems.
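
For illustration, dispatch in such a wrapper could be done in __new__, along these lines (a sketch assuming the three task-specific classes from the main proposal already exist; here they are borrowed from the torchmetrics.classification namespace as it eventually shipped, so the spelling is MulticlassAccuracy/MultilabelAccuracy):

    from typing import Literal
    from torchmetrics.classification import BinaryAccuracy, MulticlassAccuracy, MultilabelAccuracy

    class Accuracy:
        """Thin wrapper that dispatches to the task-specific metric class."""

        def __new__(cls, task: Literal["binary", "multiclass", "multilabel"], **kwargs):
            if task == "binary":
                return BinaryAccuracy(**kwargs)
            if task == "multiclass":
                return MulticlassAccuracy(**kwargs)
            if task == "multilabel":
                return MultilabelAccuracy(**kwargs)
            raise ValueError(f"Unsupported task: {task!r}")

    acc = Accuracy(task="multiclass", num_classes=3)  # returns a MulticlassAccuracy instance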

Integration

  • PL: should not really be a problem, as core PL does not depend on the classification package, only on the base torchmetrics.Metric class, which should not need to be touched during the refactor. Some examples may need to be updated.
  • Flash: some changes would need to be made. We still expect these to be minimal, as the initialization of a Flash Task should already contain all the information necessary to determine which class should be used. cc: @ethanwharris

Deprecation

The goal is to have the whole classification package refactored/cleaned up as the major work in 0.10. All current classification metrics will be given a deprecation warning, and users will have until 0.11 to refactor their code to use the new classes.

While we are developing the new package we will have a freeze on new metrics in the classification package. We will still happily accept new metrics for other domains.

Documentation impact

Up until v0.8, this change would have made our documentation very annoying to scroll through, as everything was on one central page; it would essentially have made the classification documentation three times harder to navigate.

However, from v0.8 we have one page per metric. For this refactor we would keep one page per core metric, e.g. Accuracy, Precision, Recall, etc., and each page would then list every version of the metric.

Development

The development can essentially be divided into 3 phases:

  1. Development of generalized StatScores and ConfusionMatrix classes for all three tasks. Many classification metrics can be calculated from these statistics (see the sketch after this list).
  2. Subclassing the generalized classes into specific metrics.
  3. Dealing with metrics that do not fit either of the two generalized classes (currently about one third of the classification metrics).
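
As a simplified illustration of phases 1 and 2 (binary case only, written for this issue; not the torchmetrics internals), many metrics reduce to small formulas on top of the four stat scores:

    import torch

    def binary_stat_scores(preds: torch.Tensor, target: torch.Tensor, threshold: float = 0.5):
        """Return (tp, fp, tn, fn) for hard binary predictions."""
        hard = (preds >= threshold).long()
        tp = ((hard == 1) & (target == 1)).sum()
        fp = ((hard == 1) & (target == 0)).sum()
        tn = ((hard == 0) & (target == 0)).sum()
        fn = ((hard == 0) & (target == 1)).sum()
        return tp, fp, tn, fn

    # metrics derived from the shared statistics
    def accuracy(tp, fp, tn, fn):
        return (tp + tn) / (tp + fp + tn + fn)

    def precision(tp, fp, tn, fn):
        return tp / (tp + fp)

    def recall(tp, fp, tn, fn):
        return tp / (tp + fn)

    stats = binary_stat_scores(torch.tensor([0.9, 0.4, 0.7, 0.1]), torch.tensor([1, 1, 0, 0]))
    print(accuracy(*stats), precision(*stats), recall(*stats))  # 0.5, 0.5, 0.5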

The main part of the refactor will be done by @SkafteNicki and @justusschock, with support from the rest of the core metrics team. We may be open to contributions for step 2, as it should be fairly simple subclassing and copy-paste work. Development should start within two weeks.

Any feedback is appreciated :)

@SkafteNicki SkafteNicki added Important milestonish refactoring refactoring and code health labels May 3, 2022
@Borda Borda added this to the v0.10 milestone May 3, 2022
@awaelchli
Contributor

I like that we are thinking of simplifying this for the users. The classification module has grown organically - many more options have been added over time. I agree a refactor is appropriate here.

I like alternative approach 1, where we keep the current basic classes as light wrappers. These are easy to remember and frequently used (Accuracy, F1, Precision, ...) by a large user base. If we go with the wrapper class and do validation inside of it, perhaps we can recommend the specific classes BinaryMetricName, MultiClassMetricName, MultiLabelMetricName depending on the error and use case.

@ethanwharris
Member

Sounds good. I agree with @awaelchli that keeping the base Accuracy etc. classes with e.g. the mode argument seems reasonable. Flash can be updated to work with either API 😃

@Yura52

Yura52 commented Jun 1, 2022

While it is not directly related to the discussion, this issue may also be relevant, since it highlights one more aspect of accuracy-like metrics that could be taken into account during the refactoring.

@adamjstewart
Contributor

All current classification metrics will be given a deprecation warning and users will have until 0.11 to refactor their code to use the new classes.

It seems like this plan was aborted at some point? Can someone point me to the discussion where this happened? I want to clarify whether the old metrics are discouraged or whether the plan is to support both old and new for the indefinite future.

@awaelchli
Contributor

@adamjstewart The PR that deprecated things was #1195. That was two years ago; the deprecation messages are in there, and the classes marked for removal have since been removed.

I guess you may be referring to the class flavors like Accuracy, MultiClassAccuracy, BinaryAccuracy, etc.? I am not aware of any plans to deprecate those. For example, BinaryAccuracy is simply equivalent to Accuracy(task="binary"), and MultiClassAccuracy is equivalent to Accuracy(task="multiclass"). Doing it this way is what the discussion here converged on, and I am not aware of any change in plans.
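
A quick check of that equivalence (assuming a recent torchmetrics release; note the shipped spelling is MulticlassAccuracy):

    import torch
    from torchmetrics import Accuracy
    from torchmetrics.classification import BinaryAccuracy

    preds = torch.tensor([0.1, 0.9, 0.7])
    target = torch.tensor([0, 1, 1])

    wrapped = Accuracy(task="binary")
    direct = BinaryAccuracy()

    assert isinstance(wrapped, BinaryAccuracy)  # the wrapper dispatches to the task-specific class
    assert wrapped(preds, target) == direct(preds, target)  # tensor(1.) in both cases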

@adamjstewart
Contributor

Ah, gotcha. I interpreted this as deprecating Accuracy and replacing it with {Binary,Multi{class,label}}Accuracy. Good to know that both will be supported!

@Borda
Member

Borda commented Aug 4, 2024

Ah, gotcha. I interpreted this as deprecating Accuracy and replacing it with {Binary,Multi{class,label}}Accuracy. Good to know that both will be supported!

This is correct; in the past we had many users confused about which task type was used, especially with some edge cases within the first batch...
Then we had to keep it in the codebase due to compatibility 🐰

@Borda
Member

Borda commented Aug 5, 2024

@lantiga @SkafteNicki are we considering dropping the old metrics in the future 2.0?

@adamjstewart
Contributor

I also see references like "Legacy Example" in https://lightning.ai/docs/torchmetrics/stable/classification/accuracy.html that suggest that the previous style is discouraged. I personally think the old style is very convenient, especially for generic LightningModules like torchgeo.trainers.SemanticSegmentationTask, which could support binary/multiclass/multilabel with only minor changes.
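
For instance, a generic module can build its metric from config through a single code path (a hypothetical sketch; torchgeo's actual trainer code differs):

    from typing import Optional
    from torchmetrics import Accuracy

    def build_accuracy(task: str, num_classes: Optional[int] = None, num_labels: Optional[int] = None):
        # one code path covers binary, multiclass, and multilabel
        return Accuracy(task=task, num_classes=num_classes, num_labels=num_labels)

    binary_acc = build_accuracy("binary")
    multiclass_acc = build_accuracy("multiclass", num_classes=5)
    multilabel_acc = build_accuracy("multilabel", num_labels=3)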

@Borda
Member

Borda commented Aug 6, 2024

I personally think the old style is very convenient, especially for generic LightningModules like

I totally agree, but with the luxury of not needing to think about the task came the frustration of incorrect results... For example, you have a classification task and the shuffling is bad: the first few batches come with only label 0, so the metric is set up as binary, but the task is really multilabel and the other labels come later...
As you can see in the PR, it resolved a handful of issues, but for compatibility we kept the old API 🐰

@adamjstewart
Contributor

few first batches come with labels 0 so the metrics are set up as binary

Oh, I mean auto-detection is bad, but asking the user to specify a task for a single class is nice.
