Learning over imbalanced data on the fly. #512
-
This is perhaps more of a general ML question, but here goes. Givens: a 3-class classification problem in which the balance between classes tends to change frequently and stay changed for an uncertain period. It's not an anomaly-detection problem: the minority class is neither consistent (except in the short term) nor particularly rare. There's no meaningful way to determine a static weighting or distribution a priori. What's the current thinking on best practices for dealing with this?
Replies: 4 comments 4 replies
-
Good question! I thought about this quite a bit ~1.5 years ago and wrote a related blog post. We then implemented samplers that balance the data online in the imblearn module. The distributions maintained by these samplers are just dictionaries that count the occurrences of each class, so they don't adapt very well to a change in distribution. However, it should be very straightforward to make these counters adaptive by only measuring the distribution on the n latest samples. Concretely, if I were to do this, I would implement a new distribution that only counts occurrences over the n latest samples. Note that an advantage of using these counters is that you can add a priori knowledge: just fill in the counters before running the model. I hope this (partially) answers your question.
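To make "only measuring the distribution on the n latest samples" concrete, here is a minimal sketch in plain Python (the class name and API are hypothetical, not part of the library): a class counter backed by a fixed-size window, so the estimated distribution tracks recent data and forgets old regimes. Seeding the counter implements the a priori knowledge trick mentioned above.

```python
from collections import Counter, deque


class RollingClassDist:
    """Tracks the class distribution over the n most recent labels,
    so the estimate adapts when the class balance drifts."""

    def __init__(self, window_size, prior=None):
        self.window = deque(maxlen=window_size)
        # Optional a priori knowledge: seed the counts before any data arrives.
        # Note that seeded counts are never "forgotten" by the window.
        self.counts = Counter(prior or {})

    def update(self, y):
        if len(self.window) == self.window.maxlen:
            # Forget the oldest label so old regimes stop influencing the estimate.
            self.counts[self.window[0]] -= 1
        self.window.append(y)
        self.counts[y] += 1

    def proba(self, y):
        total = sum(self.counts.values())
        return self.counts[y] / total if total else 0.0
```

With `window_size=1000`, a shift in the class balance is fully reflected in the estimate after at most 1000 samples, instead of being diluted by the entire history.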
-
Here’s a slightly different question. The imblearn methods take a desired_dist parameter: a dictionary describing the desired distribution for use in the various resampling classifiers. Is this desired distribution describing the a priori knowledge about the actual distribution of the data, or the ratios you want the classifier to see? Assume my a priori distribution of the actual data is { -1: 0.05, 0: 0.8, 1: 0.15 }. Obviously the classes { -1, 1 } are going to be harder to learn. I (think I) want the underlying classifier (a OneVsRest-wrapped ALMA) to “see” each class an ~equal number of times. Should my desired_dist be { -1: 0.33, 0: 0.34, 1: 0.33 }?
-
Sorry I wasn't clear: desired_dist is indeed the distribution we want the classifier to see. The

```python
if desired_dist is None:
    desired_dist = self._actual_dist
```

is basically an edge case: if no desired distribution is specified, the data is sampled completely at random.

Yes! Although I would recommend trying a more permissive desired distribution too.
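To illustrate what "the distribution the classifier sees" means in practice, here is a rough sketch of the rejection-sampling idea behind under-sampling (plain Python, hypothetical function, not the library's actual code): each incoming example of class y is kept with probability proportional to desired_dist[y] / actual_dist[y], so the surviving stream roughly follows the desired distribution.

```python
import random


def balance_stream(stream, actual_dist, desired_dist, seed=42):
    """Yield (x, y) pairs so a downstream classifier sees roughly
    `desired_dist` instead of `actual_dist`, via rejection sampling."""
    rng = random.Random(seed)
    # Scale so the most under-represented class is always accepted.
    m = max(desired_dist[y] / actual_dist[y] for y in desired_dist)
    for x, y in stream:
        if rng.random() < desired_dist[y] / (actual_dist[y] * m):
            yield x, y
```

With the distributions from the question above (actual { -1: 0.05, 0: 0.8, 1: 0.15 }, desired roughly uniform), every class -1 example is kept while class 0 examples are dropped about 94% of the time.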
-
Ah! The code(r) is smarter than I am. The gorpy hack can wait. Thanks, Max!