Update Chebyshev approaches #894

Merged 1 commit on Mar 28, 2022

65 changes: 40 additions & 25 deletions river/imblearn/chebyshev.py
@@ -18,13 +18,17 @@ class ChebyshevUnderSampler(base.Wrapper, base.Regressor):
     for an observation $y$ becomes: $P(|y - \\overline{y}|=t) = \\dfrac{\\sigma^2}{|y-\\overline{y}|^2}$.
     The reciprocal of this probability is used for under-sampling[^1] the most frequent cases. Extreme
     valued or rare cases have higher probabilities of selection, whereas the most frequent cases are
-    likely to be discarded.
+    likely to be discarded. Still, frequent cases have a small chance of being selected (controlled via
+    the `sp` parameter) in case few rare instances have been observed.


     Parameters
     ----------
     regressor
         The regression model that will receive the biased sample.
+    sp
+        Second chance probability. Even if an example is not initially selected for training, it still
+        has a small chance of being selected in case the number of rare cases observed so far is small.
     seed
         Random seed to support reproducibility.
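
As a standalone illustration of the selection rule described above (the helper name keep_for_training is hypothetical, and the _rare_c/_freq_c counters that gate the second chance are omitted for brevity):

    import random

    def keep_for_training(y, mean, sd, sp=0.15, rng=random):
        """Chebyshev-based selection: keep rare targets, occasionally keep frequent ones."""
        t = abs(y - mean) / sd  # standardised distance to the running mean
        # Chebyshev: P(|y - mean| >= t * sd) <= 1 / t**2, so 1 / t**2 upper-bounds
        # how frequent y is; the threshold is 1 for t <= 1 (frequent cases)
        prob_threshold = 1 / (t * t) if t > 1 else 1
        p = rng.random()
        if p >= prob_threshold:
            return True  # rare case: train on it
        return p <= sp  # second chance for a frequent case

For example, with mean 0 and sd 1, a target at y = 4 is kept with probability about 1 - 1/16 ≈ 0.94, while y = 0.5 can only be kept through the sp second chance.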

@@ -53,9 +57,9 @@ class ChebyshevUnderSampler(base.Wrapper, base.Regressor):
     ...     metrics.MAE(),
     ...     print_every=500
     ... )
-    [500] MAE: 1.84619
-    [1,000] MAE: 1.516441
-    MAE: 1.515879
+    [500] MAE: 1.633571
+    [1,000] MAE: 1.460907
+    MAE: 1.4604

     References
     ----------
@@ -64,13 +68,17 @@ class ChebyshevUnderSampler(base.Wrapper, base.Regressor):

"""

def __init__(self, regressor: base.Regressor, seed: int = None):
def __init__(self, regressor: base.Regressor, sp: float = 0.15, seed: int = None):
self.regressor = regressor
self.sp = sp
self.seed = seed

self._var = stats.Var()
self._rng = random.Random(self.seed)

self._freq_c = 0
self._rare_c = 0

@property
def _wrapped_model(self):
return self.regressor
@@ -79,21 +87,27 @@ def predict_one(self, x):
         return self.regressor.predict_one(x)

     def learn_one(self, x, y, **kwargs):
-        var = self._var.get()
-        sd = var**0.5
+        self._var.update(y)
+        sd = self._var.get() ** 0.5
MaxHalford (Member), Mar 28, 2022:

@smastelini I think you have to upgrade black, the latest convention is self._var.get()**0.5

smastelini (Member Author):

Good catch, thanks! Let me check that right now.

smastelini (Member Author):

That's strange @MaxHalford, I even tried to reinstall my conda environment (and installed the latest black version) and the hooks don't catch that. Note that all tests passed in the main repo.

raphaelsty (Member):

@smastelini If you use VS Code, you have to check the version of Python linked to it at the bottom right when viewing a Python script. Maybe it's not the environment where you installed the latest version of Black.

[screenshot: VS Code status bar showing the selected Python interpreter]

Maybe it will solve the issue 🤔

smastelini (Member Author), Mar 28, 2022:

Hmm, black seems to make some choices here.

If I store the result of var in an auxiliary variable, I get:

    var = self._var.get()
    sd = var**0.5

Adding spaces around ** will make the hook fail. But if I use the result directly, the formatting becomes sd = self._var.get() ** 0.5. Removing the space will also make the hook fail.

Interesting choices. Look at the PR title in black's repo: Hug power operators if its operands are "simple".
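
For reference, a small runnable illustration of that rule (behavior of black's 2022 stable style, as the quoted PR title describes: ** is hugged only when both operands are "simple" expressions such as names, literals, or attribute access, and a method call does not count as simple):

    import statistics

    values = [1.0, 2.0, 4.0, 8.0]
    var = statistics.variance(values)

    sd = var**0.5                            # name ** literal: both simple, so black hugs **
    sd = statistics.variance(values) ** 0.5  # call ** literal: black keeps the spaces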

smastelini (Member Author):

Thanks for the reply @raphaelsty! Actually, I ran everything in my terminal hahah.

I got used to doing everything this way: git, pre-commit hooks, and so on :)


         if sd > 0:
             mean = self._var.mean.get()
-            dev = abs(y - mean)
+            dev = abs(y - mean)  # noqa
             t = dev / sd
-            if t > 1:
-                prob_train = 1 - (var / (dev**2))
-                p = self._rng.random()
-
-                if p < prob_train:
-                    self.regressor.learn_one(x, y, **kwargs)
+            # Small values for rare cases and 1 for frequent cases
+            prob_threshold = 1 / (t * t) if t > 1 else 1
+            p = self._rng.random()
+
+            if p >= prob_threshold:
+                self.regressor.learn_one(x, y, **kwargs)
+                self._rare_c += 1
+            elif self._freq_c < self._rare_c and p <= self.sp:
+                self.regressor.learn_one(x, y, **kwargs)
+                self._freq_c += 1
         else:
             self.regressor.learn_one(x, y, **kwargs)

-        self._var.update(y)
         return self

     @classmethod
@@ -116,8 +130,8 @@ class ChebyshevOverSampler(base.Wrapper, base.Regressor):

     Alternatively, one can use $t$ directly to estimate a frequency weight $\\kappa = \\lceil t\\rceil$
     and define an over-sampling strategy for extreme and rare target values[^1]. Each incoming instance is
-    used $\\kappa$ times to update the underlying regressor, in case $t > 1$. Otherwise, the instance is
-    ignored by the wrapped regression model.
+    used $\\kappa$ times to update the underlying regressor. Frequent target values contribute only once
+    to the underlying regressor, whereas rare cases are used multiple times for training.


     Parameters
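
The frequency weight described above can be written as a standalone helper (the name chebyshev_kappa is hypothetical; the wrapper computes it inline, as the diff below shows):

    import math

    def chebyshev_kappa(y, mean, sd):
        """How many times the regressor trains on (x, y): ceil(|y - mean| / sd)."""
        t = abs(y - mean) / sd
        return int(math.ceil(t))  # 1 for frequent targets (0 < t <= 1), larger for rare ones

    # With mean 0 and sd 1: a typical target is used once, an extreme one several times
    assert chebyshev_kappa(0.5, mean=0.0, sd=1.0) == 1
    assert chebyshev_kappa(3.2, mean=0.0, sd=1.0) == 4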
@@ -149,9 +163,9 @@ class ChebyshevOverSampler(base.Wrapper, base.Regressor):
     ...     metrics.MAE(),
     ...     print_every=500
     ... )
-    [500] MAE: 2.131883
-    [1,000] MAE: 1.496747
-    MAE: 1.496013
+    [500] MAE: 1.152726
+    [1,000] MAE: 0.954873
+    MAE: 0.954049

     References
     ----------
@@ -173,21 +187,22 @@ def predict_one(self, x):
         return self.regressor.predict_one(x)

     def learn_one(self, x, y, **kwargs):
+        self._var.update(y)
         var = self._var.get()
         sd = var**0.5

         if sd > 0:
             mean = self._var.mean.get()
-            dev = abs(y - mean)
+            dev = abs(y - mean)  # noqa
             t = dev / sd

-            if t > 1:
-                kappa = int(math.ceil(t))
+            kappa = int(math.ceil(t))

-                for k in range(kappa):
-                    self.regressor.learn_one(x, y, **kwargs)
+            for k in range(kappa):
+                self.regressor.learn_one(x, y, **kwargs)
+        else:
+            self.regressor.learn_one(x, y, **kwargs)

-        self._var.update(y)
         return self

     @classmethod