
[BUG] in SMOTENC the median value of the std of continuous variables should be determined in the minority class being over-sampled #860


Closed
solegalli opened this issue Sep 9, 2021 · 1 comment · Fixed by #1015


@solegalli
Contributor

Describe the bug

In SMOTE-NC, the distance contribution of the categorical features is determined by the median of the standard deviations of the continuous features in the minority class.

This median should be computed on the minority class that is currently being over-sampled, so its value should vary from class to class.

In the current implementation, the median is computed only once, for the single smallest minority class (the class with the fewest observations), and that same value is then reused to over-sample every other minority class.
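To illustrate the intended behavior, here is a minimal sketch (not the imbalanced-learn implementation; `median_std` is a hypothetical helper) of computing the median std separately for each class being over-sampled:

```python
import numpy as np

rng = np.random.default_rng(0)
X_cont = rng.normal(size=(100, 3))   # continuous features only
y = rng.integers(0, 3, size=100)     # three classes

def median_std(X, y, target_class):
    """Median of the per-feature standard deviations, computed on
    the samples of the class currently being over-sampled."""
    X_class = X[y == target_class]
    return np.median(X_class.std(axis=0))

# One value per class being over-sampled, instead of a single value
# shared across all minority classes:
per_class = {c: median_std(X_cont, y, c) for c in np.unique(y)}
```

Under the current behavior described above, only the value for the smallest class would be computed and reused everywhere.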

@solegalli solegalli changed the title [BUG] in SMOTENC the median value should be that of the minority being over-sampled [BUG] in SMOTENC the median value of the std of continuous variables should be determined in the minority class being over-sampled Sep 9, 2021
@joaopfonseca

joaopfonseca commented Jun 22, 2022

Just noticed this as well, and I agree. However, the original SMOTE paper only describes SMOTENC in a binary classification context; the expected behavior of the algorithm in a multiclass setting is not specified (as far as I can tell). Since the rest of the methods use one-vs-all approaches, I believe SMOTENC should probably follow the same logic.

In addition, I found another bug in the categorical feature encoding, where the one-hot encoded features are multiplied by the median of the standard deviations:

https://github.com/scikit-learn-contrib/imbalanced-learn/blob/master/imblearn/over_sampling/_smote/base.py#L575

In this case the values are divided by 2 because, with one-hot encoding, a category mismatch makes two encoded positions differ, so the median std is counted twice when measuring the Euclidean distance. However, dividing by 2 is not the right correction, since Med**2 = (Med/sqrt(2))**2 + (Med/sqrt(2))**2 != (Med/2)**2 + (Med/2)**2. In the current implementation the categorical features therefore contribute only half as much to the distance computation as they should (according to the original SMOTE paper). Therefore:

X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / 2

Should be:

X_ohe.data = np.ones_like(X_ohe.data, dtype=X_ohe.dtype) * self.median_std_ / np.sqrt(2)
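A quick numeric sanity check of the reasoning (assuming a category mismatch makes exactly two one-hot positions differ, each contributing its encoded value squared to the Euclidean distance):

```python
import numpy as np

med = 1.7  # an arbitrary median of the continuous features' stds

# Proposed encoding: med / sqrt(2) per differing one-hot position,
# so the total squared contribution recovers med**2 exactly.
corrected = (med / np.sqrt(2)) ** 2 + (med / np.sqrt(2)) ** 2

# Current encoding: med / 2 per position, which only yields med**2 / 2.
current = (med / 2) ** 2 + (med / 2) ** 2

assert np.isclose(corrected, med ** 2)
assert np.isclose(current, med ** 2 / 2)
```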

This might change the behavior of SMOTENC significantly. Is my reasoning correct? If so I would be happy to open a PR to fix both problems.
