In SMOTE-NC, the distance between categorical features is determined by the median of the standard deviation of the continuous features in the minority class.
This median value should be calculated on the minority class that is being over-sampled. So it should vary with the class.
In the current implementation, the median value is computed for a single minority class only, namely the class with the fewest observations, and that same value is then used to over-sample every other minority class.
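To make the expected behaviour concrete, here is a minimal sketch that computes the median of the per-feature standard deviations separately for each class being over-sampled. It uses only the Python standard library and is not the imbalanced-learn code: the helper name `median_std_per_class` is hypothetical, and the use of the population standard deviation (`pstdev`) and of "every class except the largest" as the set of over-sampled classes are assumptions made for illustration.

```python
from collections import Counter
from statistics import median, pstdev


def median_std_per_class(X_continuous, y):
    """Return {label: median of per-feature std} for every non-majority class.

    Hypothetical helper for illustration; X_continuous holds only the
    continuous features, one row per sample.
    """
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    result = {}
    for label in counts:
        if label == majority:
            continue
        rows = [row for row, target in zip(X_continuous, y) if target == label]
        # One standard deviation per continuous feature, within this class only.
        stds = [pstdev(column) for column in zip(*rows)]
        result[label] = median(stds)
    return result


X = [
    [0.0, 10.0], [1.0, 12.0], [2.0, 11.0],
    [3.0, 13.0], [4.0, 9.0], [5.0, 14.0],      # class 0 (majority)
    [0.0, 0.0], [2.0, 4.0],                    # class 1
    [0.0, 0.0], [10.0, 20.0], [20.0, 40.0],    # class 2
]
y = [0] * 6 + [1] * 2 + [2] * 3

medians = median_std_per_class(X, y)
```

In this toy data, class 2 is spread much more widely than class 1, so the two classes get clearly different median values. Reusing the value from the smallest class (class 1) for class 2 would understate the categorical penalty for class 2.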
solegalli changed the title from "[BUG] in SMOTENC the median value should be that of the minority being over-sampled" to "[BUG] in SMOTENC the median value of the std of continuous variables should be determined in the minority class being over-sampled" on Sep 9, 2021
Just noticed this as well, and I agree. However, the original SMOTE paper only explains SMOTENC in a binary classification context. The expected behavior of the algorithm in a multiclass context is not described (as far as I can tell), but since the rest of the methods use 1-vs-all approaches, I believe SMOTENC should follow the same logic.
I also found another bug in the categorical feature encoding, where the one-hot encoded features are multiplied by the median of the standard deviations:
In this case the values are divided by 2 because, with one-hot encoded features, a mismatch in one categorical feature shows up in two columns, so the median std enters the Euclidean distance twice. However, dividing by 2 is not the right correction, since Med**2 = (Med/(2**(1/2)))**2 + (Med/(2**(1/2)))**2 != (Med/2)**2 + (Med/2)**2 = Med**2 / 2. The divisor should be 2**(1/2), not 2. In the current implementation the categorical features contribute half as much to the squared distance as they should (according to the original SMOTE paper). Therefore:
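As a sanity check of the arithmetic above (this is a standalone sketch, not the library code or the fix proposed in this comment), the snippet below builds two one-hot rows that encode different categories of a single categorical feature and compares the squared distance for both scalings. The function name `squared_distance_one_mismatch` and the value `med = 3.0` are made up for illustration.

```python
import math


def squared_distance_one_mismatch(value):
    """Squared Euclidean distance between two one-hot rows that encode
    different categories, when the 'hot' entry is set to `value`."""
    a = [value, 0.0]
    b = [0.0, value]
    # The rows differ in exactly two positions, each by `value`.
    return sum((x - y) ** 2 for x, y in zip(a, b))


med = 3.0  # stand-in for the median of the standard deviations

halved = squared_distance_one_mismatch(med / 2)            # scaling by Med / 2
rooted = squared_distance_one_mismatch(med / math.sqrt(2))  # scaling by Med / sqrt(2)

print(halved, rooted, med ** 2)
```

With `med / 2` the mismatch adds `med**2 / 2` to the squared distance, while `med / math.sqrt(2)` adds `med**2`, which matches the penalty described in the SMOTE paper.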