
In SMOTENC - why the median std is halved to estimate the distance of the categorical features? #857


Closed
solegalli opened this issue Sep 2, 2021 · 1 comment · Fixed by #1014

Comments

@solegalli
Contributor

In this line, when the median(std) is added to the OHE matrix to estimate the distance between categorical features, the median is divided by 2.

Is this a bug, or is it intentional? If it is intentional, why?

Thanks a lot!

@joaopfonseca

Hey @solegalli, it's been a while since you opened this issue, but I just replied to the other issue you opened. It's a bug in the sense that the median should be divided by 2**(1/2) instead of 2. The division was introduced because the features are one-hot encoded: two observations with different values in a categorical feature differ in two one-hot columns, so the intent was for the sum of their squared differences in the Euclidean distance to be Med**2. As implemented, however, that sum is 2 * (Med/2)**2 = Med**2 / 2, so the importance of the categorical features is halved compared to the SMOTENC implementation proposed by Chawla et al. But honestly, I'm not even sure I'm correct about this possible bug; it seems like something so simple that I'm afraid I might be saying something stupid...

Link to my reply in the other issue (where I believe I described this problem in a bit more detail): #860 (comment)
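The arithmetic behind the divide-by-2 versus divide-by-2**(1/2) question can be checked with a small NumPy sketch. This is only an illustration under stated assumptions, not the library's code: `ohe_squared_distance` is a hypothetical helper, `med = 3.0` is a made-up value for the median of the continuous features' standard deviations, and a single categorical feature with two categories is assumed.

```python
import numpy as np


def ohe_squared_distance(nonzero_value: float) -> float:
    """Squared Euclidean distance between two one-hot rows that differ
    in a single categorical feature (hypothetical helper)."""
    # Each row has its "hot" column set to nonzero_value instead of 1.
    row_a = np.array([nonzero_value, 0.0])
    row_b = np.array([0.0, nonzero_value])
    # The rows differ in two columns, so two squared terms are summed.
    return float(np.sum((row_a - row_b) ** 2))


med = 3.0  # assumed median of the continuous features' standard deviations

target = med ** 2                                    # contribution described for Chawla et al.
with_half = ohe_squared_distance(med / 2)            # scaling by Med / 2
with_sqrt2 = ohe_squared_distance(med / np.sqrt(2))  # scaling by Med / 2**(1/2)

print(f"target Med**2:      {target}")
print(f"nonzero = Med/2:    {with_half}")
print(f"nonzero = Med/sqrt2: {with_sqrt2}")
```

Under these assumptions, scaling the nonzero entries by Med/2 yields half of Med**2, while scaling by Med/2**(1/2) recovers Med**2 up to floating-point rounding, which matches the reasoning in the comment above.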
