Encode categorical columns #272

Closed
reza1615 opened this issue Sep 4, 2020 · 9 comments

Comments

@aschonfeld
Collaborator

@reza1615 So I just started looking at the OneHotEncoder documentation and I'm wondering how this will work. Based on what I see in this article, OneHotEncoder will end up returning an entirely new dataframe and not a single column (series). Can you send over an example of how you would expect to use each of these? Thanks.

@reza1615
Author

@aschonfeld Please take a look at the attached file for the encoders:
encoders.zip

@aschonfeld
Collaborator

So based on your OneHotEncoder example in encoders.ipynb you're not using OneHotEncoder from sklearn but pd.get_dummies?

@reza1615
Author

Yes. Dummy encoding is equal to one-hot encoding with the first one-hot column dropped.
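
To illustrate that equivalence, here is a minimal sketch using pandas only. The `color` column and its values are hypothetical and not taken from encoders.ipynb; the point is just that `pd.get_dummies(..., drop_first=True)` gives the same frame as a full one-hot encoding with its first column removed.

```python
import pandas as pd

# Hypothetical example data, not from the attached notebook.
df = pd.DataFrame({"color": ["red", "blue", "green", "blue", "red"]})

# Full one-hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df, columns=["color"])

# Dummy encoding: the same call with drop_first=True.
dummies = pd.get_dummies(df, columns=["color"], drop_first=True)

# Dropping the first one-hot column by hand yields the same frame.
manual = one_hot.drop(columns=one_hot.columns[0])
same = manual.equals(dummies)

print(list(one_hot.columns))
print(list(dummies.columns))
print(same)
```

Categories are ordered alphabetically here, so the column that gets dropped is the one for `blue`.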

@aschonfeld
Collaborator

@reza1615 quick question on OneHot encoder, why are we setting drop_first to True?

@reza1615
Author

@aschonfeld: To avoid dependency among the variables.
If you don't drop the first column, your dummy variables will be correlated: the full set of columns always sums to 1 in every row, so any one column can be derived from the others. This may affect some models adversely, and the effect is stronger when the cardinality is smaller. For example, iterative models may have trouble converging and lists of variable importances may be distorted.
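
The dependency described above can be checked numerically. This is a rough sketch with hypothetical data (the `size` column is made up for illustration): it compares the rank of a design matrix with an intercept column when all one-hot columns are kept versus when the first one is dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with three categories.
df = pd.DataFrame({"size": ["S", "M", "L", "M", "S", "L"]})

full = pd.get_dummies(df, columns=["size"]).astype(float)
reduced = pd.get_dummies(df, columns=["size"], drop_first=True).astype(float)

# In the full encoding every row sums to 1, which duplicates an intercept column.
row_sums = full.sum(axis=1)


def with_intercept(frame):
    """Prepend a column of ones, as a linear model with an intercept would."""
    return np.column_stack([np.ones(len(frame)), frame.to_numpy()])


full_matrix = with_intercept(full)        # 4 columns
reduced_matrix = with_intercept(reduced)  # 3 columns

rank_full = np.linalg.matrix_rank(full_matrix)
rank_reduced = np.linalg.matrix_rank(reduced_matrix)

print(full_matrix.shape[1], rank_full)
print(reduced_matrix.shape[1], rank_reduced)
```

With all columns kept, the matrix has more columns than its rank, so one column is redundant. After dropping the first column, the column count and the rank match.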

@aschonfeld
Collaborator

Maybe I'm doing something wrong then. I'm running this piece of code:

import pandas as pd
df = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})
pd.get_dummies(df, columns=['car'], drop_first=True)

This only returns a dataframe with car_Ford & car_Honda. If this is expected then I'll move forward with the code. Maybe this scenario I'm testing isn't valid 🤦

@reza1615
Author

reza1615 commented Oct 21, 2020

That is correct. We don't need car_Benze. (In real life we need a Benze 😁)
In this data, when the car isn't Ford or Honda, it is Benze. If we kept car_Benze, we would increase the correlation in the data.
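
As a small check of that reasoning, the sketch below rebuilds the missing indicator from the two remaining columns. It reuses the example data from this thread; the variable names are only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})
dummies = pd.get_dummies(df, columns=["car"], drop_first=True).astype(int)

# A row with zeros in both remaining columns can only be "Benze".
implied_benze = 1 - dummies["car_Ford"] - dummies["car_Honda"]

# Compare against the indicator computed directly from the original column.
actual_benze = (df["car"] == "Benze").astype(int)
recovered = bool((implied_benze == actual_benze).all())

print(implied_benze.tolist())
print(recovered)
```

No information is lost by dropping the column, since it can always be reconstructed this way.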

@aschonfeld
Collaborator

Added in v1.19.0
