Encode categorical columns #272

reza1615 · 2020-09-04T16:40:30Z

It would be helpful to have column encoding, either as a new column or, in some cases, in place. Most common algorithms are

aschonfeld · 2020-10-09T19:51:14Z

@reza1615 So I just started looking at the OneHotEncoder documentation and I'm wondering how this will work. Based on what I see in this article OneHotEncoder will end up returning an entirely new dataframe and not a column (series). Can you send over an example of how you would expect to use each of these? thanks

reza1615 · 2020-10-11T13:22:46Z

@aschonfeld please take a look at the attached file for the encoders
encoders.zip

aschonfeld · 2020-10-18T03:10:51Z

So based on your OneHotEncoder example in encoders.ipynb you're not using OneHotEncoder from sklearn but pd.get_dummies?

reza1615 · 2020-10-18T03:16:22Z

Yes, pd.get_dummies with drop_first=True is equal to one-hot encoding with the first one-hot column dropped.
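
For anyone following along, here is a minimal sketch of that equivalence. The toy `car` data and the variable names are only illustrative, not taken from encoders.ipynb.

```python
import pandas as pd

# Toy data, assumed for illustration only.
df = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(df, columns=["car"])

# Same encoding, but with the first (alphabetically sorted) category dropped.
reduced = pd.get_dummies(df, columns=["car"], drop_first=True)

# Dropping the first indicator column of the full encoding by hand
# should give exactly the drop_first=True result.
same = full.iloc[:, 1:].equals(reduced)

print(list(full.columns))
print(list(reduced.columns))
print(same)
```

Note that the dropped column is the first category in sorted order, not the first value that appears in the data.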

aschonfeld · 2020-10-21T03:02:59Z

@reza1615 quick question on OneHot encoder, why are we setting drop_first to True?

reza1615 · 2020-10-21T10:38:34Z

@aschonfeld : to avoid dependency among the variables.
If you don't drop the first column then your dummy variables will be correlated. This may affect some models adversely and the effect is stronger when the cardinality is smaller. For example iterative models may have trouble converging and lists of variable importances may be distorted.
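
A rough way to see the dependency, assuming a model with an intercept term: in the full one-hot encoding every row sums to 1, so the indicator columns together reproduce the intercept column and the design matrix loses rank. The data and the helper `rank_with_intercept` below are hypothetical, used only for this sketch.

```python
import numpy as np
import pandas as pd

# Toy data, assumed for illustration only.
cars = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})

full = pd.get_dummies(cars, columns=["car"]).astype(int)
reduced = pd.get_dummies(cars, columns=["car"], drop_first=True).astype(int)

# In the full encoding exactly one indicator is set per row.
row_sums = full.sum(axis=1)


def rank_with_intercept(dummies):
    """Return (rank, number of columns) of the design matrix [1 | dummies]."""
    X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
    return int(np.linalg.matrix_rank(X)), X.shape[1]


# Full encoding plus intercept: rank is lower than the column count.
full_rank, full_cols = rank_with_intercept(full)

# drop_first encoding plus intercept: rank equals the column count.
reduced_rank, reduced_cols = rank_with_intercept(reduced)

print(full_rank, full_cols)
print(reduced_rank, reduced_cols)
```

Tree-based models are usually less sensitive to this than linear models, so whether to drop a column can depend on what the encoded data is used for.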

aschonfeld · 2020-10-21T14:49:13Z

Maybe I'm doing something wrong then. I'm running this piece of code:

df = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})
pd.get_dummies(df, columns=['car'], drop_first=True)

This only returns a dataframe with car_Ford & car_Honda. If this is expected then I'll move forward with the code. Maybe this scenario I'm testing isn't valid 🤦

reza1615 · 2020-10-21T15:18:44Z

It is correct. We don't need car_Benze. (In real life we need a Benze 😁)
In this data, when the car isn't Ford or Honda it is Benze, so car_Benze carries no extra information. If we kept car_Benze, we would increase the correlation in the data.
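
As a small check of that point, the dropped category can be recovered from the two remaining columns. This is only a sketch on the same toy data; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"car": ["Honda", "Benze", "Ford", "Honda", "Benze", "Ford"]})

reduced = pd.get_dummies(df, columns=["car"], drop_first=True).astype(int)

# A row with zeros in both remaining columns must be the dropped category.
is_benze = (reduced["car_Ford"] == 0) & (reduced["car_Honda"] == 0)

recovered = is_benze.tolist()
expected = (df["car"] == "Benze").tolist()

print(recovered == expected)
```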

aschonfeld · 2020-10-24T17:33:43Z

Added in v1.19.0

aschonfeld added a commit that referenced this issue Oct 18, 2020

#272: encoders for categorical columns

eaab8ce

aschonfeld added a commit that referenced this issue Oct 21, 2020

#272: encoders for categorical columns

1cec829

aschonfeld added a commit that referenced this issue Oct 24, 2020

#272: encoders for categorical columns

fd406bb

aschonfeld closed this as completed Oct 24, 2020
