[questions] How to properly deal with categorical variables #4932
Original question from @HarryAtDelphia:

According to the official documentation (https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html), categorical features should be encoded as non-negative integers in LightGBM. I encoded the categorical features as non-negative integers using OrdinalEncoder, but when I converted the pandas DataFrame to a NumPy array, the features were converted to float. My question is: can LightGBM properly treat categorical features stored as floats? What is the best way to deal with categorical features?

I am using the scikit-learn API. The versions are Python 3.8, LightGBM 3.3.1, pandas 1.1.5, NumPy 1.19.5, and scikit-learn 1.0.1.

Here is a sample of my code. Thank you for your help.
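For illustration, here is a minimal sketch of the pipeline described above; the DataFrame, column names, and values are hypothetical, not the original code:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one string-valued categorical column, one numeric column
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"], "size": [1.0, 2.5, 0.3, 4.1]})

# Encode the categorical column as non-negative integers, as the docs suggest
df[["city"]] = OrdinalEncoder().fit_transform(df[["city"]])

# Converting the mixed-dtype DataFrame to a NumPy array promotes everything to float
X = df.to_numpy()
print(X.dtype)  # float64 -- the encoded integers are now floats
```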
Hi @HarryAtDelphia. You don't have to worry about the categoricals being floats, as long as you tell LightGBM that those features are meant to be treated as categoricals. Here's an example:

```python
import lightgbm as lgb
import numpy as np

n_samples = 1_000
n_categoricals = 2
n_continuous = 2

# Two integer-valued categorical columns and two continuous columns
categoricals = np.random.randint(0, 20, size=(n_samples, n_categoricals))
continuous = np.random.rand(n_samples, n_continuous)

# Stacking them into one array promotes everything to float64
X = np.hstack([categoricals, continuous])
print(X.dtype)  # float64

# The target depends on whether the first (categorical) feature equals 10
y = (X[:, 0] == 10) * X[:, -1]

model = lgb.LGBMRegressor(
    n_estimators=1,
    num_leaves=15,
    # indices of the columns to treat as categorical
    categorical_feature=np.arange(n_categoricals),
)
model.fit(X, y)
lgb.plot_tree(model)
```

You can see here that the first split asks whether the first feature is 0, 10, or 12 (i.e. that feature is treated as categorical).
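As a side note beyond what the thread shows: when using the scikit-learn API with a pandas DataFrame, another common option is to store the categorical columns with the `category` dtype, in which case LightGBM detects them automatically (the `categorical_feature` parameter defaults to `'auto'`). A minimal sketch, with hypothetical column names and data:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

n_samples = 1_000
df = pd.DataFrame({
    # a hypothetical categorical column stored with pandas category dtype
    "color": pd.Categorical(np.random.choice(["red", "green", "blue"], size=n_samples)),
    "weight": np.random.rand(n_samples),
})
y = (df["color"] == "red") * df["weight"]

# No explicit categorical_feature needed: with a DataFrame input,
# category-dtype columns are treated as categorical automatically
model = lgb.LGBMRegressor(n_estimators=1, num_leaves=15)
model.fit(df, y)
```

If you go this route, keep the category encoding consistent between the training data and any data you predict on, since LightGBM works with the underlying category codes.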
Thank you so much @jmoralez. This really solves my puzzle.
Thanks for raising this @HarryAtDelphia. I believe we should clarify this in the docs. Do you think that changing
[the current docs wording]
to
[the proposed revision]
makes it a bit clearer?
I agree with you @jmoralez. This will be a clearer introduction to categorical features. I am sure it will help more people who are new to LightGBM.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.