[questions] How to properly deal with categorical variables #4932
Original question from @HarryAtDelphia:

According to the official documentation (https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html), categorical features should be encoded as non-negative integers in LightGBM. I encoded the categorical features as non-negative integers using OrdinalEncoder, but when I converted the pandas DataFrame to a NumPy array, the features were converted to float. My question is: can LightGBM properly treat categorical features stored as floats? What is the best way to deal with categorical features?

I am using the scikit-learn API. The versions are Python 3.8, LightGBM 3.3.1, pandas 1.1.5, NumPy 1.19.5, and scikit-learn 1.0.1.

Here is a sample of my code. Thank you for your help.
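For illustration, here is a minimal sketch of the pipeline described above; the DataFrame, column names, and values are hypothetical, not the original code:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: one string-valued categorical column, one numeric column
df = pd.DataFrame({"city": ["NY", "LA", "NY", "SF"], "size": [1.0, 2.5, 0.3, 4.1]})

# Encode the categorical column as non-negative integers, as the docs suggest
df[["city"]] = OrdinalEncoder().fit_transform(df[["city"]])

# Converting the mixed-dtype DataFrame to a NumPy array promotes everything to float
X = df.to_numpy()
print(X.dtype)  # float64 -- the encoded integers are now floats
```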
Hi @HarryAtDelphia. You don't have to worry about the categoricals being floats, as long as you tell LightGBM that those features are meant to be treated as categoricals. Here's an example:

```python
import lightgbm as lgb
import numpy as np

n_samples = 1_000
n_categoricals = 2
n_continuous = 2

# Two integer-valued categorical columns and two continuous columns
categoricals = np.random.randint(0, 20, size=(n_samples, n_categoricals))
continuous = np.random.rand(n_samples, n_continuous)

# Stacking them into one array promotes everything to float64
X = np.hstack([categoricals, continuous])
print(X.dtype)  # float64

# The target depends on whether the first (categorical) feature equals 10
y = (X[:, 0] == 10) * X[:, -1]

model = lgb.LGBMRegressor(
    n_estimators=1,
    num_leaves=15,
    # indices of the columns to treat as categorical
    categorical_feature=np.arange(n_categoricals),
)
model.fit(X, y)
lgb.plot_tree(model)
```

You can see here that the first split asks whether the first feature is 0, 10, or 12 (i.e. that feature is treated as categorical).
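As a side note beyond what the thread shows: when using the scikit-learn API with a pandas DataFrame, another common option is to store the categorical columns with the `category` dtype, in which case LightGBM detects them automatically (the `categorical_feature` parameter defaults to `'auto'`). A minimal sketch, with hypothetical column names and data:

```python
import lightgbm as lgb
import numpy as np
import pandas as pd

n_samples = 1_000
df = pd.DataFrame({
    # a hypothetical categorical column stored with pandas category dtype
    "color": pd.Categorical(np.random.choice(["red", "green", "blue"], size=n_samples)),
    "weight": np.random.rand(n_samples),
})
y = (df["color"] == "red") * df["weight"]

# No explicit categorical_feature needed: with a DataFrame input,
# category-dtype columns are treated as categorical automatically
model = lgb.LGBMRegressor(n_estimators=1, num_leaves=15)
model.fit(df, y)
```

If you go this route, keep the category encoding consistent between the training data and any data you predict on, since LightGBM works with the underlying category codes.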
Thank you so much @jmoralez. This really solves my puzzle.
Thanks for raising this @HarryAtDelphia. I believe we should clarify this in the docs. Do you think that changing
[the current docs wording]
to
[the proposed revision]
makes it a bit clearer?
I agree with you @jmoralez. This will be a clearer introduction to categorical features. I am sure it will help more people who are new to LightGBM.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.