Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docs] clarify that categorical features will be converted to integers internally #4959

Merged
merged 6 commits into from
Feb 20, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion docs/Advanced-Topics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ Categorical Feature Support
- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
- Categorical features will be cast to ``int32`` so they must be encoded as non-negative integers (negative values will be treated as missing)
less than ``Int32.MaxValue`` (2147483647).
It is best to use a contiguous range of integers started from zero.
Floating point numbers in categorical features will be rounded towards 0.

Expand Down
2 changes: 1 addition & 1 deletion docs/Parameters.rst
Original file line number Diff line number Diff line change
Expand Up @@ -816,7 +816,7 @@ Dataset Parameters

- add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features

- **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
- **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)

- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``

Expand Down
2 changes: 1 addition & 1 deletion include/LightGBM/config.h
Original file line number Diff line number Diff line change
Expand Up @@ -697,7 +697,7 @@ struct Config {
// desc = used to specify categorical features
// desc = use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features
// desc = add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features
// desc = **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
// desc = **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)
// desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
// desc = **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
// desc = **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
Expand Down
4 changes: 2 additions & 2 deletions python-package/lightgbm/basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1155,7 +1155,7 @@ def __init__(self, data, label=None, reference=None,
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
jmoralez marked this conversation as resolved.
Show resolved Hide resolved
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down Expand Up @@ -3560,7 +3560,7 @@ def refit(
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down
4 changes: 2 additions & 2 deletions python-package/lightgbm/engine.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def train(
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down Expand Up @@ -460,7 +460,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
jmoralez marked this conversation as resolved.
Show resolved Hide resolved
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down
2 changes: 1 addition & 1 deletion python-package/lightgbm/sklearn.py
Original file line number Diff line number Diff line change
Expand Up @@ -258,7 +258,7 @@ def __call__(self, preds, dataset):
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down