Skip to content

Commit

Permalink
[docs] clarify that categorical features will be converted to integer…
Browse files Browse the repository at this point in the history
…s internally (#4959)

* clarify that categoricals will be converted to ints and not that they should be ints in the input data

* update remaining sections

* update config.h

* add suggestions
  • Loading branch information
jmoralez authored Feb 20, 2022
1 parent 0db573c commit 820ae7e
Show file tree
Hide file tree
Showing 6 changed files with 9 additions and 8 deletions.
3 changes: 2 additions & 1 deletion docs/Advanced-Topics.rst
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ Categorical Feature Support
- Use ``categorical_feature`` to specify the categorical features.
Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst#categorical_feature>`__.

- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647).
- Categorical features will be cast to ``int32`` so they must be encoded as non-negative integers (negative values will be treated as missing)
less than ``Int32.MaxValue`` (2147483647).
It is best to use a contiguous range of integers started from zero.
Floating point numbers in categorical features will be rounded towards 0.

Expand Down
2 changes: 1 addition & 1 deletion docs/Parameters.rst
Original file line number Diff line number Diff line change
Expand Up @@ -816,7 +816,7 @@ Dataset Parameters

- add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features

- **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
- **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)

- **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``

Expand Down
2 changes: 1 addition & 1 deletion include/LightGBM/config.h
Original file line number Diff line number Diff line change
Expand Up @@ -697,7 +697,7 @@ struct Config {
// desc = used to specify categorical features
// desc = use number for index, e.g. ``categorical_feature=0,1,2`` means column\_0, column\_1 and column\_2 are categorical features
// desc = add a prefix ``name:`` for column name, e.g. ``categorical_feature=name:c1,c2,c3`` means c1, c2 and c3 are categorical features
// desc = **Note**: only supports categorical with ``int`` type (not applicable for data represented as pandas DataFrame in Python-package)
// desc = **Note**: all values will be cast to ``int32`` (integer codes will be extracted from pandas categoricals in the Python-package)
// desc = **Note**: index starts from ``0`` and it doesn't count the label column when passing type is ``int``
// desc = **Note**: all values should be less than ``Int32.MaxValue`` (2147483647)
// desc = **Note**: using large values could be memory consuming. Tree decision rule works best when categorical features are presented by consecutive integers starting from zero
Expand Down
4 changes: 2 additions & 2 deletions python-package/lightgbm/basic.py
Original file line number Diff line number Diff line change
Expand Up @@ -1155,7 +1155,7 @@ def __init__(self, data, label=None, reference=None,
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down Expand Up @@ -3560,7 +3560,7 @@ def refit(
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down
4 changes: 2 additions & 2 deletions python-package/lightgbm/engine.py
Original file line number Diff line number Diff line change
Expand Up @@ -105,7 +105,7 @@ def train(
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down Expand Up @@ -460,7 +460,7 @@ def cv(params, train_set, num_boost_round=100,
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down
2 changes: 1 addition & 1 deletion python-package/lightgbm/sklearn.py
Original file line number Diff line number Diff line change
Expand Up @@ -258,7 +258,7 @@ def __call__(self, preds, dataset):
If list of int, interpreted as indices.
If list of str, interpreted as feature names (need to specify ``feature_name`` as well).
If 'auto' and data is pandas DataFrame, pandas unordered categorical columns are used.
All values in categorical features should be less than int32 max value (2147483647).
All values in categorical features will be cast to int32 and thus should be less than int32 max value (2147483647).
Large values could be memory consuming. Consider using consecutive integers starting from zero.
All negative values in categorical features will be treated as missing values.
The output cannot be monotonically constrained with respect to a categorical feature.
Expand Down

0 comments on commit 820ae7e

Please sign in to comment.