[python] suppress the warning about categorical feature override #3379

guolinke · 2020-09-11T07:36:26Z

C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1555: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['C01', 'C02', 'C03', 'C04', 'C05', 'C06', 'C07', 'C08', 'C09', 'C10', 'C11', 'C12', 'C13', 'C14', 'C15', 'C16', 'C17', 'C18', 'C19', 'C20', 'C21', 'C22', 'C23', 'C24', 'C25', 'C26']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
  warnings.warn('Overriding the parameters from Reference Dataset.')
C:\ProgramData\Anaconda3\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
  warnings.warn('{} in param dict is overridden.'.format(cat_alias))

categorical_column can be set in both lgb.train and lgb.Dataset, but this warning seems to show up whenever categorical_column is set. I think this is quite annoying.

onshek · 2020-09-23T13:23:20Z

I'm willing to take this as my first contribution to the repo.
It seems a new parameter, ignore_warnings, defaulting to True, could be added to the related functions, right?

XiaozhouWang85 · 2020-10-14T09:15:28Z

From reading other issues, it seems the "correct" way of defining categorical columns is via the Dataset. This might only be a problem when the input is a Pandas dataframe.

The problem is that when calling train the categorical column defaults to auto:
categorical_feature (list of strings or int, or 'auto', optional (default="auto"))

If we had an option of None then that might be a way of getting rid of conflicting settings that require a warning.

istavnit · 2020-10-19T05:19:34Z

There seems to be no friendly way to suppress this warning:

XX_train = XX_train[keeperCols]
XX_valid = XX_valid[keeperCols]
# I have only one categorical feature, 'DOW'
trainData = Dataset(XX_train, yy_train, feature_name=keeperCols, categorical_feature=['DOW'])
valid_data = trainData.create_valid(XX_valid, label=yy_valid)
params["seed"] = theSeed
bst = lgb.train(params, trainData, valid_sets=[valid_data], categorical_feature=['DOW'], feature_name=keeperCols,
                num_boost_round=BOOSTROUNDS, early_stopping_rounds=EARLYSTOPROUNDS, verbose_eval=VERBOSEEVAL)

results in these warnings:

C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1551: UserWarning: Using categorical_feature in Dataset.
  warnings.warn('Using categorical_feature in Dataset.')
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1555: UserWarning: categorical_feature in Dataset is overridden.
New categorical_feature is ['DOW']
  'New categorical_feature is {}'.format(sorted(list(categorical_feature))))
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
  warnings.warn('Overriding the parameters from Reference Dataset.')
C:\Users\meanc\anaconda3\envs\keras\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
  warnings.warn('{} in param dict is overridden.'.format(cat_alias))

memeplex · 2020-12-10T21:06:04Z

When I do a grid search using the sklearn api and pass eval_set to fit, I get this warning for every element in the grid (many times!).

I'm just passing a dataframe with categorical features as the train X, the same for eval set, never explicitly passing categorical_feature.

I don't think this behavior is desirable.

memeplex · 2020-12-10T21:52:37Z

Indeed it's not necessary to use the sklearn API in order to reproduce the above. I've provided simple instructions in #3640.

tripti0125 · 2021-03-18T12:18:55Z

I get this warning when using scikit-learn wrapper of LightGBM. Dataset passed to LightGBM is through a scikit-learn pipeline which preprocesses the data in a pandas dataframe and produces a numpy array. Note that this input dataset which the model receives is NOT a Pandas dataframe but numpy array. I set the feature_name and categorical_feature parameters in fit() method as this is the only place these can be set, if you're not using LightGBM native Dataset creation.

I think the warning is useful in some situations but superfluous in the case mentioned above.

C:..\anaconda3\lib\site-packages\lightgbm\basic.py:1286: UserWarning: Overriding the parameters from Reference Dataset.
warnings.warn('Overriding the parameters from Reference Dataset.')
C:..\anaconda3\lib\site-packages\lightgbm\basic.py:1098: UserWarning: categorical_column in param dict is overridden.
warnings.warn('{} in param dict is overridden.'.format(cat_alias))

ThomasBourgeois · 2021-06-23T14:04:06Z

Hi all,
Same issue here: I'm specifying categorical features pretty much everywhere (in the train and validation datasets, plus in train), and I still get warnings. Very strange.
Warnings obtained:
"Overriding the parameters from Reference Dataset.
categorical_column in param dict is overridden."

Code :

train_data = lgb.Dataset(train[feats], label=train[target],
                         feature_name=feats,
                         categorical_feature=cat_feats)
val_data = lgb.Dataset(val[feats], label=val[target], reference=train_data,
                          feature_name=feats,
                          categorical_feature=cat_feats)

params = {'objective': 'mean_squared_error', 
        'metric':'rmse', 'eta': 0.1, 'bagging_fraction': 0.5 }
num_round = 300
bst = lgb.train(params, train_data, num_round, valid_sets=[val_data], early_stopping_rounds=20, 
               categorical_feature=cat_feats, feature_name=feats
               )

memeplex · 2021-08-28T22:40:03Z

With multiple processes in a grid search it's not even possible to use a context manager to suppress this warning during fit; it seems that the context state is lost somehow. My notebook gets literally flooded with:

/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
/Users/carlos/Base/Environments/Jampp/lib/python3.9/site-packages/lightgbm/basic.py:1702: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
....
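
A possible explanation for the lost context, offered as a hedged note: warning filters are per-process state, so a `warnings.catch_warnings()` block in the notebook process does not reach worker processes that are started separately. One option that may help is the `PYTHONWARNINGS` environment variable, which a new interpreter reads at startup. The sketch below is a stand-in, not LightGBM code: it assumes the workers are fresh Python processes that inherit the environment, and the child script (`CHILD_CODE`, a hypothetical name) only emits a warning with the same text as the log above.

```python
import os
import subprocess
import sys

# Stand-in for a worker: emits the message from the log plus an unrelated warning.
CHILD_CODE = (
    "import warnings\n"
    "warnings.warn('Using categorical_feature in Dataset.')\n"
    "warnings.warn('some other warning')\n"
)


def run_child(pythonwarnings=None):
    """Run CHILD_CODE in a new interpreter and return what it wrote to stderr."""
    env = dict(os.environ)
    env.pop("PYTHONWARNINGS", None)
    if pythonwarnings is not None:
        # Format: action:message:category; message must match the start of the text.
        env["PYTHONWARNINGS"] = pythonwarnings
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        env=env, capture_output=True, text=True, check=True,
    )
    return result.stderr


noisy = run_child()
quiet = run_child("ignore:Using categorical_feature in Dataset.:UserWarning")
print("without filter:", noisy.count("UserWarning"), "warning(s)")
print("with filter:   ", quiet.count("UserWarning"), "warning(s)")
```

In a notebook, the equivalent would be setting `os.environ["PYTHONWARNINGS"]` before the worker pool is created; whether that reaches the workers depends on how the parallel backend starts them.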

memeplex · 2021-08-28T22:50:34Z

As a workaround I did the following:

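import warnings
import lightgbm as lgb

class SilentRegressor(lgb.LGBMRegressor):
    def fit(self, *args, **kwargs):
        # Suppress every UserWarning raised during fit; filters are restored on exit.
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            return super().fit(*args, verbose=False, **kwargs)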

hzy46 · 2021-11-01T09:42:28Z

I did some investigation on this issue. There are mainly two sources for these warnings regarding categorical features.

If the dataset doesn't have a reference, the warnings only come from here:

LightGBM/python-package/lightgbm/basic.py

Lines 2045 to 2074 in 7fa07ee

    
    def set_categorical_feature(self, categorical_feature):
        """Set categorical features.

        Parameters
        ----------
        categorical_feature : list of int or str
            Names or indices of categorical features.

        Returns
        -------
        self : Dataset
            Dataset with set categorical features.
        """
        if self.categorical_feature == categorical_feature:
            return self
        if self.data is not None:
            if self.categorical_feature is None:
                self.categorical_feature = categorical_feature
                return self._free_handle()
            elif categorical_feature == 'auto':
                _log_warning('Using categorical_feature in Dataset.')
                return self
            else:
                _log_warning('categorical_feature in Dataset is overridden.\n'
                             f'New categorical_feature is {sorted(list(categorical_feature))}')
                self.categorical_feature = categorical_feature
                return self._free_handle()
        else:
            raise LightGBMError("Cannot set categorical feature after freed raw data, "
                                "set free_raw_data=False when construct Dataset to avoid this.")

This function is called before Dataset.construct() is called.

One can use the following code to reproduce:

import random
import numpy as np
import pandas as pd
import lightgbm as lgb


Categorical_Feature_When_Construct_Dataset = ["a", "b", "d"]
Categorical_Feature_When_Train = 'auto'


def get_data(N):
    data = []
    labels = []
    for i in range(N):
        sample = {
            "a": random.choice([100, 200, 300, 400]),
            "b": random.choice([222, 333]),
            "c": random.random(),
        }
        if sample["a"] == 200 or sample["a"] == 300:
            if sample["b"] == 333:
                label = 1
            else:
                label = 0
        else:
            label = 0
        labels.append(label)
        data.append(sample)
    features = pd.DataFrame(data)
    features["d"] = pd.Categorical(
        [random.choice(["x", "y", "z"]) for i in range(N)], categories=["x", "y", "z"], ordered=False
    )
    labels = pd.Series(labels)
    return features, labels
 
N = 1000
train_features, train_labels = get_data(N)
test_features, test_labels = get_data(N)
 
lgb_train = lgb.Dataset(train_features, train_labels, categorical_feature=Categorical_Feature_When_Construct_Dataset)


params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 4,
    'learning_rate': 0.5,
    'verbose': 0,
}


gbm = lgb.train(params,
                lgb_train,
                num_boost_round=1,
                categorical_feature=Categorical_Feature_When_Train,
)

Here, if Categorical_Feature_When_Construct_Dataset does not equal Categorical_Feature_When_Train, a warning is reported. These two parameters have the same default value: auto. It becomes confusing when the user sets only Categorical_Feature_When_Construct_Dataset and does not set Categorical_Feature_When_Train. The user would expect training to use the same categorical features as the dataset's setting, but a warning is reported in this case because Categorical_Feature_When_Train has the default value auto, which does not match Categorical_Feature_When_Construct_Dataset.

If we use cf1 to represent Categorical_Feature_When_Construct_Dataset, and cf2 to represent Categorical_Feature_When_Train, we can have the following behavior:

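| case | `cf1` | `cf2` | Current behavior | Warning |
|------|-------|-------|------------------|---------|
| 1 | auto | auto | auto | no |
| 2 | auto | specific columns | use `cf2` | yes |
| 3 | specific columns | auto | use `cf1` | yes |
| 4 | specific columns | specific columns | use `cf2` | yes if `cf1` and `cf2` are different |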

For this first source, my proposal is:

If the user is using specific columns to override "auto", we don't report the warning, because the user is just overriding the default parameter.

This aligns with the current behavior. All we need to do is remove the warning for case 2 and case 3 in the table.
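
To make the table and the proposal concrete, here is a minimal sketch in plain Python. It is only a model of the warning decision, not LightGBM code: the function names `warns_currently` and `warns_after_proposal` are hypothetical, and `cf1` / `cf2` stand for the two settings as defined above.

```python
AUTO = "auto"


def warns_currently(cf1, cf2):
    """Warning column of the table above."""
    if cf1 == AUTO and cf2 == AUTO:
        return False          # case 1
    if cf1 == AUTO or cf2 == AUTO:
        return True           # cases 2 and 3
    return cf1 != cf2         # case 4


def warns_after_proposal(cf1, cf2):
    """Proposal: overriding the "auto" default is silent; only a real conflict warns."""
    if cf1 == AUTO or cf2 == AUTO:
        return False          # cases 1, 2 and 3
    return cf1 != cf2         # case 4 is unchanged


cases = [
    (AUTO, AUTO),
    (AUTO, ["a", "b"]),
    (["a", "b"], AUTO),
    (["a", "b"], ["a", "c"]),
]
for number, (cf1, cf2) in enumerate(cases, start=1):
    print(number, warns_currently(cf1, cf2), warns_after_proposal(cf1, cf2))
```

Only the warning changes between the two functions; which setting wins ("Current behavior" in the table) stays as it is.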

The second source comes from the dataset with a reference:

LightGBM/python-package/lightgbm/basic.py

Lines 1778 to 1781 in 7fa07ee

    
        reference_params = self.reference.get_params()
        if self.get_params() != reference_params:
            _log_warning('Overriding the parameters from Reference Dataset.')
            self._update_params(reference_params)

This is always reported if the referenced dataset has any categorical features. For the referenced dataset, its self.params is changed here:

LightGBM/python-package/lightgbm/basic.py

Lines 1498 to 1518 in 7fa07ee

    
        if categorical_feature is not None:
            categorical_indices = set()
            feature_dict = {}
            if feature_name is not None:
                feature_dict = {name: i for i, name in enumerate(feature_name)}
            for name in categorical_feature:
                if isinstance(name, str) and name in feature_dict:
                    categorical_indices.add(feature_dict[name])
                elif isinstance(name, int):
                    categorical_indices.add(name)
                else:
                    raise TypeError(f"Wrong type({type(name).__name__}) or unknown name({name}) in categorical_feature")
            if categorical_indices:
                for cat_alias in _ConfigAliases.get("categorical_feature"):
                    if cat_alias in params:
                        _log_warning(f'{cat_alias} in param dict is overridden.')
                        params.pop(cat_alias, None)
                params['categorical_column'] = sorted(categorical_indices)
        params_str = param_dict_to_str(params)
        self.params = params

For this one, my suggestion is to ignore categorical features when comparing the params.
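
A rough sketch of what "ignore categorical features when comparing the params" could look like, in plain Python. This is an illustration under assumptions, not the actual patch: `without_categorical` and `params_differ` are hypothetical helpers, the alias set is written out by hand (in LightGBM it would come from `_ConfigAliases.get("categorical_feature")`), and the example dicts are made up.

```python
# Assumed alias set, written out for the sake of a self-contained example.
CATEGORICAL_ALIASES = {"categorical_feature", "cat_feature", "categorical_column", "cat_column"}


def without_categorical(params):
    """Return a copy of params with every categorical alias dropped."""
    return {key: value for key, value in params.items() if key not in CATEGORICAL_ALIASES}


def params_differ(own_params, reference_params):
    """True when the dicts differ in something other than categorical settings."""
    return without_categorical(own_params) != without_categorical(reference_params)


# The reference (training) Dataset has had categorical_column injected into its params,
# while the Dataset that points at it has not, so a plain comparison always differs.
reference_params = {"max_bin": 255, "categorical_column": [0, 3]}
valid_params = {"max_bin": 255}

print(valid_params != reference_params)                   # plain comparison
print(params_differ(valid_params, reference_params))      # categorical keys ignored
print(params_differ({"max_bin": 63}, reference_params))   # a genuine difference
```

With a comparison like this, the "Overriding the parameters from Reference Dataset." warning would still fire for a genuine mismatch such as a different max_bin.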

shiyu1994 · 2021-11-01T14:15:40Z

Thanks for your detailed analysis. I think the proposed solution is feasible!

thisisreallife · 2022-06-09T05:58:33Z

As a workaround I did the following:

class SilentRegressor(lgb.LGBMRegressor):
    def fit(self, *args, **kwargs):
        with warnings.catch_warnings():
            warnings.filterwarnings("ignore", category=UserWarning)
            return super().fit(*args, verbose=False, **kwargs)

The following code also works if we do not want to create a new class:

import warnings
warnings.filterwarnings("ignore", category=UserWarning)
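
Note that the filter above hides every UserWarning in the process, including ones from other libraries. A narrower variant, sketched here with the standard `warnings` module only, matches on the message text quoted in this thread. The helper name `silence_categorical_warnings` is hypothetical, the demonstration uses stand-in warnings rather than a LightGBM call, and it assumes the messages are raised as UserWarning, as the logs above show.

```python
import re
import warnings

# Messages quoted in this thread; re.escape keeps the trailing "." literal.
NOISY_MESSAGES = [
    "Using categorical_feature in Dataset.",
    "categorical_feature in Dataset is overridden.",
    "Overriding the parameters from Reference Dataset.",
]


def silence_categorical_warnings():
    """Ignore only UserWarnings whose text starts with one of NOISY_MESSAGES."""
    for message in NOISY_MESSAGES:
        warnings.filterwarnings("ignore", message=re.escape(message), category=UserWarning)


# Demonstration with stand-in warnings (no LightGBM call is made here).
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    silence_categorical_warnings()
    warnings.warn("Using categorical_feature in Dataset.", UserWarning)
    warnings.warn("an unrelated problem", UserWarning)

print([str(w.message) for w in caught])
```

The `message` argument is a regular expression matched against the start of the warning text, so the list has to be kept in sync with the wording of the installed LightGBM version.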

Nevermetyou65 · 2023-03-30T02:54:00Z

Is this problem solved?
I still get this warning even when I set auto both when constructing the dataset and in train. It's so annoying.

github-actions · 2023-08-19T03:01:09Z

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

jameslamb added the good first issue label Sep 11, 2020

StrikerRUS mentioned this issue Dec 4, 2020

Categorical features in sklearn API warning #3625

Closed

memeplex mentioned this issue Dec 10, 2020

Spurious warning when passing valid_sets #3640

Closed

memeplex mentioned this issue Dec 11, 2020

Warning shown with verbosity=-1 #3641

Closed

shiyu1994 self-assigned this Mar 24, 2021

StrikerRUS mentioned this issue Jul 10, 2021

[Question] Passing categorical feature with data-type "Category" without passing "categorical_feature" #4460

Closed

scarlett2018 mentioned this issue Oct 20, 2021

[Draft] Oct~Nov iteration Plan #4677

Closed

16 tasks

hzy46 self-assigned this Oct 29, 2021

This was referenced Nov 2, 2021

Suppress categorical warning (fixes #3379) #4762

Closed

Suppress categorical warning (fixes #3379) #4768

Merged

shiyu1994 closed this as completed in #4768 Nov 8, 2021

shiyu1994 pushed a commit that referenced this issue Nov 8, 2021

Suppress categorical warning (fixes #3379)

b1facf5

StrikerRUS mentioned this issue Jan 6, 2022

[DO NOT MERGE] Release 3.3.2 #4930

Closed

13 tasks

jameslamb mentioned this issue Oct 7, 2022

[DO NOT MERGE] Release v3.3.3 #5525

Closed

40 tasks

jameslamb mentioned this issue Feb 1, 2023

[Warning] UserWarning: Using categorical_feature in Dataset. #3718

Closed

github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
