Results depend on categorical labels #3273

Closed · FiksII opened this issue Aug 4, 2020 · 24 comments · Fixed by #3305

Comments

@FiksII commented Aug 4, 2020

Python package
LightGBM component: lightgbm==2.3.1

Environment info
Operating System: Windows 10
Python version: 3.7.6

The results (for example, from cv) depend on the categorical labels. As an example, I took the Boston dataset and added an artificial categorical feature with the values ['low', 'mid', 'high']. Then I encoded it two ways:
dataset_1 is created from the {'low': 0, 'mid': 1, 'high': 2} encoding, dataset_2 from the {'low': 1, 'mid': 0, 'high': 2} encoding.
And the CV results are different.

import sklearn
from sklearn import datasets
import pandas

import lightgbm

lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42
}

boston_dataset = sklearn.datasets.load_boston()

df_data = pandas.DataFrame(boston_dataset['data'], columns=boston_dataset['feature_names'])
df_target = pandas.Series(boston_dataset['target'])

df_data['category'] = pandas.cut(df_target, bins=[0, 20, 30, 80], labels=['low', 'mid', 'high'])
df_data_1 = df_data.copy()
df_data_1['category'] = df_data_1['category'].replace({'low': 0, 'mid': 1, 'high': 2}).astype(int)

for _ in range(10):
    dataset_1 = lightgbm.Dataset(data=df_data_1, label=boston_dataset['target'], categorical_feature=['category'])
    cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
            metrics=['l1'], stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038

df_data_2 = df_data.copy()
df_data_2['category'] = df_data_2['category'].replace({'low': 1, 'mid': 0, 'high': 2}).astype('int')
for _ in range(10):
    dataset_2 = lightgbm.Dataset(data=df_data_2, label=boston_dataset['target'], categorical_feature=['category'])
    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
            metrics=['l1'], stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021

@guolinke (Collaborator) commented Aug 5, 2020

@shiyu1994 could you help to investigate this?

@drexk commented Aug 12, 2020

Is there any update on this issue?

We're also facing it in our production system and need it resolved urgently. I hope we can get an estimate of the time frame... thank you very much...

@guolinke (Collaborator)

@drexk @FiksII you can try encoding the categories starting from 1 instead of 0.
We have some special handling for category 0.

test results:

>>> df_data_1 = df_data.copy()
>>> df_data_1['category'] = df_data_1['category'].replace({'low': 1, 'mid': 2, 'high': 3}).astype(int)
>>>
>>> for i in range(10):
...     dataset_1 = lightgbm.Dataset(data=df_data_1, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15, metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579
>>>
>>> df_data_2 = df_data.copy()
>>> df_data_2['category'] = df_data_2['category'].replace({'low': 2, 'mid': 1, 'high': 3}).astype('int')
>>> for i in range(10):
...     dataset_2 = lightgbm.Dataset(data=df_data_2, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15,  metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579

@FiksII (Author) commented Aug 12, 2020

OK, it seems to work. In that case, the docs need to be fixed.

[screenshot of the documentation]

@drexk commented Aug 12, 2020

Thank you for attending to this bug quickly. I'll have a quick word with my team about this. Thanks!

@guolinke (Collaborator)

@FiksII @drexk no problem.

@shiyu1994 I think this is a potential bug. It may be caused by the prediction, the data partitioning, or the feature histogram construction.

@FiksII (Author) commented Aug 13, 2020

Unfortunately, the bug still exists. I could not find an example with a smaller number of samples.
failed_example.csv.gz
The code is similar:

import pandas
import lightgbm
lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42,
    'metric': 'l1',
}
df_data = pandas.read_csv('failed_example.csv.gz')
target = df_data['target']
df_data = df_data[['category']]

df_data['category'].value_counts()

g 131659
k 107758
c 41485
a 24077
e 15251
b 4627
i 4103
m 3368
f 3049
d 2362
h 2034
j 172
l 11
Name: category, dtype: int64

df_data_1 = df_data.copy()
df_data_1['category'] = df_data_1['category'].replace(
    dict(zip('abcdefghijklm', range(1, 14)))
).astype('int')

dataset_1 = lightgbm.Dataset(data=df_data_1, feature_name=list(df_data_1.columns), 
                             label=target, categorical_feature=['category'])

cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                        stratified=False)
metric_name = next(iter(cv_result.keys()))

print(len(cv_result[metric_name]), cv_result[metric_name][-1])

C:\Miniconda3\lib\site-packages\lightgbm\basic.py:1291: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
171 3.889806896170679

df_data_2 = df_data.copy()
df_data_2['category'] = df_data_2['category'].replace(
    dict(zip('hiefgjklmabcd', range(1, 14)))
).astype('int')

dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                             label=target, categorical_feature=['category'])

cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                        stratified=False)

metric_name = next(iter(cv_result.keys()))
print(len(cv_result[metric_name]), cv_result[metric_name][-1])

269 3.4013529200176293

Just to check:

df_data_1['category'].value_counts()
7 131659
11 107758
3 41485
1 24077
5 15251
2 4627
9 4103
13 3368
6 3049
4 2362
8 2034
10 172
12 11
Name: category, dtype: int64
df_data_2['category'].value_counts()
5 131659
7 107758
12 41485
10 24077
3 15251
11 4627
2 4103
9 3368
4 3049
13 2362
1 2034
6 172
8 11
Name: category, dtype: int64

@FiksII (Author) commented Aug 13, 2020

Or, more generally:

import random

random.seed(111)

for i in range(15):
    categories = list('abcdefghijklm')
    random.shuffle(categories)
    
    df_data_2 = df_data.copy()
    df_data_2['category'] = df_data_2['category'].replace(
        dict(zip(categories, range(1, 14)))
    ).astype('int')

    dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                                 label=target, categorical_feature=['category'])

    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                            stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(''.join(categories), len(cv_result[metric_name]), cv_result[metric_name][-1])

kacjlbeigmhfd 171 3.889806896170679
lfaimbdejghkc 171 3.889806896170679
cfbielaghjdmk 171 3.889806896170679
gbeciljakmdhf 269 3.4013529200176293
keglfdjacbimh 269 3.4013529200176293
bjadmfegikclh 171 3.889806896170679
bkhlgdafejimc 269 3.4013529200176293
mlcahkeidfgbj 171 3.889806896170679
cfmhjlaiebkdg 171 3.889806896170679
bcgjlfmidkaeh 269 3.4013529200176293
hajgmbckdlefi 171 3.889806896170679
hlcbedamfjkig 171 3.889806896170679
jgmcelfhidakb 269 3.4013529200176293
dkgjihlceamfb 269 3.4013529200176293
jdblmickegfha 269 3.4013529200176293

@drexk commented Aug 13, 2020

What heartbreaking news. Our production system is bleeding money every day because of this bug. @FiksII can you recommend an alternative machine learning library that my team can implement and switch to temporarily while waiting for this unfortunate disaster to get fixed? Thanks!!!

@FiksII (Author) commented Aug 13, 2020

@drexk Well, it's an open-source project, so we can't put pressure on them; I hope the developers are trying their best. This library is one of the best compared with all the others; that's why I'm using it. Good luck with your production system. I hope your team survives.

@FiksII (Author) commented Aug 13, 2020

OK... It seems this bug is fixed in 3.0.0. @drexk

@guolinke (Collaborator)

@FiksII @drexk
A more reliable solution for categorical features is to use a numerical encoding, e.g. https://contrib.scikit-learn.org/category_encoders/ .
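
For illustration, a minimal sketch of that suggestion using the category_encoders package (TargetEncoder is just one of its encoders; df_data and target here are the variables from the reproduction script above):

import category_encoders as ce

# Replace each category with a smoothed mean of the target, so the resulting
# numeric column no longer depends on the arbitrary integer labels.
encoder = ce.TargetEncoder(cols=['category'])
df_encoded = encoder.fit_transform(df_data[['category']], target)

# The encoded column is ordinary numeric data, so categorical_feature is no
# longer needed.
dataset = lightgbm.Dataset(data=df_encoded, label=target)

Note that in a real CV setup the encoder should be fit on the training folds only, to avoid target leakage.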

@guolinke (Collaborator)

@FiksII
To confirm: is the problem solved in 3.0.0?

@FiksII (Author) commented Aug 13, 2020

@guolinke

> A more reliable solution for categorical features is to use a numerical encoding, e.g. https://contrib.scikit-learn.org/category_encoders/ .

Of course, this is not production code. But sometimes our categories are linked to different numbers; that's how I found this bug.

> To confirm: is the problem solved in 3.0.0?

So far, yes. But I need more time for testing.

@guolinke
Copy link
Collaborator

From the code, I think mapping start from 1 should be identical for any mappings, since LightGBM will re-map it in training, according to the count of different categories.
The possible reason is that some categories have the same count, in this case, the one with the smaller index will be used first, and then may cause different results.
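
A toy sketch of that tie-breaking effect (an illustration of the idea only, not LightGBM's actual re-mapping code):

from collections import Counter

def remap_by_count(values):
    # Order categories by descending count; ties fall back to the encoded
    # value, which is exactly what makes the result encoding-dependent.
    counts = Counter(values)
    order = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: rank for rank, v in enumerate(order)}

raw = ['low', 'low', 'mid', 'mid', 'high']   # 'low' and 'mid' have equal counts
enc_a = {'low': 1, 'mid': 2, 'high': 3}
enc_b = {'low': 2, 'mid': 1, 'high': 3}

print(remap_by_count([enc_a[v] for v in raw]))  # {1: 0, 2: 1, 3: 2}
print(remap_by_count([enc_b[v] for v in raw]))  # {1: 0, 2: 1, 3: 2}
# The rank dictionaries look identical, but under enc_a rank 0 is 'low' while
# under enc_b it is 'mid': the tie between 'low' and 'mid' is broken by the
# arbitrary encoded index, so the two encodings train on different bin orders.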

@drexk commented Aug 14, 2020

I appreciate all the pointers I get in this discussion.

@FiksII (Author) commented Aug 14, 2020

@guolinke Confirmed. I have an example where many categories have only one sample, and the results are different.

@guolinke (Collaborator)

@FiksII Low-frequency categories are filtered out in LightGBM, so consistency is not guaranteed for them.
Using categories with freq=1 will often result in over-fitting.
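
As a hedged illustration, these documented LightGBM parameters control how rare categories are treated; raising them reduces the influence of freq=1 categories (lgb_params and dataset_1 refer to the reproduction script earlier in this thread, and the values below are illustrative, not recommendations):

params_robust = dict(lgb_params)
params_robust.update({
    'min_data_per_group': 200,  # minimal data per categorical group (default 100)
    'cat_smooth': 20.0,         # smoothing for categorical splits (default 10)
    'cat_l2': 20.0,             # extra L2 regularization for categorical splits (default 10)
})

cv_result = lightgbm.cv(params_robust, dataset_1, num_boost_round=500, nfold=2,
                        seed=23, early_stopping_rounds=15, stratified=False)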

@FiksII (Author) commented Aug 14, 2020

@guolinke Yes, I understand. But it still looks strange. I don't know the details, but theoretically, splitting on a categorical feature is based on a metric (Gini, for example) and should not depend on the values of the categories. As I remember, such a tree node contains a boolean condition like category == 1||10||15||25||26. Of course, if the node had a condition like category < 26, there would be no question.
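
One hedged way to see those boolean conditions is to dump a trained model and inspect the split nodes (field names follow Booster.dump_model(); the exact threshold format may vary across LightGBM versions, and the data here is synthetic):

import numpy as np
import pandas as pd
import lightgbm

# Synthetic data: categories 2 and 4 have a clearly higher target, so the
# first split should group them together.
rng = np.random.default_rng(0)
X = pd.DataFrame({'category': rng.integers(1, 6, size=500)})
y = X['category'].map({1: 0.0, 2: 5.0, 3: 0.0, 4: 5.0, 5: 0.0}) + rng.normal(0, 0.1, 500)

booster = lightgbm.train(
    {'objective': 'regression', 'verbose': -1, 'min_data_per_group': 1},
    lightgbm.Dataset(X, label=y, categorical_feature=['category']))

root = booster.dump_model()['tree_info'][0]['tree_structure']
print(root['decision_type'], root['threshold'])
# Expected: '==' with a '||'-joined set of category bins, i.e. exactly the
# "category == a||b||c" style of condition described above.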

@FiksII (Author) commented Aug 14, 2020

@guolinke Just in case, this is another failed example.
failed_example_2.csv.gz

LightGBM commit 1804fd1

import pandas
import lightgbm
import random

lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42,
    'metric': 'l1',
}

df_data = pandas.read_csv('failed_example_2.csv.gz')

target = df_data['target']
df_data = df_data[['category']]

categories = list(set(df_data['category']))

random.seed(111)

for i in range(15):
    random.shuffle(categories)
    
    df_data_2 = df_data.copy()
    df_data_2['category'] = df_data_2['category'].replace(
        dict(zip(categories, range(1, len(categories)+1)))
    ).astype('int')

    dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                                 label=target, categorical_feature=['category'])

    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                            stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

35 4.0142106708262695
171 3.9235411538859406
16 4.01938819823927
35 4.0142106708262695
35 4.0142106708262695
178 3.9220688592556137
150 3.705064934135586
35 4.0142106708262695
16 4.0193881982392705
35 4.0142106708262695
150 3.705064934135586
178 3.9220688592556128
16 4.0193881982392705
178 3.922068859255613
150 3.7050649341355864

@guolinke (Collaborator)

#3305 should fix the problem with zero bins.

I am investigating your example now, @FiksII.

@guolinke (Collaborator)

@FiksII Why do you use such a small subsample_for_bin? This is very unstable.
subsample_for_bin is the number of samples used to construct the feature bins; using only 5 samples is very strange.

Besides, you can use a larger min_data_in_bin to reduce the side effect of categories with the same count.

A test with your example:

>>> import pandas
>>> import lightgbm
>>> import random
>>>
>>> lgb_params = {
...     'boosting_type': 'gbrt',
...     'learning_rate': 0.1,
...     'max_depth': -1,
...     'min_child_samples': 20,
...     'min_child_weight': 0.001,
...     'min_split_gain': 0.0,
...     'n_jobs': -1,
...     'num_leaves': 1023,
...     'reg_alpha': 0.0,
...     'reg_lambda': 0.0,
...     'subsample': 1.0,
...     'subsample_freq': 0,
...     'verbose': -1,
...     'seed': 42,
...     'metric': 'l1',
...     'min_data_in_bin': 50,
... }
>>>
>>> df_data = pandas.read_csv('failed_example_2.csv.gz')
>>>
>>> target = df_data['target']
>>> df_data = df_data[['category']]
>>>
>>> categories = list(set(df_data['category']))
>>>
>>> random.seed(111)
>>>
>>> for i in range(15):
...     random.shuffle(categories)
...     df_data_2 = df_data.copy()
...     df_data_2['category'] = df_data_2['category'].replace(
...         dict(zip(categories, range(1, len(categories)+1)))
...     ).astype('int')
...     dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns),
...                                  label=target, categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,
...                             stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
168 1.999748467754571
168 1.9997484677548742
168 1.9997484677548742
168 1.9997484677548742
168 1.9997484677548742

@guolinke (Collaborator)

@FiksII

> Yes, I understand. But it still looks strange. I don't know the details, but theoretically, splitting on a categorical feature is based on a metric (Gini, for example) and should not depend on the values of the categories. As I remember, such a tree node contains a boolean condition like category == 1||10||15||25||26. Of course, if the node had a condition like category < 26, there would be no question.

Split finding for a categorical feature first sorts the categories, then searches for the split from left to right (and from right to left).
Therefore, categories with the same count may end up in a different sort order under different encodings, which can slightly change the result.
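
A simplified sketch of that sorted split search for regression (an illustration only, not LightGBM's actual implementation, which sorts by gradient statistics and also scans from right to left):

def best_categorical_split(cats, y):
    # Accumulate per-category target sums and counts.
    stats = {}
    for c, t in zip(cats, y):
        s = stats.setdefault(c, [0.0, 0])
        s[0] += t
        s[1] += 1
    # Sort categories by mean target; ties fall back to the category index,
    # which is where different encodings can change the scan order.
    order = sorted(stats, key=lambda c: (stats[c][0] / stats[c][1], c))
    total_sum, total_cnt = float(sum(y)), len(y)
    best_gain, best_subset = float('-inf'), None
    left_sum, left_cnt = 0.0, 0
    for i, c in enumerate(order[:-1]):  # scan prefixes left to right
        left_sum += stats[c][0]
        left_cnt += stats[c][1]
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        # Standard variance-reduction proxy: larger means a better split.
        gain = left_sum ** 2 / left_cnt + right_sum ** 2 / right_cnt
        if gain > best_gain:
            best_gain, best_subset = gain, set(order[:i + 1])
    return best_subset, best_gain

cats = [1, 1, 2, 2, 3, 3, 3]
y = [1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0]
print(best_categorical_split(cats, y))  # ({1, 3}, 55.0): categories 1 and 3 tie on mean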

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023