Results depend on categorical labels #3273

Closed · FiksII opened this issue Aug 4, 2020 · 24 comments · Fixed by #3305

Comments

@FiksII commented Aug 4, 2020

Python package
LightGBM component: lightgbm==2.3.1

Environment info
Operating System: Windows 10
Python version: 3.7.6

The results (for example, from cv) depend on the categorical labels. As an example, I took the Boston dataset and added an artificial categorical feature with the values ['low', 'mid', 'high']. Then I encoded it two ways:
dataset_1 is created from the {'low': 0, 'mid': 1, 'high': 2} encoding, dataset_2 from the {'low': 1, 'mid': 0, 'high': 2} encoding.
And the CV results are different.

import sklearn
from sklearn import datasets
import pandas

import lightgbm

lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42
}

boston_dataset = sklearn.datasets.load_boston()

df_data = pandas.DataFrame(boston_dataset['data'], columns=boston_dataset['feature_names'])
df_target = pandas.Series(boston_dataset['target'])

df_data['category'] = pandas.cut(df_target, bins=[0, 20, 30, 80], labels=['low', 'mid', 'high'])
df_data_1 = df_data.copy()
df_data_1['category'] = df_data_1['category'].replace({'low': 0, 'mid': 1, 'high': 2}).astype(int)

for _ in range(10):
    dataset_1 = lightgbm.Dataset(data=df_data_1, label=boston_dataset['target'], categorical_feature=['category'])
    cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
            metrics=['l1'], stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038
91 2.982193080452038

df_data_2 = df_data.copy()
df_data_2['category'] = df_data_2['category'].replace({'low': 1, 'mid': 0, 'high': 2}).astype('int')
for _ in range(10):
    dataset_2 = lightgbm.Dataset(data=df_data_2, label=boston_dataset['target'], categorical_feature=['category'])
    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
            metrics=['l1'], stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021
76 3.20606765690021

@guolinke (Collaborator) commented Aug 5, 2020

@shiyu1994 could you help to investigate this?

@drexk commented Aug 12, 2020

Is there any update on this issue?

We're also facing it in our production system and need it resolved urgently. I hope we can get an estimate of the time frame... thank you very much...

@guolinke (Collaborator)

@drexk @FiksII you can try encoding the categories starting from 1 instead of 0.
We have some special handling for category 0.

test results:

>>> df_data_1 = df_data.copy()
>>> df_data_1['category'] = df_data_1['category'].replace({'low': 1, 'mid': 2, 'high': 3}).astype(int)
>>>
>>> for i in range(10):
...     dataset_1 = lightgbm.Dataset(data=df_data_1, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15, metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579
>>>
>>> df_data_2 = df_data.copy()
>>> df_data_2['category'] = df_data_2['category'].replace({'low': 2, 'mid': 1, 'high': 3}).astype('int')
>>> for i in range(10):
...     dataset_2 = lightgbm.Dataset(data=df_data_2, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15,  metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579

@FiksII (Author) commented Aug 12, 2020

OK, it seems to work. In that case, the docs need to be fixed.

[screenshot of the documentation]

@drexk commented Aug 12, 2020

Thank you for attending to this bug quickly. I'll have a quick word with my team about this. Thanks!

@guolinke (Collaborator)

@FiksII @drexk no problem.

@shiyu1994 I think this is a potential bug. It may be caused by the prediction, the data partitioning, or the feature histogram construction.

@FiksII (Author) commented Aug 13, 2020

Unfortunately, the bug still exists. I could not find an example with a smaller number of samples.
failed_example.csv.gz
The code is similar:

import pandas
import lightgbm
lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42,
    'metric': 'l1',
}
df_data = pandas.read_csv('failed_example.csv.gz')
target = df_data['target']
df_data = df_data[['category']]

df_data['category'].value_counts()

g 131659
k 107758
c 41485
a 24077
e 15251
b 4627
i 4103
m 3368
f 3049
d 2362
h 2034
j 172
l 11
Name: category, dtype: int64

df_data_1 = df_data.copy()
df_data_1['category'] = df_data_1['category'].replace(
    dict(zip('abcdefghijklm', range(1, 14)))
).astype('int')

dataset_1 = lightgbm.Dataset(data=df_data_1, feature_name=list(df_data_1.columns), 
                             label=target, categorical_feature=['category'])

cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                        stratified=False)
metric_name = next(iter(cv_result.keys()))

print(len(cv_result[metric_name]), cv_result[metric_name][-1])

C:\Miniconda3\lib\site-packages\lightgbm\basic.py:1291: UserWarning: Using categorical_feature in Dataset.
warnings.warn('Using categorical_feature in Dataset.')
171 3.889806896170679

df_data_2 = df_data.copy()
df_data_2['category'] = df_data_2['category'].replace(
    dict(zip('hiefgjklmabcd', range(1, 14)))
).astype('int')

dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                             label=target, categorical_feature=['category'])

cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                        stratified=False)

metric_name = next(iter(cv_result.keys()))
print(len(cv_result[metric_name]), cv_result[metric_name][-1])

269 3.4013529200176293

Just to check:

df_data_1['category'].value_counts()
7 131659
11 107758
3 41485
1 24077
5 15251
2 4627
9 4103
13 3368
6 3049
4 2362
8 2034
10 172
12 11
Name: category, dtype: int64
df_data_2['category'].value_counts()
5 131659
7 107758
12 41485
10 24077
3 15251
11 4627
2 4103
9 3368
4 3049
13 2362
1 2034
6 172
8 11
Name: category, dtype: int64

@FiksII (Author) commented Aug 13, 2020

Or, more generally:

import random

random.seed(111)

for i in range(15):
    categories = list('abcdefghijklm')
    random.shuffle(categories)
    
    df_data_2 = df_data.copy()
    df_data_2['category'] = df_data_2['category'].replace(
        dict(zip(categories, range(1, 14)))
    ).astype('int')

    dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                                 label=target, categorical_feature=['category'])

    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                            stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(''.join(categories), len(cv_result[metric_name]), cv_result[metric_name][-1])

kacjlbeigmhfd 171 3.889806896170679
lfaimbdejghkc 171 3.889806896170679
cfbielaghjdmk 171 3.889806896170679
gbeciljakmdhf 269 3.4013529200176293
keglfdjacbimh 269 3.4013529200176293
bjadmfegikclh 171 3.889806896170679
bkhlgdafejimc 269 3.4013529200176293
mlcahkeidfgbj 171 3.889806896170679
cfmhjlaiebkdg 171 3.889806896170679
bcgjlfmidkaeh 269 3.4013529200176293
hajgmbckdlefi 171 3.889806896170679
hlcbedamfjkig 171 3.889806896170679
jgmcelfhidakb 269 3.4013529200176293
dkgjihlceamfb 269 3.4013529200176293
jdblmickegfha 269 3.4013529200176293

@drexk commented Aug 13, 2020

What heartbreaking news. Our production system is bleeding money every day because of this bug. @FiksII can you recommend an alternative machine learning library that my team can implement and switch to temporarily while waiting for this unfortunate disaster to get fixed? Thanks!!!

@FiksII (Author) commented Aug 13, 2020

@drexk Well, it's an open-source project, so we can't put pressure on them; I hope the developers are trying their best. This library is one of the best compared with all the others; that's why I'm using it. Good luck with your production system. I hope your team survives.

@FiksII (Author) commented Aug 13, 2020

OK... It seems this bug is fixed in 3.0.0. @drexk

@guolinke (Collaborator)

@FiksII @drexk
A more reliable solution for categorical features is to use a numerical encoding, e.g. https://contrib.scikit-learn.org/category_encoders/ .
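
For illustration, a minimal sketch of that suggestion using the category_encoders package (TargetEncoder is just one of its encoders; df_data and target here are the variables from the reproduction script above):

import category_encoders as ce

# Replace each category with a smoothed mean of the target, so the resulting
# numeric column no longer depends on the arbitrary integer labels.
encoder = ce.TargetEncoder(cols=['category'])
df_encoded = encoder.fit_transform(df_data[['category']], target)

# The encoded column is ordinary numeric data, so categorical_feature is no
# longer needed.
dataset = lightgbm.Dataset(data=df_encoded, label=target)

Note that in a real CV setup the encoder should be fit on the training folds only, to avoid target leakage.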

@guolinke (Collaborator)

@FiksII
To confirm: is the problem solved in 3.0.0?

@FiksII (Author) commented Aug 13, 2020

@guolinke

> A more reliable solution for categorical features is to use a numerical encoding, e.g. https://contrib.scikit-learn.org/category_encoders/ .

Of course, this is not production code. But sometimes our categories are linked to different numbers; that's how I found this bug.

> To confirm: is the problem solved in 3.0.0?

So far, yes. But I need more time for testing.

@guolinke
Copy link
Collaborator

From the code, I think mapping start from 1 should be identical for any mappings, since LightGBM will re-map it in training, according to the count of different categories.
The possible reason is that some categories have the same count, in this case, the one with the smaller index will be used first, and then may cause different results.
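
A toy sketch of that tie-breaking effect (an illustration of the idea only, not LightGBM's actual re-mapping code):

from collections import Counter

def remap_by_count(values):
    # Order categories by descending count; ties fall back to the encoded
    # value, which is exactly what makes the result encoding-dependent.
    counts = Counter(values)
    order = sorted(counts, key=lambda v: (-counts[v], v))
    return {v: rank for rank, v in enumerate(order)}

raw = ['low', 'low', 'mid', 'mid', 'high']   # 'low' and 'mid' have equal counts
enc_a = {'low': 1, 'mid': 2, 'high': 3}
enc_b = {'low': 2, 'mid': 1, 'high': 3}

print(remap_by_count([enc_a[v] for v in raw]))  # {1: 0, 2: 1, 3: 2}
print(remap_by_count([enc_b[v] for v in raw]))  # {1: 0, 2: 1, 3: 2}
# The rank dictionaries look identical, but under enc_a rank 0 is 'low' while
# under enc_b it is 'mid': the tie between 'low' and 'mid' is broken by the
# arbitrary encoded index, so the two encodings train on different bin orders.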

@drexk commented Aug 14, 2020

I appreciate all the pointers I get in this discussion.

@FiksII (Author) commented Aug 14, 2020

@guolinke Confirmed. I have an example where many categories have only one sample, and the results are different.

@guolinke (Collaborator)

@FiksII Low-frequency categories are filtered out in LightGBM, so consistency is not guaranteed for them.
Using categories with freq=1 will often result in over-fitting.
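
As a hedged illustration, these documented LightGBM parameters control how rare categories are treated; raising them reduces the influence of freq=1 categories (lgb_params and dataset_1 refer to the reproduction script earlier in this thread, and the values below are illustrative, not recommendations):

params_robust = dict(lgb_params)
params_robust.update({
    'min_data_per_group': 200,  # minimal data per categorical group (default 100)
    'cat_smooth': 20.0,         # smoothing for categorical splits (default 10)
    'cat_l2': 20.0,             # extra L2 regularization for categorical splits (default 10)
})

cv_result = lightgbm.cv(params_robust, dataset_1, num_boost_round=500, nfold=2,
                        seed=23, early_stopping_rounds=15, stratified=False)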

@FiksII (Author) commented Aug 14, 2020

@guolinke Yes, I understand. But it still looks strange. I don't know the details, but theoretically, splitting on a categorical feature is based on a metric (Gini, for example) and should not depend on the values of the categories. As I remember, such a tree node contains a boolean condition like category == 1||10||15||25||26. Of course, if the node had a condition like category < 26, there would be no question.
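
One hedged way to see those boolean conditions is to dump a trained model and inspect the split nodes (field names follow Booster.dump_model(); the exact threshold format may vary across LightGBM versions, and the data here is synthetic):

import numpy as np
import pandas as pd
import lightgbm

# Synthetic data: categories 2 and 4 have a clearly higher target, so the
# first split should group them together.
rng = np.random.default_rng(0)
X = pd.DataFrame({'category': rng.integers(1, 6, size=500)})
y = X['category'].map({1: 0.0, 2: 5.0, 3: 0.0, 4: 5.0, 5: 0.0}) + rng.normal(0, 0.1, 500)

booster = lightgbm.train(
    {'objective': 'regression', 'verbose': -1, 'min_data_per_group': 1},
    lightgbm.Dataset(X, label=y, categorical_feature=['category']))

root = booster.dump_model()['tree_info'][0]['tree_structure']
print(root['decision_type'], root['threshold'])
# Expected: '==' with a '||'-joined set of category bins, i.e. exactly the
# "category == a||b||c" style of condition described above.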

@FiksII (Author) commented Aug 14, 2020

@guolinke Just in case, this is another failed example.
failed_example_2.csv.gz

LightGBM commit 1804fd1

import pandas
import lightgbm
import random

lgb_params = {
    'boosting_type': 'gbrt',
    'learning_rate': 0.1,
    'max_depth': -1,
    'min_child_samples': 20,
    'min_child_weight': 0.001,
    'min_split_gain': 0.0,
    'n_jobs': -1,
    'num_leaves': 1023,
    'reg_alpha': 0.0,
    'reg_lambda': 0.0,
    'subsample': 1.0,
    'subsample_for_bin': 5,
    'subsample_freq': 0,
    'verbose': -1,
    'seed': 42,
    'metric': 'l1',
}

df_data = pandas.read_csv('failed_example_2.csv.gz')

target = df_data['target']
df_data = df_data[['category']]

categories = list(set(df_data['category']))

random.seed(111)

for i in range(15):
    random.shuffle(categories)
    
    df_data_2 = df_data.copy()
    df_data_2['category'] = df_data_2['category'].replace(
        dict(zip(categories, range(1, len(categories)+1)))
    ).astype('int')

    dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns), 
                                 label=target, categorical_feature=['category'])

    cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,  
                            stratified=False)

    metric_name = next(iter(cv_result.keys()))
    print(len(cv_result[metric_name]), cv_result[metric_name][-1])

35 4.0142106708262695
171 3.9235411538859406
16 4.01938819823927
35 4.0142106708262695
35 4.0142106708262695
178 3.9220688592556137
150 3.705064934135586
35 4.0142106708262695
16 4.0193881982392705
35 4.0142106708262695
150 3.705064934135586
178 3.9220688592556128
16 4.0193881982392705
178 3.922068859255613
150 3.7050649341355864

@guolinke (Collaborator)

#3305 should fix the problem with zero bins.

I am investigating your example now, @FiksII.

@guolinke (Collaborator)

@FiksII Why do you use such a small subsample_for_bin? This is very unstable.
subsample_for_bin is the number of samples used to construct the feature bins; using only 5 samples is very strange.

Besides, you can use a larger min_data_in_bin to reduce the side effect of categories with the same count.

A test with your example:

>>> import pandas
>>> import lightgbm
>>> import random
>>>
>>> lgb_params = {
...     'boosting_type': 'gbrt',
...     'learning_rate': 0.1,
...     'max_depth': -1,
...     'min_child_samples': 20,
...     'min_child_weight': 0.001,
...     'min_split_gain': 0.0,
...     'n_jobs': -1,
...     'num_leaves': 1023,
...     'reg_alpha': 0.0,
...     'reg_lambda': 0.0,
...     'subsample': 1.0,
...     'subsample_freq': 0,
...     'verbose': -1,
...     'seed': 42,
...     'metric': 'l1',
...     'min_data_in_bin': 50,
... }
>>>
>>> df_data = pandas.read_csv('failed_example_2.csv.gz')
>>>
>>> target = df_data['target']
>>> df_data = df_data[['category']]
>>>
>>> categories = list(set(df_data['category']))
>>>
>>> random.seed(111)
>>>
>>> for i in range(15):
...     random.shuffle(categories)
...     df_data_2 = df_data.copy()
...     df_data_2['category'] = df_data_2['category'].replace(
...         dict(zip(categories, range(1, len(categories)+1)))
...     ).astype('int')
...     dataset_2 = lightgbm.Dataset(data=df_data_2, feature_name=list(df_data_2.columns),
...                                  label=target, categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=23, early_stopping_rounds=15,
...                             stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
168 1.999748467754571
168 1.9997484677548742
168 1.9997484677548742
168 1.9997484677548742
168 1.9997484677548742

@guolinke (Collaborator)

@FiksII

> Yes, I understand. But it still looks strange. I don't know the details, but theoretically, splitting on a categorical feature is based on a metric (Gini, for example) and should not depend on the values of the categories. As I remember, such a tree node contains a boolean condition like category == 1||10||15||25||26. Of course, if the node had a condition like category < 26, there would be no question.

Split finding for a categorical feature first sorts the categories, then searches for the split from left to right (and from right to left).
Therefore, categories with the same count may end up in a different sort order under different encodings, which can slightly change the result.
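
A simplified sketch of that sorted split search for regression (an illustration only, not LightGBM's actual implementation, which sorts by gradient statistics and also scans from right to left):

def best_categorical_split(cats, y):
    # Accumulate per-category target sums and counts.
    stats = {}
    for c, t in zip(cats, y):
        s = stats.setdefault(c, [0.0, 0])
        s[0] += t
        s[1] += 1
    # Sort categories by mean target; ties fall back to the category index,
    # which is where different encodings can change the scan order.
    order = sorted(stats, key=lambda c: (stats[c][0] / stats[c][1], c))
    total_sum, total_cnt = float(sum(y)), len(y)
    best_gain, best_subset = float('-inf'), None
    left_sum, left_cnt = 0.0, 0
    for i, c in enumerate(order[:-1]):  # scan prefixes left to right
        left_sum += stats[c][0]
        left_cnt += stats[c][1]
        right_sum, right_cnt = total_sum - left_sum, total_cnt - left_cnt
        # Standard variance-reduction proxy: larger means a better split.
        gain = left_sum ** 2 / left_cnt + right_sum ** 2 / right_cnt
        if gain > best_gain:
            best_gain, best_subset = gain, set(order[:i + 1])
    return best_subset, best_gain

cats = [1, 1, 2, 2, 3, 3, 3]
y = [1.0, 1.0, 5.0, 5.0, 1.0, 1.0, 1.0]
print(best_categorical_split(cats, y))  # ({1, 3}, 55.0): categories 1 and 3 tie on mean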

@github-actions

This issue has been automatically locked since there has not been any recent activity after it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this one.

github-actions bot locked as resolved and limited conversation to collaborators Aug 23, 2023