Results depend on categorical labels #3273
Comments
@shiyu1994 could you help to investigate this? |
Is there any update on this issue? We're also facing this critical issue in our production system and need the issue resolved urgently. I hope we can get an estimate of the time frame... thank you very much... |
@drexk @FiksII you can try encoding the categories starting from 1. Test results:
>>> df_data_1 = df_data.copy()
>>> df_data_1['category'] = df_data_1['category'].replace({'low': 1, 'mid': 2, 'high': 3}).astype(int)
>>>
>>> for i in range(10):
...     dataset_1 = lightgbm.Dataset(data=df_data_1, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_1, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15, metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579
>>>
>>> df_data_2 = df_data.copy()
>>> df_data_2['category'] = df_data_2['category'].replace({'low': 2, 'mid': 1, 'high': 3}).astype('int')
>>> for i in range(10):
...     dataset_2 = lightgbm.Dataset(data=df_data_2, label=boston_dataset['target'], categorical_feature=['category'])
...     cv_result = lightgbm.cv(lgb_params, dataset_2, num_boost_round=500, nfold=2, seed=i, early_stopping_rounds=15, metrics=['l1'], stratified=False)
...     metric_name = next(iter(cv_result.keys()))
...     print(len(cv_result[metric_name]), cv_result[metric_name][-1])
...
23 3.078809914662149
34 3.041569195467846
50 3.022566265676547
62 3.074435363402193
29 3.18402355037978
22 3.1114202718078685
24 3.1386666235755385
51 3.0808353633409604
123 3.1707610124398338
41 3.013074705873579 |
Thank you for attending to this bug quickly. I'll have a quick word with my team about this. Thanks! |
@shiyu1994 I think this seems a potential bug. |
Unfortunately, the bug still exists. Unfortunately, I could not find an example with a smaller number of samples.
Just check...
|
Or more common:
|
What heartbreaking news. Our production system is bleeding money every day because of this bug. @FiksII can you recommend an alternative machine learning library that my team can implement and switch to temporarily while waiting for this unfortunate disaster to get fixed? Thanks!!! |
@drexk Well, it's an open-source project, so we can't put pressure on them. I hope the developers are trying their best. This library is one of the best compared with all the others. That's why I'm using it. Good luck with your production system. I hope your team survives. |
Ok... It seems this bug is fixed in 3.0.0 @drexk |
@FiksII @drexk |
@FiksII |
Of course, it's not the production code. But sometimes our categories are linked with different numbers, so that's how I found this bug.
So far, yes. But, I need more time for testing. |
From the code, I think any mapping that starts from 1 should give identical results, since LightGBM re-maps the categories during training according to the count of each category. |
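A rough illustration of the count-based re-mapping described above. This is only a sketch, not LightGBM's binning code: `remap_by_count` is a hypothetical helper, and breaking ties by first appearance in the data is an assumption made here for illustration. Under those assumptions, two different integer encodings of the same column land on the same internal bins.

```python
from collections import Counter


def remap_by_count(values):
    """Map raw category codes to dense bin indices, most frequent first.

    Hypothetical helper for illustration only. Ties in the counts are
    broken by first appearance in the data (an assumption, not
    LightGBM's documented behaviour).
    """
    counts = Counter(values)
    first_seen = {}
    for pos, v in enumerate(values):
        first_seen.setdefault(v, pos)
    # Most frequent category gets bin 0, the next one bin 1, and so on.
    ordered = sorted(counts, key=lambda v: (-counts[v], first_seen[v]))
    code_to_bin = {v: i for i, v in enumerate(ordered)}
    return [code_to_bin[v] for v in values]


labels = ["low", "mid", "high", "mid", "low", "mid"]
enc_1 = {"low": 1, "mid": 2, "high": 3}
enc_2 = {"low": 2, "mid": 1, "high": 3}

bins_1 = remap_by_count([enc_1[x] for x in labels])
bins_2 = remap_by_count([enc_2[x] for x in labels])
print(bins_1 == bins_2)
```

If ties were instead broken by the raw integer code, the two encodings could end up with different bin orders whenever two categories have the same count.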
I appreciate all the pointers I get in this discussion. |
@guolinke. Approved. I have an example where there are a lot of categories with only 1 sample, and the results are different. |
@FiksII The low-frequency categories will be filtered in LightGBM, so it may not ensure consistency. |
@guolinke Yes, I understand. But it still looks strange. I don't know the details, but theoretically, splitting on a categorical feature is based on a metric (Gini, for example), and it should not depend on the encoded value of the categorical feature. As I remember, such a tree node contains a boolean condition, like: |
@guolinke Just in case, this is another failed example. LightGBM commit 1804fd1
|
@FiksII why you use so small Besides, you can use larger An test by your example:
|
Split finding for a categorical feature first sorts the categories, then searches for the best split by scanning from left to right (and from right to left). |
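The sort-then-scan idea above could be sketched as follows. This is a minimal illustration, not LightGBM's implementation: the sort key (summed gradient over summed hessian plus a smoothing term), the gain formula, and the names `best_categorical_split`, `reg_lambda` and `max_cat_per_side` are all assumptions chosen for the example. In a sketch like this the result depends only on the per-category statistics, not on the integer codes.

```python
from collections import defaultdict


def best_categorical_split(codes, gradients, hessians,
                           reg_lambda=1.0, max_cat_per_side=32):
    """Sort categories by a gradient statistic, then scan both directions.

    Returns (set of codes sent left, gain). Illustrative only.
    """
    g, h = defaultdict(float), defaultdict(float)
    for c, gi, hi in zip(codes, gradients, hessians):
        g[c] += gi
        h[c] += hi

    # Order categories by their smoothed gradient/hessian ratio.
    order = sorted(g, key=lambda c: g[c] / (h[c] + reg_lambda))
    total_g, total_h = sum(g.values()), sum(h.values())

    def score(gs, hs):
        return gs * gs / (hs + reg_lambda)

    parent = score(total_g, total_h)
    best_gain, best_left = 0.0, frozenset()
    # Scan prefixes of the sorted order, left to right and right to left.
    for direction in (order, order[::-1]):
        gl = hl = 0.0
        left = []
        for c in direction[:-1][:max_cat_per_side]:
            gl += g[c]
            hl += h[c]
            left.append(c)
            gain = score(gl, hl) + score(total_g - gl, total_h - hl) - parent
            if gain > best_gain:
                best_gain, best_left = gain, frozenset(left)
    return best_left, best_gain


labels = ["low", "mid", "high", "mid", "low", "high"]
grads = [-1.0, 0.5, 2.0, 0.7, -0.8, 1.5]
hess = [1.0] * len(labels)

enc_1 = {"low": 1, "mid": 2, "high": 3}
enc_2 = {"low": 2, "mid": 1, "high": 3}
names_1 = {v: k for k, v in enc_1.items()}
names_2 = {v: k for k, v in enc_2.items()}

left_1, gain_1 = best_categorical_split([enc_1[x] for x in labels], grads, hess)
left_2, gain_2 = best_categorical_split([enc_2[x] for x in labels], grads, hess)

# Translate the integer codes back to names before comparing.
split_1 = {names_1[c] for c in left_1}
split_2 = {names_2[c] for c in left_2}
print(split_1 == split_2, gain_1 == gain_2)
```

The cap on categories per side is what makes the second scan direction matter here; without it, every suffix is just the complement of a prefix.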
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
LightGBM component: Python package
LightGBM version: lightgbm==2.3.1
Environment info
Operating System: Windows 10
Python version: 3.7.6
The results (for example, cv) depend on the categorical labels. As an example, I took the Boston dataset and added an artificial categorical feature with values ['low', 'mid', 'high']. Then I encoded it:
dataset_1 is created based on {'low': 0, 'mid': 1, 'high': 2} encoding, dataset_2 is created based on {'low': 1, 'mid': 0, 'high': 2}.
And CV results are different.
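For reference, a small sketch of why the two encodings are expected to be equivalent. It uses a made-up five-row column rather than the Boston data, and `row_groups` is a hypothetical helper: both encodings group exactly the same rows together, only the integer attached to each group differs.

```python
labels = ["low", "mid", "high", "mid", "low"]  # made-up example column

enc_1 = {"low": 0, "mid": 1, "high": 2}
enc_2 = {"low": 1, "mid": 0, "high": 2}

col_1 = [enc_1[x] for x in labels]
col_2 = [enc_2[x] for x in labels]


def row_groups(column):
    """Return the row indices grouped by category code, ignoring the code itself."""
    groups = {}
    for row, code in enumerate(column):
        groups.setdefault(code, []).append(row)
    return sorted(groups.values())


# The columns differ, but the grouping of rows is identical.
print(col_1 != col_2, row_groups(col_1) == row_groups(col_2))
```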