[python] Re-enable scikit-learn 0.22+ support #2946
Conversation
Thanks for the update! Really excited that we now have approval that check_no_attributes_set_in_init can be safely skipped.
Thanks @StrikerRUS !
To be more specific, what was approved is setting private attributes in `__init__`; it would be good to double check that this is verified. But that applies to any scikit-learn version, not just scikit-learn 0.22+, so this PR is still necessary in any case.
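For context, a minimal sketch (not LightGBM code; the estimator and attribute names are made up) of the pattern that check_no_attributes_set_in_init objects to, namely `__init__` setting an attribute that does not correspond to a constructor parameter:

```python
# Hypothetical estimator illustrating what check_no_attributes_set_in_init flags:
# any attribute set in __init__ that is not a constructor parameter, including
# private ones such as _cache below.
from sklearn.base import BaseEstimator


class SketchEstimator(BaseEstimator):
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # fine: mirrors the constructor parameter
        self._cache = None  # flagged by the check, even though it is private
```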
I like the comment you're referring to! It makes a lot of sense! Unfortunately, it seems that the new scikit-learn version brings more tests we are not passing.
If you wish, I can pick up this PR and continue re-enabling support for new versions. I understand that you have more important things to do than fixing third-party libraries 🙂 . Approval of skipping the test is more than enough! We really appreciate the support!
Thank you, that would be great @StrikerRUS ! BTW, I opened a minor maintenance issue that might make using these tests a bit easier in the future: #2947
If you rebase to […] Sorry for the inconvenience!
In my opinion, there's a big issue with the current scikit-learn wrapper. In general, most libraries in the scikit-learn ecosystem (and sklearn itself) expect a fit() method where you only pass X and y (and maybe sample_weight). In the current wrapper, fit() also takes other params such as early_stopping_round. I think we should move as many parameters as possible from the fit method to the estimator constructor. For example, CatBoost (https://catboost.ai/docs/concepts/python-reference_catboost.html) allows setting most parameters through the constructor (you can set early_stopping_round either when you create the estimator object or in the fit method). Conversely, if I want to create a StackingClassifier (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html), it's not clear how to pass the additional LightGBM fit() parameters through the StackingClassifier wrapper. In the past I created a custom LightGBM compatibility layer where I passed parameters to the constructor (inside a fit_params dictionary) that were used when calling LightGBM's fit() method. I think we should definitely move as many parameters as possible out of the fit() method to improve compatibility with the sklearn ecosystem.
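A minimal sketch of the compatibility problem described above, assuming the lightgbm scikit-learn wrapper of that era (where early_stopping_rounds was still a fit() argument) and scikit-learn 0.22's StackingClassifier; the data and parameter values are made up:

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=42)

# Standalone usage: early stopping is configured through fit(), not the constructor.
clf = LGBMClassifier(n_estimators=1000)
clf.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=10)

# Inside a scikit-learn meta-estimator there is no way to forward those fit()
# arguments to the base estimator: StackingClassifier.fit() only accepts X, y
# and sample_weight, so early stopping cannot be configured here.
stack = StackingClassifier(
    estimators=[("lgbm", LGBMClassifier(n_estimators=100))],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
```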
+1, but that's independent of the scikit-learn version, so it would be good to merge this PR in any case :)
As discussed in #2628 (comment), this re-enables scikit-learn 0.22+ support.
Skips the check_no_attributes_set_in_init common check in check_estimator. Hopefully it will be fixed before 0.23: scikit-learn/scikit-learn#16241.
Reverts #2637.
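For reference, a minimal sketch (not the PR's actual test code) of how a single common check can be skipped, assuming scikit-learn 0.22+ where check_estimator accepts generate_only=True:

```python
from sklearn.utils.estimator_checks import check_estimator

from lightgbm import LGBMClassifier

for estimator, check in check_estimator(LGBMClassifier(), generate_only=True):
    # Each yielded check is a functools.partial, so look up the wrapped
    # function's name and skip the one that cannot be passed yet.
    check_name = getattr(check.func, "__name__", "")
    if check_name == "check_no_attributes_set_in_init":
        continue  # see scikit-learn/scikit-learn#16241
    check(estimator)
```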
Another downside of not supporting scikit-learn 0.22+ is that users would have to compile scikit-learn 0.21.3 from source on Python 3.8, since no wheels are published for it. On conda it would mean that it's simply not possible to install lightgbm on Python 3.8 if you also have scikit-learn installed.