Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python] Re-enable scikit-learn 0.22+ support #2946

Closed

Conversation

rth
Copy link

@rth rth commented Mar 26, 2020

As discussed in #2628 (comment) re-enables scikit-learn 0.22+ support.

Skips the check_no_attributes_set_in_init common check in check_estimator. Hopefully it will be fixed before 0.23 scikit-learn/scikit-learn#16241

Reverts #2637

Another downside of not supporting scikit-learn 0.22+ is that users will have to compile scikit-learn 0.21.3 from sources on Python 3.8 since there are no wheels there. On conda it would mean that it's simply not possible to install lightgbm on Python 3.8 if you also have scikit-learn installed.

@msftclas
Copy link

msftclas commented Mar 26, 2020

CLA assistant check
All CLA requirements met.

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update! Really excited that we now have approval of that check_no_attributes_set_in_init can be safely skipped.

@StrikerRUS StrikerRUS changed the title Re-enable scikit-learn 0.22+ support [python] Re-enable scikit-learn 0.22+ support Mar 26, 2020
@rth
Copy link
Author

rth commented Mar 26, 2020

Thanks @StrikerRUS !

Really excited that we now have approval of that check_no_attributes_set_in_init can be safely skipped.

To be more specific, setting private attributes in __init__ is fine under the following condition,

scikit-learn/scikit-learn#16241 (comment)
as long as the public attributes defined as constructor params (that is the model hyperparametrs) can be get/set at any moment either via getattr / setattr or via get_params / set_params and that subsequent calls to the fit / predict / transform automatically take those changes into account.

It would be good to double check that this is verified. But in any case, that applies to any scikit-learn version not just scikit-learn 0.22+ so this PR is still necessary in any case.

@StrikerRUS
Copy link
Collaborator

I like the comment you're referring! It makes a lot sense!

Unfortunately, it seems that new scikit-learn version brings more tests we are not passing.

=================================== FAILURES ===================================
_____________________ TestSklearn.test_sklearn_integration _____________________695
self = <test_sklearn.TestSklearn testMethod=test_sklearn_integration>
    @unittest.skipIf(sk_version < '0.19.0', 'scikit-learn version is less than 0.19')
        def test_sklearn_integration(self):
            # we cannot use `check_estimator` directly since there is no skip test mechanism
            for name, estimator in ((lgb.sklearn.LGBMClassifier.__name__, lgb.sklearn.LGBMClassifier),
                                    (lgb.sklearn.LGBMRegressor.__name__, lgb.sklearn.LGBMRegressor)):
                check_parameters_default_constructible(name, estimator)
                # we cannot leave default params (see https://github.com/microsoft/LightGBM/issues/833)
                estimator = estimator(min_child_samples=1, min_data_in_bin=1)
                for check in _yield_all_checks(name, estimator):
                    check_name = check.func.__name__ if hasattr(check, 'func') else check.__name__
                    if check_name == 'check_estimators_nan_inf':
                        continue  # skip test because LightGBM deals with nan
                    elif (check_name == "check_no_attributes_set_in_init"
                          and sk_version < '0.23.0'):
                        # scikit-learn <0.23 incorrectly asserts that private attributes
                        # cannot be set in __init__.
                        continue
                    try:
                    >                   check(name, estimator)
../tests/python_package_test/test_sklearn.py:302: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/_testing.py:327: in wrapper
    return fn(*args, **kwargs)
    ../../../../miniconda/envs/test-env/lib/python3.6/site-packages/sklearn/utils/estimator_checks.py:743: in check_sample_weights_not_an_array
    estimator.fit(X, y, sample_weight=weights)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/sklearn.py:830: in fit
    callbacks=callbacks, init_model=init_model)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/sklearn.py:615: in fit
    callbacks=callbacks, init_model=init_model)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/engine.py:228: in train
    booster = Booster(params=params, train_set=train_set)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:1782: in __init__
    train_set.construct()
    ../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:1156: in construct
    categorical_feature=self.categorical_feature, params=self.params)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:970: in _lazy_init
    self.set_weight(weight)
    ../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:1503: in set_weight
    weight = list_to_1d_numpy(weight, name='weight')
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
data = <sklearn.utils.estimator_checks._NotAnArray object at 0x7f445c884f28>
dtype = <class 'numpy.float32'>, name = 'weight'
    def list_to_1d_numpy(data, dtype=np.float32, name='list'):
            """Convert data to numpy 1-D array."""
            if is_numpy_1d_array(data):
                if data.dtype == dtype:
                    return data
                else:
                    return data.astype(dtype=dtype, copy=False)
            elif is_1d_list(data):
                return np.array(data, dtype=dtype, copy=False)
            elif isinstance(data, Series):
                if _get_bad_pandas_dtypes([data.dtypes]):
                    raise ValueError('Series.dtypes must be int, float or bool')
                return np.array(data, dtype=dtype, copy=False)  # SparseArray should be supported as well
            else:
                raise TypeError("Wrong type({0}) for {1}.\n"
                >                           "It should be list, numpy 1-D array or pandas Series".format(type(data).__name__, name))
    E           TypeError: Wrong type(_NotAnArray) for weight.
    E           It should be list, numpy 1-D array or pandas Series
../../../../.local/lib/python3.6/site-packages/lightgbm/basic.py:85: TypeError

If you wish, I can pick this PR and continue re-enabling support of new versions. I understand that you have more important things to do rather than fixing third-party libraries 🙂 . Approval of skipping the test is more than enough! We really appreciate the support!

@rth
Copy link
Author

rth commented Mar 26, 2020

If you wish, I can pick this PR and continue re-enabling support of new versions.

Thank you, that would be great @StrikerRUS ! BTW, I opened a minor maintenance issue that might make using these tests a bit easier in the future #2947

@jameslamb
Copy link
Collaborator

If you rebase to master, to get the changes from #2954 , the R test issues should be resolved. That will also fix the weird state of the Travis build, which was a result of the recent Travis outage.

Sorry for the inconvenience!

@ekerazha
Copy link

ekerazha commented Apr 2, 2020

In my opinion there's a big issue with the current scikit-learn wrapper.

In general, most libraries in the scikit-learn ecosystem (or sklearn itself) expect a fit() method where you only pass X and y (and maybe sample_weight).

In the current wrapper we also have other params such as early_stopping_round. I think we should move as much parameters as possible from the fit method to the estimator constructor.

For example, catboost https://catboost.ai/docs/concepts/python-reference_catboost.html allows to set most parameters through the constructor (you can set early_stopping_round both when you create the estimator object or in the fit method).

For example, if I want to create a StackedClassifier https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.StackingClassifier.html it's not clear how to pass the additional LightGBM fit() parameters to the StackedClassifier wrapper.

In the past I created a custom LightGBM compatibility layer where I passed parameters to the constructor (inside a fit_params dictionary) that were used when calling LightGBM's fit() method.

I think we should definitely move as much as parameters as possible out of the fit() method to improve compatibility with the sklearn ecosystem.

@rth
Copy link
Author

rth commented Apr 2, 2020

In the current wrapper we also have other params such as early_stopping_round. I think we should move as much parameters as possible from the fit method to the estimator constructor.

+1 but that's independent of scikit-learn version so it would be good to merge this PR in any case :)

@lock lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants