[python-package] Support feature_names_in_ attribute via sklearn API #6279

Closed
ravwojdyla opened this issue Jan 18, 2024 · 6 comments · Fixed by #6310

Comments

@ravwojdyla

Summary

The sklearn API supports a feature_names_in_ attribute on a fitted model (SLEP007), which records the feature names/columns that were passed to the model.fit method. This is very useful information, and a standard worth conforming to. As far as I understand, that information is currently available from the booster:

est.booster_.feature_name()

It shouldn't be too hard to also expose that information via the feature_names_in_ attribute 🙏
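
One way to picture the request is a read-only property on the estimator that delegates to the booster. The sketch below uses hypothetical stand-in classes (StubBooster and StubEstimator are invented for illustration and are not LightGBM's real classes or its eventual implementation). It also returns a plain list to stay dependency-free, whereas scikit-learn's convention is an array of strings.

```python
class StubBooster:
    """Hypothetical stand-in for a fitted booster; only mimics feature_name()."""

    def __init__(self, feature_names):
        self._feature_names = list(feature_names)

    def feature_name(self):
        # Return a copy so callers cannot mutate internal state.
        return list(self._feature_names)


class StubEstimator:
    """Hypothetical sklearn-style wrapper, not LightGBM's actual implementation."""

    def fit(self, columns):
        # A real fit() would receive X and y; here only the column names matter.
        self.booster_ = StubBooster(columns)
        return self

    @property
    def feature_names_in_(self):
        # Raise AttributeError before fit so hasattr() stays False until then.
        if not hasattr(self, "booster_"):
            raise AttributeError("feature_names_in_ is only available after fit")
        return self.booster_.feature_name()


est = StubEstimator()
unfitted_has_attr = hasattr(est, "feature_names_in_")
est.fit(["age", "income"])
fitted_names = est.feature_names_in_
```

Raising AttributeError from the property keeps hasattr(est, "feature_names_in_") false on an unfitted estimator, which is how fitted attributes with a trailing underscore usually behave.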

Motivation

It would conform to the sklearn API standards and improve the usability of LightGBM models, especially when used alongside other sklearn models and Pipelines.

References

@jameslamb jameslamb changed the title Support feature_names_in_ attribute via sklearn API [python-package] Support feature_names_in_ attribute via sklearn API Jan 19, 2024
@jameslamb
Collaborator

Thanks for using LightGBM and taking the time to report this!

We'd welcome this addition. Would you like to contribute it?

And a side question: do you think it's an oversight that scikit-learn's estimator checks don't enforce this? We follow https://scikit-learn.org/stable/modules/generated/sklearn.utils.estimator_checks.check_estimator.html in LightGBM's tests to try to catch such things:

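import lightgbm as lgb
from sklearn.utils.estimator_checks import parametrize_with_checks

@parametrize_with_checks([lgb.LGBMClassifier(), lgb.LGBMRegressor()])
def test_sklearn_integration(estimator, check):
    # Loosen LightGBM's minimum-sample defaults for the small datasets the checks use.
    estimator.set_params(min_child_samples=1, min_data_in_bin=1)
    check(estimator)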

Using scikit-learn==1.3.2 (the latest released version as of this writing), check_estimator() says LGBMClassifier and LGBMRegressor are compliant with scikit-learn's expectations for estimators.

import lightgbm as lgb
from sklearn.utils.estimator_checks import check_estimator

check_estimator(lgb.LGBMClassifier())
check_estimator(lgb.LGBMRegressor())

But in the SLEP you linked, it says the following:

Backward Compatibility
All estimators should implement the feature_names_in_ and get_feature_names_out() API. This is checked in check_estimator...

@nicklamiller
Contributor

I would very much like to contribute to LightGBM, and this seems like a great issue. With @ravwojdyla's blessing, I'd be happy to make this contribution.

@ravwojdyla
Author

@nicklamiller sounds great - thank you!

@jameslamb
Collaborator

Do either of you know the answer to my question about check_estimator() from the latest scikit-learn not complaining about this?

@nicklamiller
Contributor

nicklamiller commented Jan 26, 2024

Backward Compatibility
All estimators should implement the feature_names_in_ and get_feature_names_out() API. This is checked in check_estimator...

@jameslamb I agree that, based on SLEP007, this check should be implemented in check_estimator, and it does not appear to be. Here's a somewhat recent issue (scikit-learn/scikit-learn#27907) about sklearn estimators that lack(ed) this attribute. It looks like the attribute is only checked, and created if missing, when _validate_data is called.

I can open an issue in sklearn and propose that this behavior be checked more rigorously by check_estimator.
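
As a rough illustration of what a stricter check could look like, here is a minimal sketch. Everything in it is hypothetical: FrameStub stands in for a DataFrame with string column names, RecordsNames and ForgetsNames are toy estimators, and records_feature_names is an invented helper, not an existing scikit-learn check.

```python
class FrameStub:
    """Hypothetical stand-in for a DataFrame: rows plus string column names."""

    def __init__(self, rows, columns):
        self.rows = rows
        self.columns = list(columns)


class RecordsNames:
    """Toy estimator that stores the input column names during fit()."""

    def fit(self, X, y=None):
        self.feature_names_in_ = list(X.columns)
        return self


class ForgetsNames:
    """Toy estimator that fits without recording any column names."""

    def fit(self, X, y=None):
        return self


def records_feature_names(estimator, X, y=None):
    """Return True if fit() left feature_names_in_ matching X's columns."""
    estimator.fit(X, y)
    names = getattr(estimator, "feature_names_in_", None)
    return names is not None and list(names) == list(X.columns)


X = FrameStub(rows=[[1.0, 2.0], [3.0, 4.0]], columns=["a", "b"])
ok = records_feature_names(RecordsNames(), X, [0, 1])
missing = records_feature_names(ForgetsNames(), X, [0, 1])
```

An estimator that never sets the attribute fails such a check, regardless of which internal validation helper it calls.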

@jameslamb
Collaborator

Thanks very much for the link to scikit-learn/scikit-learn#27907 @nicklamiller !

Please link to this issue from whatever one you create in scikit-learn.
