-
-
Notifications
You must be signed in to change notification settings - Fork 25.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Adds feature names support to dataframe protocol #26464
ENH Adds feature names support to dataframe protocol #26464
Conversation
I want to add this def test_check_feature_names_arrow():
pa = importorskip("pyarrow")
pd = importorskip("pandas")
df = pd.DataFrame(
{
"n_legs": [None, 4, 5, None],
"animals": ["Flamingo", "Horse", None, "Centipede"],
},
columns=["n_legs", "animals"],
)
table = pa.Table.from_pandas(df)
table_names = _get_feature_names(table)
assert_array_equal(table_names, df.columns) but that adds a |
I think it's ok to either add pyarrow or polars to one of our CI runs. I think we would probably benefit from a real non-pandas dataframe integration test. But maybe this can come latter, once we implement generic dataframe support for cross-validation and meta-estimators (e.g. column transformer, pipeline, searchcv, bagging...). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as @ogrisel I'd be happy to have a polars or some other dataframe lib included in the CI.
sklearn/utils/validation.py
Outdated
@@ -1986,8 +1986,11 @@ def _get_feature_names(X): | |||
feature_names = None | |||
|
|||
# extract feature names for support array containers | |||
if hasattr(X, "columns"): | |||
if hasattr(X, "columns") and hasattr(X, "iloc"): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we REALLY need a _is_pandas_df
util 🙈
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think this would still catch other DataFrame libraries like Modin that have an iloc
attribute as well, in which case might be better to have the __dataframe__
check first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying not to introduce a regression with pandas. Specifically df.__dataframe__
with the default allow_copy=True
can make copies.
The better approach is use the DataFrame API SPEC. Concretely, DataFrame.get_column_names. Currently, there is no library that supports the SPEC yet.
There is an implementation of the DataFrame SPEC for pandas and polars: https://github.com/MarcoGorelli/impl-dataframe-api. The implementation for each library is one file each so it will not be hard for us to vendor and make use of the SPEC before the libraries adopt it.
As for this PR, I introduced a _is_pandas_df
function that makes sure that X
is a pandas DataFrame.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for temporarily vendoring the compat layer and try to only use the spec API in our estimators' code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The above can be done in a subsequent PR, no need to delay this one.
re:pyarrow, since pandas is thinking of making it a mandatory dependency anyway, I'd e happy to have it in one CI. But since it's large, rather not have it in every CI job. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
As @adrinjalali says, some CI configurations with pyarrow or polars would be important, IMHO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ridge failure seems unrelated, I will open a dedicated issue if none already exist.
✔️ Linting PassedAll linting checks passed. Your pull request is in excellent shape! ☀️ Generated for commit: 9b3f26e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Otherwise LGTM.
Reference Issues/PRs
Towards #25896
What does this implement/fix? Explain your changes.
This PR allows all estimators to recognize the feature names from any DataFrame that follows the DataFrame interchange protocol. With this PR, DataFrames that support the interchange protocol and works with
np.asarray
will work with scikit-learn estimators.