ENH Adds feature names support to dataframe protocol #26464

thomasjpfan · 2023-05-30T18:30:58Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

This PR allows all estimators to recognize the feature names from any DataFrame that follows the DataFrame interchange protocol. With this PR, DataFrames that support the interchange protocol and work with np.asarray will work with scikit-learn estimators.
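As a rough illustration of the idea (not the code merged in this PR): the interchange protocol exposes column names through the object returned by `__dataframe__()`, via its `column_names()` method. The sketch below uses a hypothetical `ToyFrame` class standing in for a real dataframe library and a hypothetical helper `feature_names_from_protocol`; both names are assumptions made only for this example.

```python
class _ToyInterchangeFrame:
    """Hypothetical stand-in for the object returned by ``__dataframe__()``."""

    def __init__(self, names):
        self._names = list(names)

    def column_names(self):
        # The interchange protocol exposes column names through this method.
        return iter(self._names)


class ToyFrame:
    """Hypothetical dataframe-like container; not a real library class."""

    def __init__(self, data):
        self._data = dict(data)

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return _ToyInterchangeFrame(self._data.keys())


def feature_names_from_protocol(X):
    """Return column names as a list, or None if X lacks ``__dataframe__``."""
    if not hasattr(X, "__dataframe__"):
        return None
    return list(X.__dataframe__().column_names())


frame = ToyFrame({"n_legs": [2, 4], "animals": ["Flamingo", "Horse"]})
names = feature_names_from_protocol(frame)
```

Objects without `__dataframe__` (plain lists, NumPy arrays) fall through and yield `None`, which mirrors the "no feature names" case.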

thomasjpfan · 2023-05-30T22:23:01Z

I want to add this pyarrow test:

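from numpy.testing import assert_array_equal
from pytest import importorskip
from sklearn.utils.validation import _get_feature_names

def test_check_feature_names_arrow():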
    pa = importorskip("pyarrow")
    pd = importorskip("pandas")
    df = pd.DataFrame(
        {
            "n_legs": [None, 4, 5, None],
            "animals": ["Flamingo", "Horse", None, "Centipede"],
        },
        columns=["n_legs", "animals"],
    )
    table = pa.Table.from_pandas(df)

    table_names = _get_feature_names(table)
    assert_array_equal(table_names, df.columns)

but that adds a pyarrow dependency to our CI and pyarrow is quite a large dependency.

ogrisel · 2023-05-31T09:38:48Z

> but that adds a pyarrow dependency to our CI and pyarrow is quite a large dependency.

I think it's ok to either add pyarrow or polars to one of our CI runs.

I think we would probably benefit from a real non-pandas dataframe integration test. But maybe this can come later, once we implement generic dataframe support for cross-validation and meta-estimators (e.g. column transformer, pipeline, searchcv, bagging...).

ogrisel

LGTM!

sklearn/tests/test_base.py

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

adrinjalali

Same as @ogrisel, I'd be happy to have polars or some other dataframe lib included in the CI.

adrinjalali · 2023-06-01T13:39:09Z

sklearn/utils/validation.py

@@ -1986,8 +1986,11 @@ def _get_feature_names(X):
    feature_names = None

    # extract feature names for support array containers
-    if hasattr(X, "columns"):
+    if hasattr(X, "columns") and hasattr(X, "iloc"):


we REALLY need a _is_pandas_df util 🙈

mroeschke · 2023-06-08

Yeah, I think this would still catch other DataFrame libraries like Modin that have an iloc attribute as well, in which case it might be better to have the __dataframe__ check first?

thomasjpfan · 2023-06-14

I'm trying not to introduce a regression with pandas. Specifically, df.__dataframe__ with the default allow_copy=True can make copies.

The better approach is to use the DataFrame API SPEC, concretely DataFrame.get_column_names. Currently, no library supports the SPEC yet.

There is an implementation of the DataFrame SPEC for pandas and polars: https://github.com/MarcoGorelli/impl-dataframe-api. The implementation for each library is a single file, so it will not be hard for us to vendor it and make use of the SPEC before the libraries adopt it.

As for this PR, I introduced a _is_pandas_df function that makes sure that X is a pandas DataFrame.

ogrisel · 2023-06-15

+1 for temporarily vendoring the compat layer and trying to use only the spec API in our estimators' code.

ogrisel · 2023-06-15

The above can be done in a subsequent PR; no need to delay this one.

adrinjalali · 2023-06-01T13:48:08Z

re: pyarrow, since pandas is thinking of making it a mandatory dependency anyway, I'd be happy to have it in one CI job. But since it's large, I'd rather not have it in every CI job.

lorentzenchr

LGTM
As @adrinjalali says, some CI configurations with pyarrow or polars would be important, IMHO.

doc/whats_new/v1.3.rst

ogrisel

The ridge failure seems unrelated; I will open a dedicated issue if none already exists.

sklearn/utils/validation.py

sklearn/tests/test_base.py

github-actions · 2023-06-21T11:22:44Z

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 9b3f26e

sklearn/utils/_testing.py

sklearn/utils/tests/test_validation.py

glemaitre

Otherwise LGTM.

thomasjpfan added 2 commits May 30, 2023 14:27

ENH Adds feature names support to dataframe protocol

b49d391

DOC Adds PR number

130db8e

github-actions bot added the module:utils label May 30, 2023

thomasjpfan mentioned this pull request May 30, 2023

ENH Support dataframe exchange protocol in ColumnTransformer as input #26115

Closed

thomasjpfan added 2 commits May 30, 2023 18:10

FIX Use list

412084e

CLN Only use columns for dataframe like objects

0d7945e

DOC Adds more information

66cdb93

glemaitre mentioned this pull request May 31, 2023

Added Support for polars #26435

Closed

ogrisel approved these changes May 31, 2023

sklearn/tests/test_base.py (outdated, resolved)

Apply suggestions from code review

e36e052

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

adrinjalali approved these changes Jun 1, 2023

lorentzenchr approved these changes Jun 2, 2023

doc/whats_new/v1.3.rst (4 threads, all outdated and resolved)

thomasjpfan added 7 commits June 13, 2023 13:54

DOC Update docstring

1bf440f

Merge remote-tracking branch 'upstream/main' into feature_names_dataf…

42287e7

…rame_protocol

TST Adds polars and pyarrow test

42e741e

ENH Adds _is_pandas_df

42d2412

Merge remote-tracking branch 'upstream/main' into feature_names_dataf…

b42c6fa

…rame_protocol

REV Revert env changes

3f56022

CI Use conda-forge to enable pyarrow and polars tests

adb1354

ogrisel reviewed Jun 15, 2023

sklearn/utils/validation.py (resolved)

sklearn/tests/test_base.py (outdated, resolved)

ogrisel mentioned this pull request Jun 15, 2023

FIX seed in test_ridge_sample_weight_consistency [all random seeds] #26589

Merged

CLN Adjust comments

73342a6

thomasjpfan added 3 commits June 21, 2023 13:28

DOC Better docs

879e960

Merge remote-tracking branch 'upstream/main' into feature_names_dataf…

acb8183

…rame_protocol

DOC Move changelog to 1.4

f8118ed

glemaitre reviewed Jun 21, 2023

sklearn/utils/_testing.py (outdated, resolved)

glemaitre reviewed Jun 21, 2023

sklearn/utils/tests/test_validation.py (resolved)

glemaitre approved these changes Jun 21, 2023

CLN Address comments

ef432a7

jovan-stojanovic mentioned this pull request Jun 21, 2023

[WIP] FIX Add tests for pyarrow dtypes in pandas Dataframes #26651

Draft

thomasjpfan added 3 commits June 21, 2023 18:09

WIP

c75ac23

Merge remote-tracking branch 'upstream/main' into feature_names_dataf…

a25581f

…rame_protocol_bk

STY Fix

8e524e9

ogrisel reviewed Jun 21, 2023

sklearn/utils/validation.py (resolved)

Add comment to _get_feature_names to explain pandas special casing

5e9cb61

ogrisel enabled auto-merge (squash) June 21, 2023 22:44

ogrisel reviewed Jun 21, 2023

sklearn/tests/test_base.py (resolved)

ogrisel added 2 commits June 22, 2023 00:46

Link to issue to track pyarrow / asarray bug

bf858fc

Merge branch 'main' into feature_names_dataframe_protocol

9b3f26e

ogrisel merged commit 0e1f170 into scikit-learn:main Jun 21, 2023

thomasjpfan mentioned this pull request Jul 30, 2023

ENH Adds polars output support to ColumnTransformer #26683

Merged

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023

ENH Adds feature names support to dataframe protocol (scikit-learn#26464)

4c2e207

Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>

krz mentioned this pull request Mar 25, 2024

Support polars data frames py-why/dowhy#1151

Open

lorentzenchr mentioned this pull request Dec 14, 2024

Support other dataframes like polars and pyarrow not just pandas #25896

Closed
