feat: make ColumnDropper dataframe-agnostic #655

MarcoGorelli · 2024-05-05T13:46:54Z

Hey 👋 I discussed a bit with Francesco, who discussed it with Vincent. So, I was keen to get the ball rolling

Description

Making a start towards dataframe-agnosticism. For now, this only makes ColumnDropped dataframe-agnostic. It requires adding an extremely lightweight dependecy (Narwhals), but the result is that computation can happen natively for pandas/Polars/modin/cuDF (and any other library which may want to become Narwhals-compliant), without any data conversion.

pandas is still a required dependency, as it's required for the rest of scikit-lego. But, it doesn't seem that far-fetched to be able to make all of scikit-lego dataframe-agnostic. If you're open to this, we could do it in stages?

There is no impact on current users, other than that Narwhals would be a required dependency (though it's really lightweight, and without any dependencies itself, so there's no risk for conflicts)

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

My code follows the style guidelines (ruff)
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation (also to the readme.md)
I have added tests that prove my fix is effective or that my feature works
I have added tests to check whether the new feature adheres to the sklearn convention
New and existing unit tests pass locally with my changes

Not sure about some of these items so I haven't checked them.

Demo

pandas users, they can keep using scikit-lego, without needing Polars to be installed

$ pip uninstall polars -y
Found existing installation: polars 0.20.23
Uninstalling polars-0.20.23:
  Successfully uninstalled polars-0.20.23
$ cat t.py 
import pandas as pd
from sklego.preprocessing import ColumnDropper

df = pd.DataFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})
print(ColumnDropper(["name"]).fit_transform(df))
$ python t.py 
   length  shoesize
0    1.82        42
1    1.85        44
2    1.80        45

But, Polars users can use this, without it doing any conversion to pandas. In particular, Polars LazyFrame is supported (and stays lazy, but here's I'm calling .collect to show you the output):

$ cat t.py 
import polars as pl
from sklego.preprocessing import ColumnDropper

df = pl.LazyFrame({
    "name": ["Swen", "Victor", "Alex"],
    "length": [1.82, 1.85, 1.80],
    "shoesize": [42, 44, 45]
})
result = ColumnDropper(["name"]).fit_transform(df)
print(result.explain())
print(result.collect())
$ python t.py 
DF ["name", "length", "shoesize"]; PROJECT 2/3 COLUMNS; SELECTION: "None"
shape: (3, 2)
┌────────┬──────────┐
│ length ┆ shoesize │
│ ---    ┆ ---      │
│ f64    ┆ i64      │
╞════════╪══════════╡
│ 1.82   ┆ 42       │
│ 1.85   ┆ 44       │
│ 1.8    ┆ 45       │
└────────┴──────────┘

MarcoGorelli · 2024-05-05T13:56:44Z

sklego/preprocessing/pandastransformers.py

-    @staticmethod
-    def _check_X_for_type(X):
-        """Checks if input of the Selector is of the required dtype"""
-        if not isinstance(X, pd.DataFrame):
-            raise TypeError("Provided variable X is not of type pandas.DataFrame")


I'm removing this here, as nw.from_native already raises a similar error. For example, if you tried passing in a numpy array here, you'd get:

TypeError: Expected pandas-like dataframe, Polars dataframe, or Polars lazyframe, got: <class 'numpy.ndarray'>

FBruzzesi

Honestly it looks quite neat! I left some considerations for better sync with narwhals in the long term.

Edit: Regarding macos tests, we are getting a lot of fails since latest upgraded to macos-14. We definitely need to get a closer look at those, but completely unrelated to the PR

pyproject.toml

sklego/preprocessing/pandastransformers.py

tests/test_preprocessing/test_columndropper.py

…raries

sklego/preprocessing/pandastransformers.py

koaning · 2024-05-06T19:21:45Z

@FBruzzesi notice those failing test runs, I have a creeping suspicion that these may be caused by uv and they feel similar to the failing cronjobs. Just to check, did you have a look?

FBruzzesi · 2024-05-06T21:02:14Z

@koaning I finally took a look at the CI...and you are right 😁

Moving back to pip doesn't break when building wheel (proof on my fork)
uv has some related issue when building wheels from scratch and qdldl has no built distribution for macos 10+ and arm architecture.

As a side note: even with pip, the --pre flag break as of scikit-learn==1.5.0rc1 (proof on my fork).

FBruzzesi · 2024-05-07T04:49:52Z

@koaning another food for thought is that... all these transformers are in a ../pandastransformers.py module, however this should be transparent to the user as they are import-able from sklego.preprocessing directly. I would feel fairly safe to change the filename once all of them are dataframe agnostic

koaning · 2024-05-07T05:04:02Z

@FBruzzesi any objections to moving back to pip for now? I prefer to wait until uv is a bit more reliable before relying our CI on it.

I made an issue for that v1.5rc01 test failure on the sklearn side, feels like it might be breaking so upstream should know about it.

I would feel fairly safe to change the filename once all of them are dataframe agnostic.

Totally!

FBruzzesi

I am biased as I added a few commits myself 😂

koaning · 2024-05-07T09:34:02Z

Do we want to release early with a micro release by the way or would we prefer to make a bigger splash? There's something to be said for both, but given that we are introducing a dependency ... might be good to make a larger release?

anopsy · 2024-05-07T09:40:19Z

If everybody is fine with it, I could finish pandastransformers.py, or at least have a shot at ColumnSelector and see how it goes

FBruzzesi · 2024-05-07T10:56:48Z

@koaning I would personally opt for one somewhat big release. We can wait to merge the PRs related to Narwhals migration to have some space for bug fixes if needed.

@anopsy Of course feel free to develop that. I will create an issue to track progress 😇

MarcoGorelli · 2024-05-07T17:05:25Z

Just got a question about

We can wait to merge the PRs related to Narwhals migration to have some space for bug fixes if needed.

No objections to waiting for a bigger release to include this, I'm just worried that it's going to become quite hard to review if everything is done in a single PR, or if PRs build on top of other open PRs

May I suggest that you (i.e. a scikit-lego committer) make a narwhals branch in scikit-lego, we make PRs to that branch, and then if/when you're ready, you merge the narwhals branch into main and include it in the release?

FBruzzesi · 2024-05-07T18:08:51Z

@MarcoGorelli that's very reasonable! Here you have it: narwhals-development branch.

koaning · 2024-05-07T18:37:33Z

FWIW I think pushing to main is fine too. It's more of a "oh, we have one new feature, usually that means deploy". But we can also collect a bunch of things to main ... this is the main thing we're going to work on short-term.

* placeholder to develop narwhals features * feat: make `ColumnDropper` dataframe-agnostic (#655) * feat: make ColumnDropped dataframe-agnostic * use narwhals[polars] in pyproject.toml, link to list of supported libraries * note that narwhals is used for cross-dataframe support * test refactor * docstrings --------- Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com> * feat: make ColumnSelector dataframe-agnostic (#659) * columnselector with test rufformatted * adding whitespace * fixed the fit and transform * removed intendation in examples * font:false * feat: make `add_lags` dataframe-agnostic (#661) * make add_lags dataframe-agnostic * try getting tests to run? * patch: cvxpy 1.5.0 support (#663) --------- Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * Make `RegressionOutlier` dataframe-agnostic (#665) * make regression outlier df-agnostic * need to use eager-only for this one * pass native to check_array * remove cudf, link to check_X_y * feat: Make InformationFilter dataframe-agnostic * Make Timegapsplit dataframe-agnostic (#668) * make timegapsplit dataframe-agnostic * actually, include cuDF * feat: make FairClassifier data-agnostic (#669) * start all over * fixture working * wip * passing tests - again * pre-commit complaining * changed fixture on test_demographic_parity * feat: Make PandasTypeSelector selector dataframe-agnostic (#670) * make pandas dtype selector df-agnostic * bump version * 3.8 compat * Update sklego/preprocessing/pandastransformers.py Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * fixup pyproject.toml * unify (and test!) error message * deprecate * update readme * undo contribution.md change --------- Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com> * format typeselector and bump version * feat: Make grouped and hierarchical dataframe-agnostic (#667) * feat: make grouped and hierarchical dataframe-agnostic * add pyarrow * narwhals grouped_transformer * grouped transformer eureka * hierarchical narwhalified * so close but so far * return series instead of DataFrame for y * grouped WIP * merge branch and fix grouped * future annotations * format * handling negative indices * solve conflicts * hacking C * fairness: change C values in tests --------- Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com> Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com> Co-authored-by: Dea María Léon <deamarialeon@gmail.com>

MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from 97225a1 to 8f22154 Compare May 5, 2024 13:50

MarcoGorelli commented May 5, 2024

View reviewed changes

MarcoGorelli marked this pull request as ready for review May 5, 2024 13:57

MarcoGorelli changed the title ~~feat: make ColumnDropped dataframe-agnostic~~ feat: make ColumnDropper dataframe-agnostic May 5, 2024

feat: make ColumnDropped dataframe-agnostic

f626eaf

MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from 8f22154 to f626eaf Compare May 5, 2024 18:52

FBruzzesi reviewed May 6, 2024

View reviewed changes

pyproject.toml Outdated Show resolved Hide resolved

sklego/preprocessing/pandastransformers.py Outdated Show resolved Hide resolved

tests/test_preprocessing/test_columndropper.py Show resolved Hide resolved

use narwhals[polars] in pyproject.toml, link to list of supported lib…

36d9524

…raries

koaning reviewed May 6, 2024

View reviewed changes

sklego/preprocessing/pandastransformers.py Show resolved Hide resolved

note that narwhals is used for cross-dataframe support

a311696

MarcoGorelli force-pushed the dataframe-agnostic-columndropper branch from bf23559 to a311696 Compare May 6, 2024 20:02

test refactor

ea17b3c

FBruzzesi mentioned this pull request May 7, 2024

ci: rollback to pip #657

Merged

3 tasks

FBruzzesi added 2 commits May 7, 2024 10:04

Merge branch 'main' into dataframe-agnostic-columndropper

6d80009

docstrings

cad5fbf

FBruzzesi approved these changes May 7, 2024

View reviewed changes

FBruzzesi mentioned this pull request May 7, 2024

[FEATURE] Narwhals migration for dataframe-agnostic codebase #658

Closed

MarcoGorelli changed the base branch from main to narwhals-development May 7, 2024 18:19

FBruzzesi merged commit aac1d9d into koaning:narwhals-development May 7, 2024
17 checks passed

FBruzzesi mentioned this pull request May 18, 2024

feat: Narwhals for dataframe-agnostic codebase #671

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: make ColumnDropper dataframe-agnostic #655

feat: make ColumnDropper dataframe-agnostic #655

MarcoGorelli commented May 5, 2024 •

edited

Loading

MarcoGorelli May 5, 2024

FBruzzesi left a comment •

edited

Loading

koaning commented May 6, 2024

FBruzzesi commented May 6, 2024

FBruzzesi commented May 7, 2024 •

edited

Loading

koaning commented May 7, 2024

FBruzzesi left a comment

koaning commented May 7, 2024

anopsy commented May 7, 2024 •

edited

Loading

FBruzzesi commented May 7, 2024

MarcoGorelli commented May 7, 2024

FBruzzesi commented May 7, 2024

koaning commented May 7, 2024

feat: make ColumnDropper dataframe-agnostic #655

feat: make ColumnDropper dataframe-agnostic #655

Conversation

MarcoGorelli commented May 5, 2024 • edited Loading

Description

Type of change

Checklist:

Demo

MarcoGorelli May 5, 2024

Choose a reason for hiding this comment

FBruzzesi left a comment • edited Loading

Choose a reason for hiding this comment

koaning commented May 6, 2024

FBruzzesi commented May 6, 2024

FBruzzesi commented May 7, 2024 • edited Loading

koaning commented May 7, 2024

FBruzzesi left a comment

Choose a reason for hiding this comment

koaning commented May 7, 2024

anopsy commented May 7, 2024 • edited Loading

FBruzzesi commented May 7, 2024

MarcoGorelli commented May 7, 2024

FBruzzesi commented May 7, 2024

koaning commented May 7, 2024

MarcoGorelli commented May 5, 2024 •

edited

Loading

FBruzzesi left a comment •

edited

Loading

FBruzzesi commented May 7, 2024 •

edited

Loading

anopsy commented May 7, 2024 •

edited

Loading