feat: make ColumnDropper dataframe-agnostic #655
@staticmethod
def _check_X_for_type(X):
    """Checks if input of the Selector is of the required dtype"""
    if not isinstance(X, pd.DataFrame):
        raise TypeError("Provided variable X is not of type pandas.DataFrame")
I'm removing this here, as `nw.from_native` already raises a similar error. For example, if you tried passing in a numpy array here, you'd get:
TypeError: Expected pandas-like dataframe, Polars dataframe, or Polars lazyframe, got: <class 'numpy.ndarray'>
Honestly it looks quite neat! I left some considerations for better sync with narwhals in the long term.
Edit: Regarding the macOS tests, we are getting a lot of failures since the latest upgrade to macos-14. We definitely need to take a closer look at those, but they are completely unrelated to this PR.
@FBruzzesi notice those failing test runs, I have a creeping suspicion that these may be caused by |
@koaning I finally took a look at the CI...and you are right 😁
As a side note: even with pip, the |
@koaning another food for thought is that... all these transformers are in a |
@FBruzzesi any objections to moving back to [...]? I made an issue for that v1.5rc01 test failure on the sklearn side; it feels like it might be breaking, so upstream should know about it.
Totally!
I am biased as I added a few commits myself 😂
Do we want to release early with a micro release by the way or would we prefer to make a bigger splash? There's something to be said for both, but given that we are introducing a dependency ... might be good to make a larger release?
If everybody is fine with it, I could finish pandastransformers.py, or at least have a shot at ColumnSelector and see how it goes.
Just got a question about
No objections to waiting for a bigger release to include this. I'm just worried that it's going to become quite hard to review if everything is done in a single PR, or if PRs build on top of other open PRs. May I suggest that you (i.e. a scikit-lego committer) make a [...]
@MarcoGorelli that's very reasonable! Here you have it:
FWIW I think pushing to main is fine too. It's more of a "oh, we have one new feature, usually that means deploy". But we can also collect a bunch of things to main ... this is the main thing we're going to work on short-term.
* placeholder to develop narwhals features
* feat: make `ColumnDropper` dataframe-agnostic (#655)
* feat: make ColumnDropped dataframe-agnostic
* use narwhals[polars] in pyproject.toml, link to list of supported libraries
* note that narwhals is used for cross-dataframe support
* test refactor
* docstrings
---------
Co-authored-by: FBruzzesi <francesco.bruzzesi.93@gmail.com>
* feat: make ColumnSelector dataframe-agnostic (#659)
* columnselector with test rufformatted
* adding whitespace
* fixed the fit and transform
* removed intendation in examples
* font:false
* feat: make `add_lags` dataframe-agnostic (#661)
* make add_lags dataframe-agnostic
* try getting tests to run?
* patch: cvxpy 1.5.0 support (#663)
---------
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* Make `RegressionOutlier` dataframe-agnostic (#665)
* make regression outlier df-agnostic
* need to use eager-only for this one
* pass native to check_array
* remove cudf, link to check_X_y
* feat: Make InformationFilter dataframe-agnostic
* Make Timegapsplit dataframe-agnostic (#668)
* make timegapsplit dataframe-agnostic
* actually, include cuDF
* feat: make FairClassifier data-agnostic (#669)
* start all over
* fixture working
* wip
* passing tests - again
* pre-commit complaining
* changed fixture on test_demographic_parity
* feat: Make PandasTypeSelector selector dataframe-agnostic (#670)
* make pandas dtype selector df-agnostic
* bump version
* 3.8 compat
* Update sklego/preprocessing/pandastransformers.py
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* fixup pyproject.toml
* unify (and test!) error message
* deprecate
* update readme
* undo contribution.md change
---------
Co-authored-by: Francesco Bruzzesi <42817048+FBruzzesi@users.noreply.github.com>
* format typeselector and bump version
* feat: Make grouped and hierarchical dataframe-agnostic (#667)
* feat: make grouped and hierarchical dataframe-agnostic
* add pyarrow
* narwhals grouped_transformer
* grouped transformer eureka
* hierarchical narwhalified
* so close but so far
* return series instead of DataFrame for y
* grouped WIP
* merge branch and fix grouped
* future annotations
* format
* handling negative indices
* solve conflicts
* hacking C
* fairness: change C values in tests
---------
Co-authored-by: Marco Edward Gorelli <marcogorelli@protonmail.com>
Co-authored-by: Magdalena Anopsy <74981211+anopsy@users.noreply.github.com>
Co-authored-by: Dea María Léon <deamarialeon@gmail.com>
Hey 👋 I discussed a bit with Francesco, who discussed it with Vincent. So, I was keen to get the ball rolling
Description
Making a start towards dataframe-agnosticism. For now, this only makes `ColumnDropper` dataframe-agnostic. It requires adding an extremely lightweight dependency (Narwhals), but the result is that computation can happen natively for pandas/Polars/modin/cuDF (and any other library which may want to become Narwhals-compliant), without any data conversion.
pandas is still a required dependency, as it's required for the rest of scikit-lego. But it doesn't seem that far-fetched to be able to make all of scikit-lego dataframe-agnostic. If you're open to this, we could do it in stages?
There is no impact on current users, other than that Narwhals would be a required dependency (though it's really lightweight and has no dependencies itself, so there's no risk of conflicts).
Type of change
Checklist:
Not sure about some of these items so I haven't checked them.
Demo
pandas users can keep using `scikit-lego` without needing Polars to be installed.
But Polars users can use this too, without it doing any conversion to pandas. In particular, Polars LazyFrame is supported (and stays lazy, but here I'm calling `.collect` to show you the output):