Add dataframe dispatch #888
Conversation
see more details in the dispatch module docstring
hey - so, in #786 (comment) I was asked for my thoughts on this. First reaction to the dispatch mechanism - well done! This looks broadly useful beyond skrub; perhaps it could be its own separate package? A lot of libraries are now trying to support both pandas and polars, and so a community of people interested in being able to reuse code may well arise?

> first reaction to the dispatch mechanism - well done! this looks broadly useful, beyond skrub, perhaps it could be its own separate package? a lot of libraries are now trying to support both pandas and polars and so a community of people interested in being able to reuse code may well arise?

Thanks a lot for having a look! Your suggestion makes a lot of sense, but then I wonder if the scope would overlap with the dataframe-api-compat package... I guess we could start using the dispatch internally in skrub and see how it plays out in practice. And then, if we find it convenient to work with, consider moving it out?
I'm tempted to repurpose what I'd worked on as a `polars_api_compat` package. The API I had in mind was:

```python
def my_agnostic_function(df):
    dfx, plx = polars_api_compat.convert(df, api_version="0.20")
    # use some stable subset of Polars API, with `plx` as namespace
    # (var_1 is a filter threshold defined elsewhere)
    result = (
        dfx.filter(plx.col("l_shipdate") <= var_1)
        .group_by("l_returnflag", "l_linestatus")
        .agg(
            plx.sum("l_quantity").alias("sum_qty"),
            plx.sum("l_extendedprice").alias("sum_base_price"),
            (plx.col("l_extendedprice") * (1 - plx.col("l_discount")))
            .sum()
            .alias("sum_disc_price"),
            (
                plx.col("l_extendedprice")
                * (1.0 - plx.col("l_discount"))
                * (1.0 + plx.col("l_tax"))
            )
            .sum()
            .alias("sum_charge"),
            plx.mean("l_quantity").alias("avg_qty"),
            plx.mean("l_extendedprice").alias("avg_price"),
            plx.mean("l_discount").alias("avg_disc"),
            plx.len().alias("count_order"),
        )
        .sort("l_returnflag", "l_linestatus")
    ).collect()
    # return result in original dataframe class
    return result.dataframe


my_agnostic_function(pandas.read_parquet("lineitem.parquet"))  # works, returns a pandas dataframe
my_agnostic_function(polars.scan_parquet("lineitem.parquet"))  # works, returns a polars dataframe
```
agree - start with getting something working, add tests, and then consider generalising 👍
I think that would be very useful, yes. Indeed a compatibility layer for a couple of packages might be achievable faster than a more general standard.
Hey! @MarcoGorelli pointed me here. I've been working on solutions to this problem, documented in a tool called databackend. It supports python singledispatch without requiring an import of the underlying library being dispatched on. (I've vendored it into a tool called Great Tables, and used it to create a polars/pandas function layer in great_tables._tbl_data.py.) I've been working on how to get good type hints, and think there's a pretty good solution (see this comment in this plum issue). Happy to work more on this, but wanted to drop what I've got on this type of problem!
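(For context, a minimal sketch of the idea behind this style of dispatch, not databackend's actual API: a virtual base class recognises the backend type by module and class name, so `functools.singledispatch` can register implementations without ever importing pandas. The names `AbstractPandasFrame` and `n_rows` are illustrative.)

```python
from abc import ABCMeta
from functools import singledispatch


class AbstractPandasFrame(metaclass=ABCMeta):
    # Virtual base class: matches pandas.DataFrame by module and class
    # name, so pandas itself never has to be imported here.
    @classmethod
    def __subclasshook__(cls, subclass):
        if (
            subclass.__module__.split(".")[0] == "pandas"
            and subclass.__name__ == "DataFrame"
        ):
            return True
        return NotImplemented


@singledispatch
def n_rows(df):
    raise TypeError(f"unsupported dataframe type: {type(df)}")


@n_rows.register(AbstractPandasFrame)
def _(df):
    # Only runs when the argument really is a pandas DataFrame.
    return len(df.index)
```

With this, `n_rows(pandas.DataFrame({"a": [1]}))` dispatches to the pandas implementation even though the module defining `n_rows` never imports pandas.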
Thanks for sharing this, @machow! I guess if we look at it from a distance we landed on somewhat similar solutions, with a decorator for defining generic functions and registering single-dispatch implementations without needing to import the actual types, plus a private module of generic helper functions defined with this dispatch mechanism. The …
In the short term for skrub, I suggest we move forward with the …
One thing that will need to be adjusted is that, as it stands, this PR defines a few functions that accept and return "Columns", ie pandas or polars Series.
Totally agree! If you start with this and get it working, then it's going to be a lot easier for me to take a step back and ask "what's skrub doing, and what can we abstract into an easily vendorable reusable solution?"
A couple of thoughts regarding the dispatcher.
```python
def _load_dataframe_module_info(name):
    # if the module is not installed, import errors are propagated
    if name == "pandas":
```
I'm wondering if this is overkill for now, but I would register the supported versions into a `dict`-like class. This might be one step closer to registering another backend by just writing the specialized version.
(Maybe going towards over-engineering): we could do something similar to what scikit-learn does with `set_output` (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py) by creating a manager: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py#L183-L197.
As it looks now, we return a dictionary for the moment, so we don't really need a protocol, but we could use a dataclass instead.
do you mean to allow dynamically registering backends? Unlike scikit-learn's `set_output`, here adding a backend involves defining many functions and adding them to the appropriate skrub modules, so I don't think that is likely to happen dynamically. So my instinct would be to keep things simple until we need something more, but I may have misunderstood the goal of what you suggest.
> do you mean to allow dynamically registering backends?

Yes, that's what I meant. But as I said, probably some over-engineering for the moment. Maybe using a dataclass instead of a dict could better express what those should return.
skrub/_dispatch.py
```python
return {
    "module": pandas,
    "types": {
        "DataFrame": [pandas.DataFrame],
```
I would tend to have tuples instead of something mutable
ok! note the dict that contains it is mutable too; should I replace that with a `MappingProxyType` to make it less easily mutable, too?
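(For reference, a minimal sketch of what such a read-only view would look like; the dict contents here are just illustrative:)

```python
import pandas
from types import MappingProxyType

# A read-only view over a regular dict; writes raise TypeError.
types = MappingProxyType({"DataFrame": (pandas.DataFrame,)})
types["Series"] = (pandas.Series,)  # TypeError: 'mappingproxy' object does not support item assignment
```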
the only function that accesses this dict is the `dispatch` function itself, so it's relatively easy to make sure it doesn't modify the dict
> the dict that contains it is mutable too

It is where I was thinking of a dataclass, indeed.
> the only function that accesses this dict is the dispatch function itself so it's relatively easy to make sure it doesn't modify the dict

Yep, I'm not worried about it being modified. But having an immutable type shows the reader that it is not supposed to be modified.
sounds good! When something doesn't need to be hashable, I tend to use tuples for more structured records (ie something I'd tend to unpack) and lists for sequences of arbitrary length (ie something I'd tend to iterate over). But I get your point about showing it's not supposed to be modified; I'll make the change.
> It is where I was thinking of a dataclass indeed.

the "types" dict doesn't need to have the same keys for all backends though (eg there is no LazyFrame in pandas), so do you mean something like

```python
@dataclass
class ModuleInfo:
    name: str
    types: dict[str, tuple[type]]
```

?
yes, that's what I had in mind.
skrub/_dispatch.py
```python
if generic_type_names is None:
    generic_type_names = list(module_info["types"].keys())
if isinstance(generic_type_names, str):
    generic_type_names = [generic_type_names]
```
a tuple as well
skrub/tests/_test_dispatch.py

```python
import pytest
```
is there a reason to start the name of the file with `_`?
Is the file going to be discovered automatically by pytest?
no reason, that's a typo :)
> Totally agree! If you start with this and get it working, then it's going to be a lot easier for me to take a step back and ask "what's skrub doing, and what can we abstract into an easily vendorable reusable solution?"

That's exactly what I hope will happen: we demonstrate something that works for us on a couple of dataframe models, then call on the community to see how it can be abstracted away and extended to new dataframe models.
By the way, London is not very far from Paris; if/when we organize a sprint, we'll ping you :)
```python
def test_skrub_namespace(df_module):
```
I think it would be worth stating where this feature comes from in a module docstring.
just to make sure I understood correctly: you mean state in `test_common.__doc__` that `df_module` is a pytest fixture defined in `conftest.py`?
> you mean state in test_common.__doc__ that df_module is a pytest fixture defined in conftest.py?

Just at the top of the file, adding a multiline docstring with the info that you mentioned. "module docstring" was a bit vague.
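(A hypothetical sketch of such a docstring; the fixture description is an assumption based on the discussion above:)

```python
"""Tests for the generic dataframe helpers.

The ``df_module`` argument used throughout this module is a pytest
fixture defined in ``conftest.py``, parametrized over the supported
dataframe backends (pandas and polars).
"""
```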
so does this mean that in the tablevectorizer we could always special-case bool columns and just cast them to float, or something like that?
> and in the tablevectorizer we could always special-case bool columns and just cast them to float or something like that

I think that would be nice.
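(A minimal pandas-only sketch of that special-casing; illustrative only, not the actual TableVectorizer code:)

```python
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True], "x": [1.0, 2.0, 3.0]})
bool_cols = df.select_dtypes(include="bool").columns
# Cast boolean columns to float so downstream numeric handling applies.
df[bool_cols] = df[bool_cols].astype("float64")
```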
As this is an internal addition (and is not yet used to add polars support in any of the skrub estimators), I don't think it needs a whatsnew entry. Apart from that, @glemaitre you can have another look.
should …
I would tend to say no, but there is maybe a consideration to have here: are pandas and polars behaving the same?
it depends on the input; when converting strings they output int64 or float64
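(For illustration, with pandas:)

```python
import pandas as pd

pd.to_numeric(pd.Series(["1", "2"])).dtype    # int64
pd.to_numeric(pd.Series(["1.5", "2"])).dtype  # float64
```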
I just found this issue: #870. If we want to go this way then, we can make …
LGTM.
@GaelVaroquaux do you want to have a final look, or are we good to merge as-is?
Merged #888 into main.
Whoohoo! Congratulations!
Hey all - just wanted to raise that I'll likely be getting 3 months' funding to work on Narwhals with 2 interns 🥳 If this would be useful to you and there's anything you'd like to see in it, please do give me a shout - anything that'd be useful to open-source projects is considered in scope (so long as it's already in Polars and easy enough to do in pandas). The current API reference is here: https://marcogorelli.github.io/narwhals/api-reference/. I think it covers most of what you have here? I hope I'm not coming across as trying to "force" Narwhals onto you; I'm just raising this in case it may help you reduce your cross-dataframe maintenance and focus on skrub's main mission.
I hope to be able to make it to PyData Paris, hopefully we can meet there!
Thanks for letting us know @MarcoGorelli! That's great news. Indeed it seems to cover most of the subset of the polars API we might need, and we should definitely consider relying on it if/when the small set of private skrub functions we've added is not enough, maintaining it becomes a burden, using module-level functions becomes annoying and we want methods on generic dataframe objects as offered by Narwhals, or we want something that exactly matches the polars API.
The goal of this PR is to make it easy to support both polars and pandas in skrub.

To make a function compatible with both polars and pandas, we use the `dispatch` decorator. The function then has an attribute `specialize` which we can use to register implementations for polars or for pandas (or for other backends we may add in the future).

Compared to the current approach of having a `_pandas` and a `_polars` module and a function `_get_df_namespace` which returns the module corresponding to a dataframe, this has several advantages:

- Call sites need to know only what a function does, not where to import it from and how to call it for each backend. This allows changing how a function is implemented (eg introducing the dataframe API) without changing all call sites.
- Contributors are not forced to put their helpers that need to be dispatched in the `_pandas` and `_polars` modules. Functions can be grouped by functionality (as usual) rather than by backend.