Add dataframe dispatch #888
Conversation
see more details in the dispatch module docstring
hey - so, in #786 (comment) I was asked for my thoughts on this. First reaction to the dispatch mechanism - well done! This looks broadly useful beyond skrub; perhaps it could be its own separate package? A lot of libraries are now trying to support both pandas and polars, and so a community of people interested in being able to reuse code may well arise?

> first reaction to the dispatch mechanism - well done! this looks broadly useful, beyond skrub, perhaps it could be its own separate package? a lot of libraries are now trying to support both pandas and polars and so a community of people interested in being able to reuse code may well arise?

Thanks a lot for having a look! Your suggestion makes a lot of sense, but then I wonder if the scope would overlap with the dataframe-api-compat package... I guess we could start using the dispatch internally in skrub and see how it plays out in practice. And then, if we find it convenient to work with, consider moving it out?
I'm tempted to repurpose what I'd worked on as a `polars_api_compat` package. The API I had in mind was:

```python
def my_agnostic_function(df):
    dfx, plx = polars_api_compat.convert(df, api_version="0.20")
    # use some stable subset of Polars API, with `plx` as namespace
    # (var_1 is a filter threshold defined elsewhere)
    result = (
        dfx.filter(plx.col("l_shipdate") <= var_1)
        .group_by("l_returnflag", "l_linestatus")
        .agg(
            plx.sum("l_quantity").alias("sum_qty"),
            plx.sum("l_extendedprice").alias("sum_base_price"),
            (plx.col("l_extendedprice") * (1 - plx.col("l_discount")))
            .sum()
            .alias("sum_disc_price"),
            (
                plx.col("l_extendedprice")
                * (1.0 - plx.col("l_discount"))
                * (1.0 + plx.col("l_tax"))
            )
            .sum()
            .alias("sum_charge"),
            plx.mean("l_quantity").alias("avg_qty"),
            plx.mean("l_extendedprice").alias("avg_price"),
            plx.mean("l_discount").alias("avg_disc"),
            plx.len().alias("count_order"),
        )
        .sort("l_returnflag", "l_linestatus")
    ).collect()
    # return result in original dataframe class
    return result.dataframe


my_agnostic_function(pandas.read_parquet("lineitem.parquet"))  # works, returns a pandas dataframe
my_agnostic_function(polars.scan_parquet("lineitem.parquet"))  # works, returns a polars dataframe
```
agree - start with getting something working, add tests, and then consider generalising 👍
I think that would be very useful, yes. Indeed a compatibility layer for a couple of packages might be achievable faster than a more general standard.
Hey! @MarcoGorelli pointed me here. I've been working on solutions to this problem, documented in a tool called databackend. It supports python singledispatch without requiring an import of the underlying library being dispatched on. (I've vendored it into a tool called Great Tables, and used it to create a polars/pandas function layer in great_tables._tbl_data.py.) I've been working on how to get good type hints, and think there's a pretty good solution (see this comment in this plum issue). Happy to work more on this, but wanted to drop what I've got on this type of problem!
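(For context, a minimal sketch of the idea behind this style of dispatch, not databackend's actual API: a virtual base class recognises the backend type by module and class name, so `functools.singledispatch` can register implementations without ever importing pandas. The names `AbstractPandasFrame` and `n_rows` are illustrative.)

```python
from abc import ABCMeta
from functools import singledispatch


class AbstractPandasFrame(metaclass=ABCMeta):
    # Virtual base class: matches pandas.DataFrame by module and class
    # name, so pandas itself never has to be imported here.
    @classmethod
    def __subclasshook__(cls, subclass):
        if (
            subclass.__module__.split(".")[0] == "pandas"
            and subclass.__name__ == "DataFrame"
        ):
            return True
        return NotImplemented


@singledispatch
def n_rows(df):
    raise TypeError(f"unsupported dataframe type: {type(df)}")


@n_rows.register(AbstractPandasFrame)
def _(df):
    # Only runs when the argument really is a pandas DataFrame.
    return len(df.index)
```

With this, `n_rows(pandas.DataFrame({"a": [1]}))` dispatches to the pandas implementation even though the module defining `n_rows` never imports pandas.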
Thanks for sharing this, @machow! I guess if we look at it from a distance we landed on somewhat similar solutions, with a decorator for defining generic functions and registering single-dispatch implementations without needing to import the actual types, plus a private module of generic helper functions defined with this dispatch mechanism. The …
In the short term for skrub, I suggest we move forward with the …
One thing that will need to be adjusted is that, as it stands, this PR defines a few functions that accept and return "Columns", ie pandas or polars Series.
Totally agree! If you start with this and get it working, then it's going to be a lot easier for me to take a step back and ask "what's skrub doing, and what can we abstract into an easily vendorable reusable solution?"
A couple of thoughts regarding the dispatcher.
```python
def _load_dataframe_module_info(name):
    # if the module is not installed, import errors are propagated
    if name == "pandas":
```
I'm wondering if this is overkill for now, but I would register the supported versions into a `dict`-like class. This might be one step closer to registering another backend by just writing the specialized version.
(Maybe going towards over-engineering): we could do something similar to what scikit-learn does with `set_output` (https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py) by creating a manager: https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/utils/_set_output.py#L183-L197.
As it looks now, we return a dictionary for the moment, so we don't really need a protocol, but we could use a dataclass instead.
do you mean to allow dynamically registering backends? Unlike scikit-learn's `set_output`, here adding a backend involves defining many functions and adding them to the appropriate skrub modules, so I don't think that is likely to happen dynamically. So my instinct would be to keep things simple until we need something more, but I may have misunderstood the goal of what you suggest.
> do you mean to allow dynamically registering backends?

Yes, that's what I meant. But as I said, probably some over-engineering for the moment. Maybe using a dataclass instead of a dict could better express what those should return.
skrub/_dispatch.py
```python
return {
    "module": pandas,
    "types": {
        "DataFrame": [pandas.DataFrame],
```
I would tend to have tuples instead of something mutable
ok! note the dict that contains it is mutable too; should I replace that with a `MappingProxyType` to make it less easily mutable, too?
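(For reference, a minimal sketch of what such a read-only view would look like; the dict contents here are just illustrative:)

```python
import pandas
from types import MappingProxyType

# A read-only view over a regular dict; writes raise TypeError.
types = MappingProxyType({"DataFrame": (pandas.DataFrame,)})
types["Series"] = (pandas.Series,)  # TypeError: 'mappingproxy' object does not support item assignment
```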
the only function that accesses this dict is the `dispatch` function itself, so it's relatively easy to make sure it doesn't modify the dict
> the dict that contains it is mutable too

It is where I was thinking of a dataclass, indeed.
> the only function that accesses this dict is the dispatch function itself so it's relatively easy to make sure it doesn't modify the dict

Yep, I'm not worried about it being modified. But having an immutable type shows the reader that it is not supposed to be modified.
sounds good! When something doesn't need to be hashable, I tend to use tuples for more structured records (ie something I'd tend to unpack) and lists for sequences of arbitrary length (ie something I'd tend to iterate over). But I get your point about showing it's not supposed to be modified; I'll make the change.
> It is where I was thinking of a dataclass indeed.

the "types" dict doesn't need to have the same keys for all backends though (eg there is no LazyFrame in pandas), so do you mean something like

```python
@dataclass
class ModuleInfo:
    name: str
    types: dict[str, tuple[type]]
```

?
yes, that's what I had in mind.
skrub/_dispatch.py
```python
if generic_type_names is None:
    generic_type_names = list(module_info["types"].keys())
if isinstance(generic_type_names, str):
    generic_type_names = [generic_type_names]
```
a tuple as well
skrub/tests/_test_dispatch.py

```python
import pytest
```
is there a reason to start the name of the file with `_`?
Is the file going to be discovered automatically by pytest?
no reason, that's a typo :)
> Totally agree! If you start with this and get it working, then it's going to be a lot easier for me to take a step back and ask "what's skrub doing, and what can we abstract into an easily vendorable reusable solution?"

That's exactly what I hope will happen: we demonstrate something that works for us on a couple of dataframe models, then call on the community to see how it can be abstracted away and extended to new dataframe models.
By the way, London is not very far from Paris; if/when we organize a sprint, we'll ping you :)
```python
def test_skrub_namespace(df_module):
```
I think it would be worth stating where this feature comes from in a module docstring.
just to make sure I understood correctly: you mean state in `test_common.__doc__` that `df_module` is a pytest fixture defined in `conftest.py`?
> you mean state in test_common.__doc__ that df_module is a pytest fixture defined in conftest.py?

Just at the top of the file, adding a multiline docstring with the info that you mentioned. "module docstring" was a bit vague.
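(A hypothetical sketch of such a docstring; the fixture description is an assumption based on the discussion above:)

```python
"""Tests for the generic dataframe helpers.

The ``df_module`` argument used throughout this module is a pytest
fixture defined in ``conftest.py``, parametrized over the supported
dataframe backends (pandas and polars).
"""
```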
so does this mean that in the tablevectorizer we could always special-case bool columns and just cast them to float, or something like that?
> and in the tablevectorizer we could always special-case bool columns and just cast them to float or something like that

I think that would be nice.
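(A minimal pandas-only sketch of that special-casing; illustrative only, not the actual TableVectorizer code:)

```python
import pandas as pd

df = pd.DataFrame({"flag": [True, False, True], "x": [1.0, 2.0, 3.0]})
bool_cols = df.select_dtypes(include="bool").columns
# Cast boolean columns to float so downstream numeric handling applies.
df[bool_cols] = df[bool_cols].astype("float64")
```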
As this is an internal addition (and is not yet used to add polars support in any of the skrub estimators), I don't think it needs a whatsnew entry. Apart from that, @glemaitre you can have another look.
should …
I would tend to say no, but there is maybe a consideration to have here: are pandas and polars behaving the same?
it depends on the input; when converting strings they output int64 or float64
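(For illustration, with pandas:)

```python
import pandas as pd

pd.to_numeric(pd.Series(["1", "2"])).dtype    # int64
pd.to_numeric(pd.Series(["1.5", "2"])).dtype  # float64
```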
I just found this issue: #870. If we want to go this way then, we can make …
LGTM.
@GaelVaroquaux do you want to have a final look, or are we good to merge as-is?
Merged #888 into main.
Whoohoo! Congratulations!
Hey all - just wanted to raise that I'll likely be getting 3 months' funding to work on Narwhals with 2 interns 🥳 If this would be useful to you and there's anything you'd like to see in it, please do give me a shout - anything that'd be useful to open-source projects is considered in scope (so long as it's already in Polars and easy enough to do in pandas). The current API reference is here: https://marcogorelli.github.io/narwhals/api-reference/. I think it covers most of what you have here? I hope I'm not coming across as trying to "force" Narwhals onto you; I'm just raising this in case it may help you reduce your cross-dataframe maintenance and focus on skrub's main mission.
I hope to be able to make it to PyData Paris, hopefully we can meet there!
Thanks for letting us know @MarcoGorelli! That's great news. Indeed it seems to cover most of the subset of the polars API we might need, and we should definitely consider relying on it if/when the small set of private skrub functions we've added is not enough, maintaining it becomes a burden, using module-level functions becomes annoying and we want methods on generic dataframe objects as offered by Narwhals, or we want something that exactly matches the polars API.
The goal of this PR is to make it easy to support both polars and pandas in skrub.

To make a function compatible with both polars and pandas, we use the `dispatch` decorator. The function then has an attribute `specialize` which we can use to register implementations for polars or for pandas (or for other backends we may add in the future).

Compared to the current approach of having a `_pandas` and a `_polars` module and a function `_get_df_namespace` which returns the module corresponding to a dataframe, this has several advantages:

- Call sites need to know only what a function does, not where to import it from and how to call it for each backend. This allows changing how a function is implemented (eg introducing the dataframe API) without changing all call sites.
- Contributors are not forced to put their helpers that need to be dispatched in the `_pandas` and `_polars` modules. Functions can be grouped by functionality (as usual) rather than by backend.