How to note that some methods aren't available for lazy dataframes? #224
There was a discussion in the Array API repo a while back that was somewhat similar. From the point of view of someone using the dataframe API to write a tool that consumes dataframes (lazy or eager), you don't want to have to distinguish in your code whether a dataframe is lazy or not. It makes me think that if a user wants to know the shape or len, then they want to know the shape/len now. So a lazy dataframe would have to go away and evaluate enough to be able to answer that query.
@MarcoGorelli what does "don't work" mean? Does it force evaluation, raise an Exception, or ...? I think either choice is valid - e.g., Dask will force evaluation when needed, while the comment @betatim linked to had an implementation that needed to be 100% lazy and hence had to raise.
xref gh-195 for other lazy-implementation-specific questions
Apologies, by "doesn't work" I meant that it raises an exception. I think I'd be -1 on forcing computation for these. So the alternatives would be:

- raising an exception
- returning some kind of lazy object
For the latter, what would the lazy object be? A lazy tuple? Such a thing doesn't exist in polars. If it's something which exists only in the standard, then presumably there needs to be some way to materialise it? Anyway, I think this highlights that there's a need for discussion on this topic. But that's what we're here for! Fortunately it's on the agenda for tonight's call.
It's not hard to build though, so I don't think there's a fundamental problem there. Polars can add it if desired.
The standard does not have anything that's specified as being lazy. It only has a tuple return type, with the understanding that it may be any other object that duck types as a tuple. Materialization is library-specific.
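As a sketch (hypothetical, not part of the standard) of what such a duck-typed shape could look like - an object that defers computation until an element is actually accessed:

```python
# Hypothetical "lazy shape" that duck types as a tuple. `compute` is any
# zero-argument callable producing the real (n_rows, n_cols); what it does
# (collect, compute, raise) is up to the implementing library.
class LazyShape:
    def __init__(self, compute):
        self._compute = compute
        self._value = None

    def _materialize(self):
        # Cache the result so the query only runs once.
        if self._value is None:
            self._value = tuple(self._compute())
        return self._value

    def __len__(self):
        return 2  # always (n_rows, n_cols) - known without computing anything

    def __getitem__(self, index):
        return self._materialize()[index]

    def __iter__(self):
        return iter(self._materialize())
```

Unpacking `n_rows, n_cols = shape` would then trigger (or, in a strictly lazy library, raise from) the materialisation step.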
Sure - and if polars didn't want to add it?
Alternatively, for things like shape and length which are data-dependent, we could allow returning NULL or -1 or something to indicate that the value is unknown? For object equality comparisons, I'm not sure they're even a part of the standard yet? But in general, for APIs that return scalars that could possibly be used for control flow, what would the experience be for a lazy implementation if only an explicit API can force materialization? I.e., what would happen if you used such a scalar directly in an `if` statement?
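For illustration, the sentinel idea might look like this (the `-1` convention and the `schema` attribute are assumptions for the sketch, not actual standard API):

```python
UNKNOWN = -1  # hypothetical sentinel: "not knowable without computing"

def lazy_shape(lazy_df):
    # Column count is metadata and is knowable up front; row count is
    # data-dependent, so report it as unknown rather than forcing a scan.
    return (UNKNOWN, len(lazy_df.schema))
```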
There is dataframe-api/spec/API_specification/dataframe_api/dataframe_object.py (lines 283 to 300 in b03bb4f).
Yeah, it's tricky. I really think this needs discussion and design. Staying with polars as an example - indeed, what would that do?
But crucially, there doesn't exist a lazy scalar in polars, and I don't think there would be much appetite for adding one. We could work around this at the standard level, but this is why I'm suggesting that some methods raise.
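To make the control-flow problem concrete: Python's `if` statement unconditionally calls `__bool__`, so a hypothetical lazy scalar in a strictly lazy library has no option there but to raise:

```python
# Sketch of a hypothetical lazy scalar in a strictly lazy library.
class LazyScalar:
    def __bool__(self):
        # No value exists yet, and we refuse to compute implicitly.
        raise TypeError(
            "Cannot convert a lazy scalar to bool; materialise it "
            "explicitly (via a collect()/compute()-style API) first."
        )

# `if some_lazy_scalar:` calls __bool__ and therefore raises:
# bool(LazyScalar())  -> TypeError
```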
This opens up a whole can of worms if we want to have lazy versions of future-materializable items. Lazy scalars?
In polars that would raise (as it would in the dataframe standard - which, just for reference, is the repo this issue is in, just in case you came here by accident 😄 And if not, welcome, good to have you!). Today's call touched on a lot of topics, including forced materialisation in operations used in control flow, user experience, and data-dependent methods. There wasn't a concrete conclusion (other than that this needs more discussion), but we did say we want to punt on it and get the standard working for eager libraries first. I'd suggest that for now we at least document which methods might have implementation-dependent behaviour for lazy engines, so that if you want to write library-agnostic code, then as long as you avoid those methods, you know your code will run fine.
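As a sketch of what that could look like for a consuming library (the `shape()` call and the exception type are assumptions, following the spec excerpt and discussion above):

```python
def n_rows_or_none(df):
    # Data-dependent query: eager implementations can answer it; a strictly
    # lazy one may raise instead of computing. Treat "unknown" as a valid
    # outcome rather than an error.
    try:
        return df.shape()[0]
    except NotImplementedError:
        return None
```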
Something I did want to share, with regards to forced materialisation, was this post by Liam Brannigan: https://www.linkedin.com/feed/update/urn:li:activity:7079836009755996160?utm_source=share&utm_medium=member_desktop
By not implementing something in lazy mode, polars forced him to rethink his approach and to write more efficient code. If, conversely, polars had materialised the data under the hood for him, then he would have ended up writing less-efficient-than-possible code.
I agree that Python "forced our hand" for bool and co. However, I'd still have preferred the current outcome even if Python hadn't forced us, the reason being that being able to write array/dataframe-consuming code without having to differentiate lazy/eager (in control flow) is a big bonus. If you can get it.
First, thanks all for the productive discussions, both here and on yesterday's call 🙌 I'll add some thoughts, and summarise/re-iterate my position. Update on equality comparisons in polars-lazy: they're explicitly forbidden and now raise an exception: pola-rs/polars#10274. So, where do we go with the polars implementation of the standard? Possible solutions which come to mind are: …
To expand on why I'm currently against option 1: first, it goes against Polars' very-well-thought-out design. On the other hand, both options 2 and 3 seem like a clear improvement over the current status quo, so I'd welcome them. I'm open to changing my mind though - happy to admit that I might be wrong.
Part of the Polars design philosophy is that you shouldn't accidentally trigger expensive computations. Any consumer - be it a user or a library - should definitely care to distinguish whether a dataframe is lazy or not. If the functionality you are writing requires a materialized dataframe, I would suggest not accepting lazy dataframes as input.
Pointing out that the same is true for the width of a LazyFrame: perhaps we should allow that?
Thanks both for your inputs! Regarding the number of columns - yes, and that's already allowed.
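For instance, with polars (a sketch against the API of that era; newer polars versions steer you towards `collect_schema()` instead):

```python
import polars as pl

lf = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).lazy()

# The schema - and hence the number of columns - is metadata,
# available without collecting anything:
print(lf.schema)        # {'a': Int64, 'b': Utf8} (dtype repr varies by version)
print(len(lf.columns))  # 2

# The number of rows, by contrast, is data-dependent:
print(lf.collect().height)  # 3 - requires explicit materialisation
```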
Yup, totally agree - I'm all for allowing users to do costly things, so long as they intentionally opt in to them.
At the dataframe summit I chatted about this with some people from other libraries. Looks like Dask actually raises here, rather than triggering computation:

In [1]: import pandas as pd
   ...: import dask.dataframe as dd
   ...: # from dask.datasets import timeseries
   ...:
   ...: pdf = pd.DataFrame({"x": [1, 2, 3], "y": 1})
   ...: df = dd.from_pandas(pdf, npartitions=2)
   ...: if df.x.mean() > 0:
   ...:     # do something
   ...:     pass
   ...: else:
   ...:     # do something else
   ...:     pass
   ...:
---------------------------------------------------------------------------
TypeError: Trying to convert dd.Scalar<gt-2f5a..., dtype=bool> to a boolean value. Because Dask objects are lazily evaluated, they cannot be converted to a boolean value or used in boolean conditions like if statements. Try calling .compute() to force computation prior to converting to a boolean value or using in a conditional statement.

They also advise calling `.compute()` to force computation explicitly. So, for compatibility with both dask.dataframe and polars (lazy), I'm renewing my suggestion of not forcing computation on these methods.
We are working on building a lazy dataframe library based on ONNX, in parallel to the lazy array library mentioned in data-apis/array-api#642. The point of ONNX is to create and serialize a computational graph ahead of time, i.e. before any eager values are available at all. It is therefore impossible for our use case to ever eagerly compute anything, even if we wanted to. Using our library, it would be reasonable (albeit a bit contrived) to create a computational graph (and export it to an ONNX model) that ultimately outputs the shape of a data frame. The same would apply to other scalar values. The essence of this comment is that we need the abstraction of lazy/duck-typed shape-tuples and scalars in the standard in order to cover our use case. I think many of the points raised in data-apis/array-api#642 carry over to this discussion.
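A minimal toy sketch of why ahead-of-time graph construction can never materialise (the names here are illustrative, not our actual library):

```python
# Toy graph-builder: every value is a symbolic node, and the graph is
# exported before any data exists, so there is nothing to compute from.
class Node:
    def __init__(self, op, *inputs):
        self.op = op
        self.inputs = inputs

table = Node("input")                # a dataframe with no data behind it yet
n_rows = Node("shape_rows", table)   # the row count is itself a graph node

# int(n_rows) cannot work here: the value only exists once the exported
# model is run by a consumer, long after this Python program has exited.
```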
Thanks all for the discussions. Can we agree that the dataframe API should be kind of like a "zero-cost abstraction"? In the sense that if you can write something using the standard API, then it should be just as performant as if you'd used the original library. Because I don't think the current design of the standard API allows that. Here's an example I have in mind:

def my_agnostic_plotting_function(df):
    df = df.__dataframe_consortium_standard__()
    df = (
        df.big_computation_1()
        .big_computation_2()
        .big_computation_3()
    )
    plot_df(df)

If I was writing a library which accepts a polars DataFrame or LazyFrame, I would write it as follows:

def my_polars_plotting_function(df):
    if isinstance(df, pl.DataFrame):
        df = df.lazy()
    df = (
        df.big_computation_1()
        .big_computation_2()
        .big_computation_3()
        .collect()
    )
    plot_df(df)

which would be more performant. This is because it would still use query optimisation for the chaining of the big computations. On the other hand, if `lazy` and `collect` were part of the standard, I could write:

def my_agnostic_function(df):
    df = df.__dataframe_consortium_standard__()
    df = (
        df.lazy()
        .big_computation_1()
        .big_computation_2()
        .big_computation_3()
        .collect()
    )
    plot_df(df)
I think as a dataframe-consuming library (like scikit-learn) it would be annoying to have to write different code for lazy and eager dataframes. So the thing I care about is being able to write code that works with both. I have no (educated) opinion on whether this means an eager dataframe should behave like a lazy one (aka you always call `collect` when you need a concrete result). For my education: if all dataframes behave as if they were lazy, why do we need the `lazy` method?
Thanks for explaining. In that case: because the user might pass an eager dataframe to your library. If you wanted to do several computations on it, it would be more efficient to do them lazily. So you either:

- call `lazy` when the dataframe comes in, do the computations, and `collect` at the end, or
- operate eagerly throughout and miss out on query optimisation
The other discussion point was what to do about lazy columns / column reductions. I have a suggestion here which aims to address that: #229
My expectation would be that I'd always write

    df = df.__dataframe_consortium_standard__()
    df = df.lazy()

together, so might as well combine them.
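In other words, something like this hypothetical combined helper:

```python
# Hypothetical: one entry point that returns the standard-compliant
# dataframe already in lazy mode (the name is illustrative only).
def standard_lazy(df):
    return df.__dataframe_consortium_standard__().lazy()
```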
Hmm, I do like the sound of that!
We may have a way forward - or at least, a proposal. See #249
I'll just raise in the pandas/polars implementations - no big deal. Alternative suggestions are welcome.
Here are some examples which ~~don't work~~ raise with polars lazyframes:

1. `DataFrame.shape`
2. `Column.__len__` (same as above)
3. … (but there may be a way around this)
4. …

I'd like to think we can work round points 3 and 4. But not 1 and 2 - not really sure how that could work.

I know we want everything to be independent of the execution engine, but how do we plan for these to work for lazy engines?