Question: Lazy vs eager APIs #195

Closed
jaychia opened this issue Jul 6, 2023 · 5 comments

Comments

jaychia commented Jul 6, 2023

Hi! I am one of the maintainers of the Daft dataframe library (https://github.com/Eventual-Inc/Daft).

Daft is both distributed and lazy - I'm wondering if you foresee any complications because of this?

Having read through the proposed API, it seems that a user's flow might be as follows (sketched in code below):

  1. Create a Daft dataframe (e.g. df.read_parquet(...))
  2. Cast to the dataframe-API standard (standard_df = df.__dataframe_standard__())
  3. Call a bunch of standard APIs (standard_df = standard_df.insert(...))
  4. Cast back to a Daft dataframe for execution (df = standard_df.dataframe.collect())
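A minimal sketch of that flow, using hypothetical names lifted from the steps above (neither the Daft entry points nor the standard's methods are confirmed here):

```python
import daft  # assuming Daft is installed; entry points are illustrative

df = daft.read_parquet("data/*.parquet")   # 1. build a lazy Daft dataframe
standard_df = df.__dataframe_standard__()  # 2. enter the standard API
standard_df = standard_df.insert(...)      # 3. standard calls (arguments
                                           #    elided, as in the step above)
df = standard_df.dataframe                 # 4. unwrap back to Daft...
df.collect()                               #    ...and trigger execution
```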

Additionally, Daft has a slightly different concept of a "column". Similar to PySpark and Polars, we have the concept of Expressions; calling col("x").all(), for example, would likely not return a Boolean (as expected by the eagerly executed column API) but instead a new "scalar value" column object.
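For contrast, a hedged sketch of the eager vs. lazy behaviour described above (whether Daft exposes .all() on expressions exactly like this is an assumption taken from the example):

```python
# Eager column API: df.col("x").all() -> a plain Python bool, computed now.
# Lazy expression API (Daft-style):
from daft import col  # Daft's expression constructor

expr = col("x").all()  # builds a deferred expression node; no data is read
                       # and no Python bool exists at this point
print(type(expr))      # an Expression / "scalar value" column, not bool
```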

Is this the intention of the design? I'm curious how this would work in lazy/distributed dataframes such as Daft, which have very different considerations in terms of when we want to run operations and when we want to collect results back to the client machine, both of which are potentially very expensive operations.

kkraus14 (Collaborator) commented Jul 7, 2023

> Similar to PySpark and Polars, we have the concept of Expressions; calling col("x").all(), for example, would likely not return a Boolean (as expected by the eagerly executed column API) but instead a new "scalar value" column object.

The idea is that libraries should be able to implement a duck-typed Boolean, which would let them return their own "scalar value" objects that could be lazy, on the GPU, or specially implemented in any other way relevant to the implementing library, while still guaranteeing that they can be used as normal Python scalar values when needed.
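As an illustration only (this LazyBool class is hypothetical, not part of the standard), a duck-typed Boolean can defer its computation until Python forces a concrete truth value:

```python
class LazyBool:
    """A stand-in for a lazy/GPU-backed scalar that ducks as a bool."""

    def __init__(self, compute):
        self._compute = compute  # deferred computation, e.g. a query plan

    def __bool__(self):
        # Materialization happens only when Python needs a real bool,
        # e.g. in an `if` statement or an explicit bool(...) call.
        return bool(self._compute())


result = LazyBool(lambda: all([True, True, False]))  # nothing runs yet
if not result:  # __bool__ fires here and materializes the value
    print("column contains a False value")
```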

> Is this the intention of the design? I'm curious how this would work in lazy/distributed dataframes such as Daft, which have very different considerations in terms of when we want to run operations and when we want to collect results back to the client machine, both of which are potentially very expensive operations.

A lot of discussion and thought has been put towards lazy / asynchronous execution and making sure the API that is being built out doesn't force unnecessary or inefficient synchronization / materialization. If there are places that go against that spirit, we should generally treat them as mistakes and work towards correcting them.

jaychia (Author) commented Jul 7, 2023

Makes sense and good to know, thanks @kkraus14 for the clarification!

I did just find https://data-apis.org/dataframe-api/draft/purpose_and_scope.html#out-of-scope, which mentions that the following should not be baked into the API:

> Expectations on when the execution is happening (in an eager or lazy way)

It would be exciting if adopting such a standard could help us integrate more easily with downstream tooling (e.g. matplotlib, sklearn, etc.).

MarcoGorelli (Contributor) commented:

thanks @jaychia for getting in contact

I wasn't familiar with Daft, but it looks great, and I'm glad you're considering implementing the Standard!

Would you be interested in helping to shape the future direction? If so, it would be great to have someone from Daft take part.

> It would be exciting if adopting such a standard could help us integrate more easily with downstream tooling (e.g. matplotlib, sklearn, etc.).

that's the hope 🤞 don't know about matplotlib specifically, but sklearn seems very supportive, and one of their devs is trying this out

jaychia (Author) commented Jul 7, 2023

We'll definitely keep an eye on the project! We're a pretty small team, so dedicating people to building and maintaining this standard is quite a big hit to our project's velocity.

However, if there are significant benefits, then it would make sense to invest the resources or at least build some experimental code to enable downstream libraries such as sklearn to start experimenting with the API.

MarcoGorelli (Contributor) commented:

thanks for the issue

in the end, there are no separate eager and lazy APIs (separating them was suggested, but then walked back), but there is a persist method

we also now have a Scalar class in the standard, so that return values from calls like the one you showed can stay lazy or on-GPU, depending on the implementation
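For illustration, a hedged sketch of how persist and Scalar might fit together; the col accessor and the exact signatures are assumptions, not quotes from the standard:

```python
standard_df = df.__dataframe_standard__()  # still lazy
flag = standard_df.col("x").all()          # a standard Scalar: may stay lazy
                                           # or on-GPU, per implementation
standard_df = standard_df.persist()        # explicit materialization point
if bool(flag):                             # conversion forces a concrete value
    print("all values in 'x' are truthy")
```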

I think everything here has been addressed, so I'm closing this, but please do let me know if I've missed something or if you'd like something else addressed
