Question: Lazy vs eager APIs #195

Closed
jaychia opened this issue Jul 6, 2023 · 5 comments

Comments

jaychia commented Jul 6, 2023

Hi! I am one of the maintainers of the Daft dataframe library (https://github.com/Eventual-Inc/Daft).

Daft is both distributed and lazy - I'm wondering if you foresee any complications because of this?

Having read through the proposed API, it seems that a user's flow might be as follows (sketched in code below):

  1. Create a Daft dataframe (e.g. df.read_parquet(...))
  2. Cast to the dataframe-API standard (standard_df = df.__dataframe_standard__())
  3. Call a bunch of standard APIs (standard_df = standard_df.insert(...))
  4. Cast back to a Daft dataframe for execution (df = standard_df.dataframe.collect())
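A minimal sketch of that flow, using hypothetical names lifted from the steps above (neither the Daft entry points nor the standard's methods are confirmed here):

```python
import daft  # assuming Daft is installed; entry points are illustrative

df = daft.read_parquet("data/*.parquet")   # 1. build a lazy Daft dataframe
standard_df = df.__dataframe_standard__()  # 2. enter the standard API
standard_df = standard_df.insert(...)      # 3. standard calls (arguments
                                           #    elided, as in the step above)
df = standard_df.dataframe                 # 4. unwrap back to Daft...
df.collect()                               #    ...and trigger execution
```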

Additionally, Daft has a slightly different concept of a "column". Similar to PySpark and Polars, we have the concept of Expressions; calling col("x").all(), for example, would likely not return a Boolean (as expected by the eagerly executed column API) but instead a new "scalar value" column object.
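For contrast, a hedged sketch of the eager vs. lazy behaviour described above (whether Daft exposes .all() on expressions exactly like this is an assumption taken from the example):

```python
# Eager column API: df.col("x").all() -> a plain Python bool, computed now.
# Lazy expression API (Daft-style):
from daft import col  # Daft's expression constructor

expr = col("x").all()  # builds a deferred expression node; no data is read
                       # and no Python bool exists at this point
print(type(expr))      # an Expression / "scalar value" column, not bool
```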

Is this the intention of the design? I'm curious how this would work in lazy/distributed dataframes such as Daft, which have very different considerations in terms of when we want to run operations and when we want to collect results back to the client machine, both of which are potentially very expensive operations.

kkraus14 (Collaborator) commented Jul 7, 2023

> Similar to PySpark and Polars, we have the concept of Expressions; calling col("x").all(), for example, would likely not return a Boolean (as expected by the eagerly executed column API) but instead a new "scalar value" column object.

The idea is that libraries should be able to implement a duck-typed Boolean, which would let them return their own "scalar value" objects that could be lazy, on the GPU, or specially implemented in any other way relevant to the implementing library, while still guaranteeing that they can be used as normal Python scalar values when needed.
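As an illustration only (this LazyBool class is hypothetical, not part of the standard), a duck-typed Boolean can defer its computation until Python forces a concrete truth value:

```python
class LazyBool:
    """A stand-in for a lazy/GPU-backed scalar that ducks as a bool."""

    def __init__(self, compute):
        self._compute = compute  # deferred computation, e.g. a query plan

    def __bool__(self):
        # Materialization happens only when Python needs a real bool,
        # e.g. in an `if` statement or an explicit bool(...) call.
        return bool(self._compute())


result = LazyBool(lambda: all([True, True, False]))  # nothing runs yet
if not result:  # __bool__ fires here and materializes the value
    print("column contains a False value")
```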

> Is this the intention of the design? I'm curious how this would work in lazy/distributed dataframes such as Daft, which have very different considerations in terms of when we want to run operations and when we want to collect results back to the client machine, both of which are potentially very expensive operations.

A lot of discussion and thought has been put towards lazy / asynchronous execution and making sure the API that is being built out doesn't force unnecessary or inefficient synchronization / materialization. If there are places that go against that spirit, we should generally treat them as mistakes and work towards correcting them.

jaychia (Author) commented Jul 7, 2023

Makes sense and good to know, thanks @kkraus14 for the clarification!

I did just find https://data-apis.org/dataframe-api/draft/purpose_and_scope.html#out-of-scope, which mentions that the following should not be baked into the API:

> Expectations on when the execution is happening (in an eager or lazy way)

It would be exciting if adopting such a standard could help us integrate more easily with downstream tooling (e.g. matplotlib, sklearn, etc.).

MarcoGorelli (Contributor) commented:

thanks @jaychia for getting in contact

I wasn't familiar with Daft, but it looks great, and I'm glad you're considering implementing the Standard!

Would you be interested in helping to shape the future direction? If so, it would be great to have someone from Daft take part.

> It would be exciting if adopting such a standard could help us integrate more easily with downstream tooling (e.g. matplotlib, sklearn, etc.).

that's the hope 🤞 don't know about matplotlib specifically, but sklearn seems very supportive, and one of their devs is trying this out

jaychia (Author) commented Jul 7, 2023

We'll definitely keep an eye on the project! We're a pretty small team, so dedicating people to building and maintaining this standard is quite a big hit to our project's velocity.

However, if there are significant benefits, then it would make sense to invest the resources or at least build some experimental code to enable downstream libraries such as sklearn to start experimenting with the API.

MarcoGorelli (Contributor) commented:

thanks for the issue

in the end, there are no separate eager and lazy APIs (separating them was suggested, but then walked back), but there is a persist method

we also now have a Scalar class in the standard, so that return values from calls like the one you showed can stay lazy or on-GPU, depending on the implementation
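For illustration, a hedged sketch of how persist and Scalar might fit together; the col accessor and the exact signatures are assumptions, not quotes from the standard:

```python
standard_df = df.__dataframe_standard__()  # still lazy
flag = standard_df.col("x").all()          # a standard Scalar: may stay lazy
                                           # or on-GPU, per implementation
standard_df = standard_df.persist()        # explicit materialization point
if bool(flag):                             # conversion forces a concrete value
    print("all values in 'x' are truthy")
```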

I think everything here has been addressed, so I'm closing this, but please do let me know if I've missed something or if you'd like something else addressed
