Question: Lazy vs eager APIs #195
The idea is that libraries should be able to implement a duck-typed Boolean which would allow libraries to return their own "scalar value" objects that could be lazy, on the GPU, or specially implemented in any other way relevant for the implementing library, but would guarantee the ability to use them as normal Python scalar values as needed.
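A minimal sketch of what such a duck-typed Boolean might look like. This is purely illustrative: the names `LazyBool` and `compute` are hypothetical and not part of the Standard or any library; the point is only that materialization can be deferred until Python actually demands a `bool`.

```python
class LazyBool:
    """A lazily evaluated boolean scalar a library could return.

    `compute` is a zero-argument callable that materializes the value
    (e.g. triggers a GPU sync or a distributed collect).
    """

    def __init__(self, compute):
        self._compute = compute
        self._value = None  # not yet materialized

    def __bool__(self):
        # Materialization happens only when Python needs a real bool,
        # e.g. in an `if` statement.
        if self._value is None:
            self._value = bool(self._compute())
        return self._value


result = LazyBool(lambda: all(x > 0 for x in [1, 2, 3]))
# No computation has happened yet; `if result:` forces it.
if result:
    print("all positive")
```

Because `__bool__` is the only synchronization point, a library can keep the scalar on-device or unevaluated for as long as the user treats it opaquely.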
A lot of discussion and thought has been put towards lazy / asynchronous execution and making sure the API being built out doesn't force unnecessary or inefficient synchronization / materialization. If there are places that go against that spirit, we should generally treat them as mistakes and work towards correcting them.
Makes sense and good to know, thanks @kkraus14 for the clarification! I did just find https://data-apis.org/dataframe-api/draft/purpose_and_scope.html#out-of-scope which mentions that the following should not be baked into the API.
It would be exciting if adopting such a standard can help us integrate more easily with downstream tooling (e.g. matplotlib, sklearn etc).
thanks @jaychia for getting in contact. I wasn't familiar with Daft, but it looks great, and I'm glad you're considering implementing the Standard! Would you be interested in helping to shape the future direction? If so, it would be great to have someone from Daft take part
that's the hope 🤞 I don't know about matplotlib specifically, but the sklearn devs seem very supportive, and one of them is trying this out
We'll definitely keep an eye on the project! We're a pretty small team, so dedicating people to building and maintaining this standard is quite a big hit to our project's velocity. However, if there are significant benefits, then it would make sense to invest the resources or at least build some experimental code to enable downstream libraries such as sklearn to start experimenting with the API.
thanks for the issue. In the end, there are no separate eager and lazy APIs (separating them was suggested, but then walked back on), but there is a […], and we also now have a […]. I think everything here's been addressed then, so closing, but please do let me know if I've missed something or you'd like something else addressed
Hi! I am one of the maintainers of the https://github.com/Eventual-Inc/Daft dataframe library.
Daft is both distributed and lazy - I'm wondering if you foresee any complications because of this?
Having read through the proposed API, it seems that a user's flow might be as follows:
```python
df.read_parquet(...)
standard_df = df.__dataframe_standard__()
standard_df = standard_df.insert(...)
df = standard_df.dataframe.collect()
```
Additionally, Daft also has a slightly different concept of a "column". Similar to PySpark and Polars, we have the concept of Expressions, but for example calling:
```python
col("x").all()
```

would likely not return a Boolean (as expected by the column API, which is eagerly executed); instead it would return a new "scalar value" column object.

Is this the intention of the design? I'm curious how this would work in lazy/distributed dataframes such as Daft, which have very different considerations in terms of when we want to run operations and when we want to collect results back to the client machine, both of which are potentially very expensive operations.
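To make the concern above concrete, here is a toy illustration of the expression-based model being described. This is not Daft's actual implementation: `Expr`, `col`, and `evaluate` are made-up names, and the sketch only shows how `.all()` on a lazy column could return another expression node rather than a Python `bool`, with evaluation deferred to an explicit materialization step.

```python
class Expr:
    """A toy expression node; nothing executes until evaluate() is called."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def all(self):
        # Returns a new "scalar value" expression, not True/False.
        return Expr("all", self)

    def evaluate(self, data):
        # Materialization happens only here, e.g. at collect() time.
        if self.op == "col":
            return data[self.args[0]]
        if self.op == "all":
            return all(self.args[0].evaluate(data))
        raise ValueError(f"unknown op: {self.op}")


def col(name):
    return Expr("col", name)


expr = col("x").all()
# `expr` is still lazy; no data has been touched yet.
result = expr.evaluate({"x": [True, True, True]})
```

Under this model, calling `bool(expr)` eagerly would be exactly the kind of forced synchronization the duck-typed scalar design tries to make optional.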