DataFrame.collect() for lazy dataframes? #120
Comments
I think the answer here is that the API is fully compatible with implementations that use lazy evaluation, and that it should not have specific syntax. We should probably document that.

**Support for lazy evaluation**

We have to design all APIs to support lazy evaluation, and I think so far there's nothing in the draft API that's incompatible with that. It's basically the same as in the array API, quoting from here:

> The same applies to software environments: it must be possible to create an array library adhering to this standard that runs efficiently independent of what compilers, build-time or run-time execution environment, or distribution and install method is employed. Parallel execution, JIT compilation, and delayed (lazy) evaluation must all be possible. As a design rule, the syntax and semantics of the API must be independent of execution model.

Which by definition rules out something like:

**Syntax variations between libraries for lazy execution**

This is very non-uniform and would be hard to standardize anyway. E.g.:
**How this should work**

```python
import a_dataframe_lib

# Create a dataframe in lazy mode; may be library-specific I/O,
# or apply explicit syntax like @delayed
df = a_dataframe_lib.xxx

# Use standard API
....
df_out = ...

# Now we have, for example, `df_out` as the result of using the standard-compliant API.
# This can now be materialized by something like:
df_out.compute()  # Dask-specific
```
OK - thinking of seaborn as an example, it might:
By the time seaborn gets to the last step, the data would need to be materialised. But if we decide that
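The seaborn scenario could be sketched roughly as follows. This is a hypothetical illustration, not the standard's API: the helper name `materialise` and the `EagerFrame` class are made up here, and it assumes the `collect()`-returns-`self` convention discussed later in this thread.

```python
# Hypothetical sketch: a consumer library (seaborn-like) forces
# materialisation at its last step without knowing which dataframe
# library produced the object, provided eager frames implement
# collect() as a no-op that returns self.

def materialise(df):
    """Return an eager dataframe, computing lazy ones if needed."""
    collect = getattr(df, "collect", None)
    return collect() if callable(collect) else df

class EagerFrame:
    def collect(self):
        return self  # already materialised, nothing to compute

eager = EagerFrame()
# For an eager frame this is free: the same object comes back.
assert materialise(eager) is eager
```

The design point is that the consumer can call this unconditionally; only lazy implementations pay a cost.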
That sounds about right to me.
Agreed, depending on the semantics of the
thanks all, closing then
re-opening as scikit-learn have mentioned that they would like to collect separately from converting to ndarray
Can we have a specific use case here? Even a single code snippet would help. I don't think it's good practice to trigger
Thanks, let's discuss tomorrow. I'm realising that there's far, far more to the topic of lazy dataframes; I'll add it to the agenda
closing as this discussion has effectively moved to #224
Revisiting this, I think it's sufficiently different from #224 that it needs to stay open
I no longer stand by this; it would potentially result in double computation, e.g.:

```python
mask = df.get_column_by_name('flag') == 'train'
x_train = df.get_rows_by_mask(mask)
x_test = df.get_rows_by_mask(~mask)
my_fancy_algorithm.fit(x_train.to_array_object(), x_test.to_array_object())
```

In fact, Dask devs have told me that their users sometimes accidentally trigger compute twice, because some of their methods do that under-the-hood for them. I'm against anything other than an explicit
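The double-computation hazard can be demonstrated with a toy lazy frame. The method names mirror the ones used in this thread, but the implementation is entirely hypothetical; a counter shows the upstream plan running once per implicit conversion:

```python
# Toy sketch (hypothetical implementation) of the double-computation
# hazard: if to_array_object() implicitly computes, building x_train
# and x_test from the same lazy frame evaluates the shared plan twice.

class LazyFrame:
    n_computes = 0  # class-level counter, for demonstration only

    def __init__(self, plan):
        self.plan = plan  # zero-argument callable producing a list

    def get_rows_by_mask(self, keep):
        # Lazily compose: filtering extends the plan, nothing runs yet.
        return LazyFrame(lambda: [r for r, k in zip(self.plan(), keep) if k])

    def to_array_object(self):
        # Implicit materialisation: every call re-runs the whole plan.
        LazyFrame.n_computes += 1
        return self.plan()

def expensive_source():
    # Stands in for costly I/O or computation upstream of the split.
    return [1, 2, 3, 4]

df = LazyFrame(expensive_source)
mask = [True, True, False, False]
x_train = df.get_rows_by_mask(mask).to_array_object()
x_test = df.get_rows_by_mask([not m for m in mask]).to_array_object()

assert x_train == [1, 2] and x_test == [3, 4]
assert LazyFrame.n_computes == 2  # the shared upstream plan ran twice
```

An explicit materialisation step before the split would let the user pay the upstream cost once.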
I'll just add this to the pandas/polars implementations, and then users can do
if they need to force materialisation. Other suggestions are welcome - I tried with #249 but there wasn't support
Some DataFrames have a lazy API, which the standard should probably support. Should we add a `collect` method, which for eager libraries would just return `self`, and for lazy ones would materialise the dataframe?
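A minimal sketch of the proposal, assuming nothing beyond what is described in this issue (both class names and the plan-as-callable representation are hypothetical):

```python
class EagerFrame:
    """Eager: data is already in memory, so collect() is a no-op."""

    def __init__(self, data):
        self.data = data

    def collect(self):
        return self  # returns self, as proposed for eager libraries

class LazyFrame:
    """Lazy: holds a deferred plan; collect() materialises it."""

    def __init__(self, plan):
        self.plan = plan  # zero-argument callable producing the data

    def collect(self):
        return EagerFrame(self.plan())

eager = EagerFrame([1, 2, 3])
assert eager.collect() is eager          # eager: no copy, no work

lazy = LazyFrame(lambda: [1, 2, 3])
assert lazy.collect().data == [1, 2, 3]  # lazy: plan evaluated once
```

With this shape, code written against the standard can call `collect()` unconditionally, and only lazy implementations do any work.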