
DataFrame.collect() for lazy dataframes? #120

Closed
MarcoGorelli opened this issue Mar 30, 2023 · 11 comments
Comments

@MarcoGorelli
Contributor

Some DataFrames have a lazy api, which the standard should probably support

Should we add a collect method, which for eager libraries would just return self and for lazy ones would materialise the dataframe?

@rgommers
Member

I think the answer here is that the API is fully compatible with implementations that use lazy evaluation, and that it should not have execution-specific syntax. We should probably document that.

Support for lazy evaluation

We have to design all APIs to support lazy evaluation, and I think so far there's nothing in the draft API that's incompatible with that. It's basically the same as in the array API; quoting from there: "The same applies to software environments: it must be possible to create an array library adhering to this standard that runs efficiently independent of what compilers, build-time or run-time execution environment, or distribution and install method is employed. Parallel execution, JIT compilation, and delayed (lazy) evaluation must all be possible."

As a design rule, the syntax and semantics of the API must be independent of execution model, which by definition rules out something like .collect(). Execution of code written against the API, when backed by a lazy implementation, can probably be triggered by (a) library-specific implementation details, where the implementation must break its computation graph for some reason (e.g. full Python branching with if-else is not supported), (b) interop with another library, like when __dataframe__ is called, and (c) the user explicitly asking for an execution step with .collect() or similar.

Syntax variations between libraries for lazy execution

This is very non-uniform and would be hard to standardize anyway. E.g.:

  • Dask is always lazy and has a .compute() method to trigger execution
  • Vaex offers both eager and lazy execution (see here), uses .execute() to trigger and delay=True or @delayed to choose the lazy mode over the eager one
  • Polars also offers both eager and lazy (see here), and uses .collect() to trigger execution and .lazy() or lazy-specific I/O functions like pl.scan_csv to choose lazy mode.
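Despite the naming differences, all three libraries follow the same shape: lazy operations build up a plan, and one explicit method call runs it. A minimal pure-Python sketch of that eager/lazy split (hypothetical toy classes for illustration, not any real library's API):

```python
# Toy illustration of the eager/lazy split: EagerFrame computes
# immediately; LazyFrame records operations and only runs them when
# .collect() is called (Polars-style naming). All names hypothetical.

class EagerFrame:
    def __init__(self, data):
        self.data = data

    def select_positive(self):
        # Eager: the work happens right away.
        return EagerFrame([x for x in self.data if x > 0])


class LazyFrame:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops  # recorded plan, not yet executed

    def select_positive(self):
        # Lazy: just append the operation to the plan.
        return LazyFrame(self.data, self.ops + ("select_positive",))

    def collect(self):
        # Execution is triggered explicitly, as with Dask's .compute()
        # or Polars' .collect().
        data = self.data
        for op in self.ops:
            if op == "select_positive":
                data = [x for x in data if x > 0]
        return EagerFrame(data)


eager = EagerFrame([-1, 2, -3, 4]).select_positive()
lazy = LazyFrame([-1, 2, -3, 4]).select_positive()
print(eager.data)           # [2, 4]
print(lazy.collect().data)  # [2, 4]
```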

How this should work

import a_dataframe_lib

# Create a dataframe in lazy mode; this may use library-specific I/O
# or explicit syntax like @delayed
df = a_dataframe_lib.xxx
# Use the standard API
....
df_out = ...

# `df_out` is now the result of using the standard-compliant API.
# It can be materialized by something like:
df_out.compute()  # Dask-specific

@MarcoGorelli
Contributor Author

OK - thinking of seaborn as an example, it might:

  • take a DataFrame df (which could be lazy or not)
  • do some operations, like df.groupby([col]).mean()
  • plot the data, like ax.plot(df.get_column_by_name('x').to_array(), df.get_column_by_name('y').to_array())

By the time seaborn gets to the last step, the data would need to be materialised. But if we decide that to_array should return something which can be iterated over, then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail.

@rgommers
Member

then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail

That sounds about right to me.

@kkraus14
Collaborator

Agreed, depending on the semantics of the to_array call and what guarantees it provides, if materialized data is required then the framework implementing the API would be responsible for handling that in its to_array implementation.

@MarcoGorelli
Contributor Author

thanks all, closing then

@MarcoGorelli
Contributor Author

re-opening as scikit-learn have mentioned that they would like to collect separately from converting to ndarray

@rgommers
Member

rgommers commented Aug 2, 2023

re-opening as scikit-learn have mentioned that they would like to collect separately from converting to ndarray

Can we have a specific use case here? Even a single code snippet would help. I don't think it's good practice to trigger .collect() explicitly in some random place in another library. Lazy computations should stay lazy as long as possible.

@MarcoGorelli
Contributor Author

Thanks, let's discuss tomorrow. I'm realising that there's far, far more to the topic of lazy dataframes; I'll add it to the agenda.

@MarcoGorelli
Contributor Author

closing as this discussion has effectively moved to #224

@MarcoGorelli
Contributor Author

Revisiting this, I think it's sufficiently different from #224 that it needs to stay open

then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail

I no longer stand by this; it would potentially result in double computation, e.g.:

mask = df.get_column_by_name('flag') == 'train'
x_train = df.get_rows_by_mask(mask)
x_test = df.get_rows_by_mask(~mask)
my_fancy_algorithm.fit(x_train.to_array_object(), x_test.to_array_object())
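The hazard can be made concrete with a counter on a toy lazy frame (hypothetical classes for illustration only): each implicit materialization inside to_array_object re-runs the shared upstream plan.

```python
# Toy illustration of the double-computation hazard. All names are
# hypothetical. Each to_array_object() call implicitly re-runs the
# shared upstream plan, which we count here.

class ToyLazyFrame:
    def __init__(self, values, counter=None):
        self.values = values
        # Shared mutable counter so frames derived from the same
        # source count plan executions together.
        self.counter = counter if counter is not None else [0]

    def get_rows_by_mask(self, mask):
        # Filtering is eager in this toy; the point is only that the
        # derived frame shares its parent's (notional) lazy plan.
        rows = [v for v, m in zip(self.values, mask) if m]
        return ToyLazyFrame(rows, self.counter)

    def to_array_object(self):
        # Implicit materialization: the whole upstream plan runs again.
        self.counter[0] += 1
        return list(self.values)


df = ToyLazyFrame([1, 2, 3, 4])
mask = [True, False, True, False]
x_train = df.get_rows_by_mask(mask)
x_test = df.get_rows_by_mask([not m for m in mask])
x_train.to_array_object()
x_test.to_array_object()
print(df.counter[0])  # 2 -- the upstream pipeline executed twice
```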

In fact, Dask devs have told me that their users sometimes accidentally trigger compute twice, because some of their methods do that under-the-hood for them.

I'm against anything other than an explicit .collect (or compute) triggering computation. If collect can't make its way into the standard, then OK, so be it: lazy computations will stay lazy and eager computations will stay eager.

@MarcoGorelli reopened this Aug 29, 2023
@MarcoGorelli
Contributor Author

I'll just add this to the pandas/polars implementations, and then users can do

if hasattr(df, 'collect'):
    df = df.collect()

if they need to force materialisation
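That check could also be wrapped in a small helper (hypothetical name; just sugar over the hasattr pattern above) so call sites stay one-liners whether the frame is eager or lazy:

```python
# Hypothetical convenience wrapper around the hasattr check above.

def maybe_collect(df):
    """Materialize df if it exposes a collect() method, else pass through."""
    return df.collect() if hasattr(df, "collect") else df


class FakeLazy:
    """Toy stand-in for a lazy dataframe."""
    def collect(self):
        return "materialized"


print(maybe_collect(FakeLazy()))       # materialized
print(maybe_collect("already eager"))  # already eager
```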

Other suggestions are welcome - I tried with #249 but there wasn't support
