
DataFrame.collect() for lazy dataframes? #120

Closed
MarcoGorelli opened this issue Mar 30, 2023 · 11 comments
Comments

@MarcoGorelli
Contributor

Some DataFrames have a lazy api, which the standard should probably support

Should we add a collect method, which for eager libraries would just return self and for lazy ones would materialise the dataframe?

@rgommers
Member

I think the answer here is that the API is fully compatible with implementations that use lazy evaluation, and that it should not have execution-specific syntax. We should probably document that.

Support for lazy evaluation

We have to design all APIs to support lazy evaluation, and I think so far there's nothing in the draft API that's incompatible with that. It's basically the same as in the array API; quoting from there: "The same applies to software environments: it must be possible to create an array library adhering to this standard that runs efficiently independent of what compilers, build-time or run-time execution environment, or distribution and install method is employed. Parallel execution, JIT compilation, and delayed (lazy) evaluation must all be possible."

As a design rule, the syntax and semantics of the API must be independent of execution model, which by definition rules out something like .collect(). Execution of code written against the API, when backed by a lazy implementation, can probably be triggered by (a) library-specific implementation details, where the implementation must break its computation graph for some reason (e.g. full Python branching with if-else is not supported), (b) interop with another library, like when __dataframe__ is called, and (c) the user explicitly asking for an execution step with .collect() or similar.

Syntax variations between libraries for lazy execution

This is very non-uniform and would be hard to standardize anyway. E.g.:

  • Dask is always lazy and has a .compute() method to trigger execution
  • Vaex offers both eager and lazy execution (see here), uses .execute() to trigger and delay=True or @delayed to choose the lazy mode over the eager one
  • Polars also offers both eager and lazy (see here), and uses .collect() to trigger execution and .lazy() or lazy-specific I/O functions like pl.scan_csv to choose lazy mode.
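Despite the naming differences, all three libraries follow the same shape: lazy operations build up a plan, and one explicit method call runs it. A minimal pure-Python sketch of that eager/lazy split (hypothetical toy classes for illustration, not any real library's API):

```python
# Toy illustration of the eager/lazy split: EagerFrame computes
# immediately; LazyFrame records operations and only runs them when
# .collect() is called (Polars-style naming). All names hypothetical.

class EagerFrame:
    def __init__(self, data):
        self.data = data

    def select_positive(self):
        # Eager: the work happens right away.
        return EagerFrame([x for x in self.data if x > 0])


class LazyFrame:
    def __init__(self, data, ops=()):
        self.data = data
        self.ops = ops  # recorded plan, not yet executed

    def select_positive(self):
        # Lazy: just append the operation to the plan.
        return LazyFrame(self.data, self.ops + ("select_positive",))

    def collect(self):
        # Execution is triggered explicitly, as with Dask's .compute()
        # or Polars' .collect().
        data = self.data
        for op in self.ops:
            if op == "select_positive":
                data = [x for x in data if x > 0]
        return EagerFrame(data)


eager = EagerFrame([-1, 2, -3, 4]).select_positive()
lazy = LazyFrame([-1, 2, -3, 4]).select_positive()
print(eager.data)           # [2, 4]
print(lazy.collect().data)  # [2, 4]
```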

How this should work

import a_dataframe_lib

# Create a dataframe in lazy mode; this may use library-specific I/O
# or explicit syntax like @delayed
df = a_dataframe_lib.xxx
# Use the standard API
....
df_out = ...

# `df_out` is now the result of using the standard-compliant API.
# It can be materialized by something like:
df_out.compute()  # Dask-specific

@MarcoGorelli
Contributor Author

OK - thinking of seaborn as an example, it might:

  • take a DataFrame df (which could be lazy or not)
  • do some operations, like df.groupby([col]).mean()
  • plot the data, like ax.plot(df.get_column_by_name('x').to_array(), df.get_column_by_name('y').to_array())

By the time seaborn gets to the last step, the data would need to be materialised. But if we decide that to_array should return something which can be iterated over, then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail.

@rgommers
Member

then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail

That sounds about right to me.

@kkraus14
Collaborator

Agreed, depending on the semantics of the to_array call and what guarantees it provides, if materialized data is required then the framework implementing the API would be responsible for handling that in its to_array implementation.

@MarcoGorelli
Contributor Author

thanks all, closing then

@MarcoGorelli
Contributor Author

re-opening as scikit-learn have mentioned that they would like to collect separately from converting to ndarray

@rgommers
Member

rgommers commented Aug 2, 2023

re-opening as scikit-learn have mentioned that they would like to collect separately from converting to ndarray

Can we have a specific use case here? Even a single code snippet would help. I don't think it's good practice to trigger .collect() explicitly in some random place in another library. Lazy computations should stay lazy as long as possible.

@MarcoGorelli
Contributor Author

Thanks, let's discuss tomorrow. I'm realising that there's far, far more to the topic of lazy dataframes; I'll add it to the agenda.

@MarcoGorelli
Contributor Author

closing as this discussion has effectively moved to #224

@MarcoGorelli
Contributor Author

Revisiting this, I think it's sufficiently different from #224 that it needs to stay open

then perhaps the call to .collect can be taken care of inside to_array and can be seen as an implementation detail

I no longer stand by this; it would potentially result in double computation, e.g.:

mask = df.get_column_by_name('flag') == 'train'
x_train = df.get_rows_by_mask(mask)
x_test = df.get_rows_by_mask(~mask)
my_fancy_algorithm.fit(x_train.to_array_object(), x_test.to_array_object())
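The hazard can be made concrete with a counter on a toy lazy frame (hypothetical classes for illustration only): each implicit materialization inside to_array_object re-runs the shared upstream plan.

```python
# Toy illustration of the double-computation hazard. All names are
# hypothetical. Each to_array_object() call implicitly re-runs the
# shared upstream plan, which we count here.

class ToyLazyFrame:
    def __init__(self, values, counter=None):
        self.values = values
        # Shared mutable counter so frames derived from the same
        # source count plan executions together.
        self.counter = counter if counter is not None else [0]

    def get_rows_by_mask(self, mask):
        # Filtering is eager in this toy; the point is only that the
        # derived frame shares its parent's (notional) lazy plan.
        rows = [v for v, m in zip(self.values, mask) if m]
        return ToyLazyFrame(rows, self.counter)

    def to_array_object(self):
        # Implicit materialization: the whole upstream plan runs again.
        self.counter[0] += 1
        return list(self.values)


df = ToyLazyFrame([1, 2, 3, 4])
mask = [True, False, True, False]
x_train = df.get_rows_by_mask(mask)
x_test = df.get_rows_by_mask([not m for m in mask])
x_train.to_array_object()
x_test.to_array_object()
print(df.counter[0])  # 2 -- the upstream pipeline executed twice
```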

In fact, Dask devs have told me that their users sometimes accidentally trigger compute twice, because some of their methods do that under-the-hood for them.

I'm against anything other than an explicit .collect (or compute) triggering computation. If collect can't make its way into the standard, then OK, so be it: lazy computations will stay lazy and eager computations will stay eager.

@MarcoGorelli reopened this Aug 29, 2023
@MarcoGorelli
Contributor Author

I'll just add this to the pandas/polars implementations, and then users can do

if hasattr(df, 'collect'):
    df = df.collect()

if they need to force materialisation
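That check could also be wrapped in a small helper (hypothetical name; just sugar over the hasattr pattern above) so call sites stay one-liners whether the frame is eager or lazy:

```python
# Hypothetical convenience wrapper around the hasattr check above.

def maybe_collect(df):
    """Materialize df if it exposes a collect() method, else pass through."""
    return df.collect() if hasattr(df, "collect") else df


class FakeLazy:
    """Toy stand-in for a lazy dataframe."""
    def collect(self):
        return "materialized"


print(maybe_collect(FakeLazy()))       # materialized
print(maybe_collect("already eager"))  # already eager
```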

Other suggestions are welcome - I tried with #249 but there wasn't support
