
col expression to replace callables as far as possible #386

Open
phofl opened this issue Nov 6, 2023 · 4 comments

Comments

@phofl
Collaborator

phofl commented Nov 6, 2023

PySpark (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html) and Polars (https://pola-rs.github.io/polars/py-polars/html/reference/expressions/col.html) both have column expressions that enable users to use expressions instead of callables (FYI: we intend to add something similar in pandas). Short example:

import pandas as pd
from dask_expr import from_pandas

df = from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1}))
df.groupby("a").transform(lambda x: x.b / x.c.sum())

This is terrible from an optimization perspective: the lambda is opaque, so it tells us nothing about which columns are actually used. We could drop columns "d" and "e" before computing, but with a callable we have no way of knowing that they are unused.

If we add a col expression (an Expression-like object), we can rewrite this as follows:

import pandas as pd
import dask_expr as dx
from dask_expr import from_pandas

df = from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1}))
df.groupby("a").transform(dx.col("b") / dx.col("c").sum())

We could then look at the expression that replaces the callable and figure out how we can optimise our expression.

Not totally sure what the API should look like, since we have to inject the actual self.frame into the column expression dx.col("b") / dx.col("c").sum() inside the transform step.
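One possible shape, purely as a sketch (the `Col`, `Agg`, and `BinOp` classes below are hypothetical and not dask-expr's actual internals): a col expression records which columns it references as it is built, so an optimizer can prune the unused columns before evaluating.

```python
import pandas as pd


class Expr:
    """Hypothetical base class for column expressions (sketch only)."""

    def __truediv__(self, other):
        return BinOp("/", self, other)

    def sum(self):
        return Agg("sum", self)


class Col(Expr):
    """References a single column by name."""

    def __init__(self, name):
        self.name = name

    def columns(self):
        return {self.name}

    def evaluate(self, df):
        return df[self.name]


class Agg(Expr):
    """An aggregation (e.g. sum) applied to a sub-expression."""

    def __init__(self, op, operand):
        self.op, self.operand = op, operand

    def columns(self):
        return self.operand.columns()

    def evaluate(self, df):
        return getattr(self.operand.evaluate(df), self.op)()


class BinOp(Expr):
    """A binary operation combining two sub-expressions."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def columns(self):
        return self.left.columns() | self.right.columns()

    def evaluate(self, df):
        if self.op == "/":
            return self.left.evaluate(df) / self.right.evaluate(df)
        raise NotImplementedError(self.op)


def col(name):
    return Col(name)


expr = col("b") / col("c").sum()
# Unlike a lambda, the expression reveals its column dependencies,
# so columns "d" and "e" could be dropped before the computation runs.
assert expr.columns() == {"b", "c"}

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1})
result = expr.evaluate(df)  # df["b"] / df["c"].sum()
```

The key property is that `expr.columns()` is available without executing anything, which is exactly what the callable hides.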

cc @mrocklin @rjzamora

@fjetter
Member

fjetter commented Nov 6, 2023

I'm fine with this, particularly if pandas is introducing the same thing. The one concern I have is that this is still a little unfamiliar for people.
FWIW pyarrow.compute is doing the same thing as well (with pc.field / https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html)

@rjzamora
Member

rjzamora commented Nov 6, 2023

Do you think a similar approach can be used to delay "partition freezing"?

For example, maybe dx.npartitions(ref) can be used to mean: "use the number of partitions of the 'target' (ref) collection/Expr".
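A minimal sketch of that idea (the `NPartitions` class and `FakeCollection` stand-in are hypothetical, just to illustrate late binding): instead of freezing the partition count as an integer at graph-construction time, a placeholder holds a reference to the target and resolves it only when asked.

```python
class NPartitions:
    """Hypothetical deferred placeholder: resolves to ref.npartitions
    only when the optimizer materializes the expression."""

    def __init__(self, ref):
        self.ref = ref

    def resolve(self):
        # Looked up at resolution time, not at construction time.
        return self.ref.npartitions


class FakeCollection:
    """Stand-in for a dask collection with a mutable partition count."""

    def __init__(self, npartitions):
        self.npartitions = npartitions


target = FakeCollection(npartitions=4)
n = NPartitions(target)
target.npartitions = 8  # e.g. a repartition happened during optimization
resolved = n.resolve()  # resolves late: sees 8, not the frozen 4
```

The point is only that resolution happens after optimization has settled the target's partitioning, which is what "delaying partition freezing" would require.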

@phofl
Collaborator Author

phofl commented Nov 6, 2023

Yep, I think that is a good idea. In this case it will probably end up being equivalent to a callable, but formalizing it makes sense.

@gwvr

gwvr commented Nov 29, 2023

Column expressions would also enable much cleaner use of df.assign, when assigning new columns that reference columns within the df.

With pandas, currently you can choose between readability:

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
df = df.assign(f = df['a'] + df['b'])

and method chaining compatibility:

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .assign(f = lambda x: x['a'] + x['b'])
)

Another option is to .pipe to a function that performs the assignment and returns the dataframe, but that doesn't aid readability and I wouldn't expect it to play nice with dask.
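For reference, the `.pipe` variant looks like this (the helper name `add_f` is just for illustration):

```python
import pandas as pd


def add_f(df):
    """Helper that performs the assignment and returns the frame."""
    return df.assign(f=df["a"] + df["b"])


df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .pipe(add_f)
)
```

It chains, but the logic now lives in a named function away from the chain, which is the readability cost mentioned above.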

Column expressions would allow something like:

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .assign(f = dx.col("a") + dx.col("b"))
)

I think this is a lot easier to grok than a lambda function.
