
col expression to replace callables as far as possible #386

Open
phofl opened this issue Nov 6, 2023 · 4 comments

Comments

@phofl
Collaborator

phofl commented Nov 6, 2023

PySpark (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html) and Polars (https://pola-rs.github.io/polars/py-polars/html/reference/expressions/col.html) both have column expressions that enable users to use expressions instead of callables (FYI: we intend to add something similar in pandas). Short example:

import pandas as pd
from dask_expr import from_pandas

df = from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1}))
df.groupby("a").transform(lambda x: x.b / x.c.sum())

This is terrible from an optimization perspective: the lambda is opaque, so it tells us nothing about which columns are actually used. We could drop columns "d" and "e" before computing, but with a callable we have no way of knowing that they are unused.

If we add a col expression (an Expression-like object), we can rewrite this as follows:

import pandas as pd
import dask_expr as dx
from dask_expr import from_pandas

df = from_pandas(pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1}))
df.groupby("a").transform(dx.col("b") / dx.col("c").sum())

We could then look at the expression that replaces the callable and figure out how we can optimise our expression.

Not totally sure what the API should look like, since we have to inject the actual self.frame into the column expression dx.col("b") / dx.col("c").sum() inside the transform step.
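One possible shape, purely as a sketch (the `Col`, `Agg`, and `BinOp` classes below are hypothetical and not dask-expr's actual internals): a col expression records which columns it references as it is built, so an optimizer can prune the unused columns before evaluating.

```python
import pandas as pd


class Expr:
    """Hypothetical base class for column expressions (sketch only)."""

    def __truediv__(self, other):
        return BinOp("/", self, other)

    def sum(self):
        return Agg("sum", self)


class Col(Expr):
    """References a single column by name."""

    def __init__(self, name):
        self.name = name

    def columns(self):
        return {self.name}

    def evaluate(self, df):
        return df[self.name]


class Agg(Expr):
    """An aggregation (e.g. sum) applied to a sub-expression."""

    def __init__(self, op, operand):
        self.op, self.operand = op, operand

    def columns(self):
        return self.operand.columns()

    def evaluate(self, df):
        return getattr(self.operand.evaluate(df), self.op)()


class BinOp(Expr):
    """A binary operation combining two sub-expressions."""

    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def columns(self):
        return self.left.columns() | self.right.columns()

    def evaluate(self, df):
        if self.op == "/":
            return self.left.evaluate(df) / self.right.evaluate(df)
        raise NotImplementedError(self.op)


def col(name):
    return Col(name)


expr = col("b") / col("c").sum()
# Unlike a lambda, the expression reveals its column dependencies,
# so columns "d" and "e" could be dropped before the computation runs.
assert expr.columns() == {"b", "c"}

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1})
result = expr.evaluate(df)  # df["b"] / df["c"].sum()
```

The key property is that `expr.columns()` is available without executing anything, which is exactly what the callable hides.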

cc @mrocklin @rjzamora

@fjetter
Member

fjetter commented Nov 6, 2023

I'm fine with this, particularly if pandas is introducing the same thing. The one concern I have is that this is still a little unfamiliar for people.
FWIW pyarrow.compute is doing the same thing as well (with pc.field / https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Expression.html)

@rjzamora
Member

rjzamora commented Nov 6, 2023

Do you think a similar approach can be used to delay "partition freezing"?

For example, maybe dx.npartitions(ref) can be used to mean: "use the number of partitions of the 'target' (ref) collection/Expr".
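A minimal sketch of that idea (the `NPartitions` class and `FakeCollection` stand-in are hypothetical, just to illustrate late binding): instead of freezing the partition count as an integer at graph-construction time, a placeholder holds a reference to the target and resolves it only when asked.

```python
class NPartitions:
    """Hypothetical deferred placeholder: resolves to ref.npartitions
    only when the optimizer materializes the expression."""

    def __init__(self, ref):
        self.ref = ref

    def resolve(self):
        # Looked up at resolution time, not at construction time.
        return self.ref.npartitions


class FakeCollection:
    """Stand-in for a dask collection with a mutable partition count."""

    def __init__(self, npartitions):
        self.npartitions = npartitions


target = FakeCollection(npartitions=4)
n = NPartitions(target)
target.npartitions = 8  # e.g. a repartition happened during optimization
resolved = n.resolve()  # resolves late: sees 8, not the frozen 4
```

The point is only that resolution happens after optimization has settled the target's partitioning, which is what "delaying partition freezing" would require.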

@phofl
Collaborator Author

phofl commented Nov 6, 2023

Yep, I think that is a good idea. In this case it will probably end up being equivalent to a callable, but formalizing it makes sense.

@gwvr

gwvr commented Nov 29, 2023

Column expressions would also enable much cleaner use of df.assign, when assigning new columns that reference columns within the df.

With pandas, currently you can choose between readability:

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
df = df.assign(f = df['a'] + df['b'])

and method chaining compatibility:

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .assign(f = lambda x: x['a'] + x['b'])
)

Another option is to .pipe to a function that performs the assignment and returns the dataframe, but that doesn't aid readability and I wouldn't expect it to play nice with dask.
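For reference, the `.pipe` variant looks like this (the helper name `add_f` is just for illustration):

```python
import pandas as pd


def add_f(df):
    """Helper that performs the assignment and returns the frame."""
    return df.assign(f=df["a"] + df["b"])


df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .pipe(add_f)
)
```

It chains, but the logic now lives in a named function away from the chain, which is the readability cost mentioned above.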

Column expressions would allow something like:

df = (
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    .assign(f = dx.col("a") + dx.col("b"))
)

I think this is a lot easier to grok than a lambda function.
