col expression to replace callables as far as possible #386
Comments
I'm fine with this, particularly if pandas is introducing the same thing. The one concern I have is that this is still a little unfamiliar for people.
Do you think a similar approach can be used to delay "partition freezing"? For example, maybe
Yep, I think that is a good idea. In this case it will end up being equivalent to a callable, but formalizing this makes sense.
Column expressions would also enable much cleaner use of df.assign when assigning new columns that reference columns within the df. With pandas, currently you can choose between readability:

    df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
    df = df.assign(f = df['a'] + df['b'])

and method chaining compatibility:

    df = (
        pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
        .assign(f = lambda x: x['a'] + x['b'])
    )

Another option is to

Column expressions would allow something like:

    df = (
        pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": 1, "d": 1, "e": 1})
        .assign(f = dx.col("a") + dx.col("b"))
    )

I think this is a lot easier to grok than a lambda function.
PySpark (https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.col.html) and Polars (https://pola-rs.github.io/polars/py-polars/html/reference/expressions/col.html) both have column expressions that let users write expressions instead of callables (FYI: we intend to add something similar in pandas). Short example:
This is terrible from an optimization perspective, since the lambda doesn't tell us anything about what's going on in there. We could drop columns "d" and "e", but we have no way of knowing this.
If we add a col expression like Expression, we can rewrite this as follows:
We could then look at the expression that replaces the callable and figure out how we can optimise our expression.
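To make the idea concrete, here is a minimal sketch of what such an expression object could look like. It is not the proposed dask-expr API: the names `Expr`, `Col`, `Lit`, `BinOp`, `columns`, and `evaluate` are all hypothetical and chosen only for illustration. The point is that, unlike a lambda, the expression can report which columns it reads, and the actual frame is only supplied when the expression is evaluated.

```python
import operator


class Expr:
    """Hypothetical base class: a deferred computation over named columns."""

    def __add__(self, other):
        return BinOp(operator.add, self, wrap(other))

    def __truediv__(self, other):
        return BinOp(operator.truediv, self, wrap(other))

    def __call__(self, frame):
        # Lets the expression stand in wherever a callable is accepted.
        return self.evaluate(frame)


class Col(Expr):
    """Reference to a single column by name."""

    def __init__(self, name):
        self.name = name

    def columns(self):
        return {self.name}

    def evaluate(self, frame):
        # The frame is injected here, at evaluation time.
        return frame[self.name]


class Lit(Expr):
    """A plain Python value used inside an expression."""

    def __init__(self, value):
        self.value = value

    def columns(self):
        return set()

    def evaluate(self, frame):
        return self.value


class BinOp(Expr):
    """A binary operation between two sub-expressions."""

    def __init__(self, op, left, right):
        self.op = op
        self.left = left
        self.right = right

    def columns(self):
        # An optimizer could use this set to drop unused columns early.
        return self.left.columns() | self.right.columns()

    def evaluate(self, frame):
        return self.op(self.left.evaluate(frame), self.right.evaluate(frame))


def wrap(value):
    """Turn non-expression operands into literals."""
    return value if isinstance(value, Expr) else Lit(value)


def col(name):
    """Hypothetical entry point, mirroring the spelling dx.col("a")."""
    return Col(name)


expr = col("a") + col("b")
needed = expr.columns()  # only "a" and "b" are read; "c", "d", "e" are not

# Any mapping from column name to values works; a dict stands in for a frame.
row = {"a": 1, "b": 4, "c": 1, "d": 1, "e": 1}
result = expr.evaluate(row)
```

Because `evaluate` only indexes the frame by column name, the same sketch should also work when a pandas DataFrame is passed in place of the dict. The `__call__` method is a design choice rather than a requirement: pandas' `DataFrame.assign` calls any callable value with the frame, so an object like this could be handed to `assign` in the same place a lambda is used today.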
Not totally sure what the API should look like, since we have to inject the actual self.frame into the column expression dx.col("b") / dx.col("c").sum() inside of the transform step.

cc @mrocklin @rjzamora