-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Upgrade Datafusion 40 #771
Conversation
This required trait method was added upstream [0] and recommends to simply forward to `static_name`. [0]: apache/datafusion#10266
Upstream signatures were changed for the new new `AggregateBuilder` api [0]. This simply gets the code to work. We should better incorporate that API into `datafusion-python`. [0] apache/datafusion#10560
Builtin Count was removed upstream. TBD whether we want to re-implement `count_star` with new API. Ref: apache/datafusion#10893
… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917
The python wrapper now provides stddev_samp alias.
The functions now take a single expression instead of a Vec<_>. Ref: apache/datafusion#10930
`approx_percentile_cont` is now returning a DoubleArray instead of an IntArray. This may be a bug upstream; it requires further investigation.
This was changed upstream. Ref: apache/datafusion#10831
7791cac
to
1d1cd84
Compare
The alias list no longer includes the name of the function. Ref: apache/datafusion#10658
bc88926
to
774c056
Compare
`first_value` and `last_value` are currently failing and marked as xfail.
774c056
to
a38aa99
Compare
710cdc6
to
b6eee28
Compare
The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior. This allowed me to remove `marks=pytest.xfail` from the window tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#[pyfunction] | ||
pub fn approx_distinct(expression: PyExpr) -> PyExpr { | ||
functions_aggregate::expr_fn::approx_distinct::approx_distinct(expression.expr).into() | ||
} | ||
|
||
#[pyfunction] | ||
pub fn approx_median(expression: PyExpr, distinct: bool) -> PyResult<PyExpr> { | ||
let expr = functions_aggregate::expr_fn::approx_median(expression.expr); | ||
if distinct { | ||
Ok(expr.distinct().build()?.into()) | ||
} else { | ||
Ok(expr.into()) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels like these are a lot of work. Is it too difficult to instead update aggregate_function!
macro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
One of the first functions I went to port was first_value
, and I came away certain that we'd end up some Builder API around the aggregate / window functions. Additionally, the functions
currently expose different builder args.
So, I decided to go the explicit route instead of writing a macro_rules!
just to change it next release
(That's why I was excited to see you doing precisesly that Builder
API work upstream.)
def bit_and(arg: Expr, distinct: bool = False) -> Expr: | ||
"""Computes the bitwise AND of the argument.""" | ||
args = [arg.expr for arg in args] | ||
return Expr(f.bit_and(*args, distinct=distinct)) | ||
return Expr(f.bit_and(arg.expr, distinct=distinct)) | ||
|
||
|
||
def bit_or(*args: Expr, distinct: bool = False) -> Expr: | ||
def bit_or(arg: Expr, distinct: bool = False) -> Expr: | ||
"""Computes the bitwise OR of the argument.""" | ||
args = [arg.expr for arg in args] | ||
return Expr(f.bit_or(*args, distinct=distinct)) | ||
return Expr(f.bit_or(arg.expr, distinct=distinct)) | ||
|
||
|
||
def bit_xor(*args: Expr, distinct: bool = False) -> Expr: | ||
def bit_xor(arg: Expr, distinct: bool = False) -> Expr: | ||
"""Computes the bitwise XOR of the argument.""" | ||
args = [arg.expr for arg in args] | ||
return Expr(f.bit_xor(*args, distinct=distinct)) | ||
return Expr(f.bit_xor(arg.expr, distinct=distinct)) | ||
|
||
|
||
def bool_and(*args: Expr, distinct: bool = False) -> Expr: | ||
def bool_and(arg: Expr, distinct: bool = False) -> Expr: | ||
"""Computes the boolean AND of the arugment.""" | ||
args = [arg.expr for arg in args] | ||
return Expr(f.bool_and(*args, distinct=distinct)) | ||
return Expr(f.bool_and(arg.expr, distinct=distinct)) | ||
|
||
|
||
def bool_or(*args: Expr, distinct: bool = False) -> Expr: | ||
def bool_or(arg: Expr, distinct: bool = False) -> Expr: | ||
"""Computes the boolean OR of the arguement.""" | ||
args = [arg.expr for arg in args] | ||
return Expr(f.bool_or(*args, distinct=distinct)) | ||
return Expr(f.bool_or(arg.expr, distinct=distinct)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. I think these are on me - I should have realized they should only have one argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nay - these wrappers were correct for v39.
The regr_*
functions were incorrect, though. But that led me to discover that the tests for those are bypassing the wrappers: #778
src/functions.rs
Outdated
} | ||
|
||
#[pyfunction] | ||
#[pyo3(signature = (*args, distinct = false, filter = None, order_by = None, null_treatment = None))] | ||
#[pyo3(signature = (expr, distinct = false, filter = None, order_by = None, null_treatment = None))] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the wrappers now in place, I'm wondering if we should just remove all the pyo3 signatures just to reduce clutter in the repo. That shouldn't hold up this PR though. What do you think?
// TODO: should we just expose this in python? | ||
/// Create a COUNT(1) aggregate expression | ||
#[pyfunction] | ||
fn count_star() -> PyResult<PyExpr> { | ||
Ok(PyExpr { | ||
expr: Expr::AggregateFunction(AggregateFunction { | ||
func_def: datafusion_expr::expr::AggregateFunctionDefinition::BuiltIn( | ||
aggregate_function::AggregateFunction::Count, | ||
), | ||
args: vec![lit(1)], | ||
distinct: false, | ||
filter: None, | ||
order_by: None, | ||
null_treatment: None, | ||
}), | ||
}) | ||
fn count_star() -> PyExpr { | ||
functions_aggregate::expr_fn::count(lit(1)).into() | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per the question:
I think it's worth adding some guidance about what to put in python vs what to put in rust for this repo. In my mind the things that should go into the python side are
- Trivial aliases (this is a good example)
- Simple type conversion, like path -> string of the path or number to lit(number)
- More complex type conversion if it makes sense to do in python. For example, in the
named_struct
where sending in a dictionary on the python side makes a lot more sense and isn't as simple to do via pyo3.
And then I'd lean towards everything else sitting on the rust side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I add that to the functions.rs
module docs and/or somewhere in the contributor guide?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say the contributor guide since it's more general guidance and not specific to functions
@timsaucer I've captured the follow-on issues and added your two to the tracking issue #776 I'd prefer to merge large one in and follow-on with smaller PRs for the remaining. |
I've pushed this PR that just ensures we haven't missed any exports with the new wrappers. It might make sense to get it into the 40.0 release. |
An aside: To me, the process of releasing a new
In my view, this PR is just step 1, but I feel like that's not clear enough. And the "blast-radius" of the upgrade PR is such that I'd prefer to merge once CI is clean and then do the additional steps 2 and 3. |
Thank you. That is very helpful! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I planend on reviewing this today but ran out of time. I skimmed throught this and the changes look reasonable, so am approving on that basis. Thanks @Michael-J-Ward and thanks for the review @timsaucer.
Part of #776.
Closes #778.
Closes #774.
What changes are included in this PR?
Release notes for upstream
datafusion
v40Upstream
datafusion
continues its migration from built-in aggregate functions to User Defined Aggregate Functions. The APIs around these functions have been in serious flux, so the focus of this PR is to upgrade the deps and get all existing tests to pass.Our
regr_*
func wrappers are now properly tested.And aliases from
functions.rs
have been removed, leaning on our python wrappers.