Upgrade Datafusion 40 #771

Michael-J-Ward · 2024-07-24T19:31:06Z

Part of #776.
Closes #778.
Closes #774.

What changes are included in this PR?

Release notes for upstream datafusion v40

Upstream datafusion continues its migration from built-in aggregate functions to User Defined Aggregate Functions. The APIs around these functions have been in serious flux, so the focus of this PR is to upgrade the deps and get all existing tests to pass.

Our regr_* func wrappers are now properly tested.

And aliases from functions.rs have been removed, leaning on our python wrappers.

This required trait method was added upstream [0]; upstream recommends simply forwarding to `static_name`. [0]: apache/datafusion#10266

Upstream signatures were changed for the new `AggregateBuilder` API [0]. This simply gets the code to work. We should better incorporate that API into `datafusion-python`. [0] apache/datafusion#10560

Builtin Count was removed upstream. TBD whether we want to re-implement `count_star` with new API. Ref: apache/datafusion#10893

… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917

Ref: apache/datafusion#10964

Ref: apache/datafusion#10884

Ref: apache/datafusion#10906

Ref: apache/datafusion#10827

The python wrapper now provides stddev_samp alias.

Ref: apache/datafusion#10836

Ref: apache/datafusion#10898

The functions now take a single expression instead of a Vec<_>. Ref: apache/datafusion#10930

`approx_percentile_cont` is now returning a DoubleArray instead of an IntArray. This may be a bug upstream; it requires further investigation.

This was changed upstream. Ref: apache/datafusion#10831

The alias list no longer includes the name of the function. Ref: apache/datafusion#10658
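As a rough illustration of why the lookup has to check both fields, here is a minimal pure-Python sketch. The `AggregateFn` class, the registry contents, and `find_aggregate` are hypothetical stand-ins for illustration only; they are not the upstream types or the code in this PR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AggregateFn:
    """Hypothetical stand-in for an aggregate function definition."""

    name: str
    aliases: tuple[str, ...] = ()


# Illustrative registry. The aliases deliberately do NOT repeat the primary
# name, mirroring the upstream change described above.
DEFAULT_AGGREGATES = [
    AggregateFn("avg", aliases=("mean",)),
    AggregateFn("stddev", aliases=("stddev_samp",)),
    AggregateFn("sum"),
]


def find_aggregate(name: str) -> AggregateFn:
    """Return the first function whose name or aliases match `name`."""
    for fn in DEFAULT_AGGREGATES:
        # Checking only `fn.aliases` would miss "avg" and "sum" here.
        if name == fn.name or name in fn.aliases:
            return fn
    raise KeyError(f"aggregate function `{name}` not found")
```

With this shape, `find_aggregate("mean")` and `find_aggregate("avg")` resolve to the same entry, while an alias-only search would fail for `"avg"`.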

`first_value` and `last_value` are currently failing and marked as xfail.

The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior. This allowed me to remove `marks=pytest.xfail` from the window tests.
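The effect of searching built-ins first can be shown with a toy model of the lookup order. The two dictionaries and `resolve_window_fn` are assumptions made up for this sketch; the real `find_window_fn` lives in the Rust code and works on function objects rather than strings.

```python
from __future__ import annotations

# Hypothetical registries mapping a function name to a tag that says where
# the implementation came from.
BUILTIN_WINDOW_FNS = {
    "first_value": "builtin",
    "last_value": "builtin",
    "row_number": "builtin",
}
AGGREGATE_UDAFS = {
    "first_value": "udaf",
    "last_value": "udaf",
    "sum": "udaf",
}


def resolve_window_fn(name: str) -> str:
    """Resolve `name`, preferring built-in window functions over UDAFs."""
    if name in BUILTIN_WINDOW_FNS:
        return BUILTIN_WINDOW_FNS[name]
    if name in AGGREGATE_UDAFS:
        return AGGREGATE_UDAFS[name]
    # Include the requested name so the failure is easy to diagnose.
    raise KeyError(f"window function `{name}` not found")
```

Because `first_value` exists in both registries, the order of the two checks decides which behavior a window expression gets; names that only exist as UDAFs, such as `sum` in this sketch, still resolve.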

timsaucer

Overall I don't see any showstoppers. Most of my comments are just thoughts; the one issue I see is about maybe updating the macro.

If possible I would like to get these merged in before the release. Both are approved.
#768
#770

timsaucer · 2024-07-26T12:44:01Z

src/functions.rs

+#[pyfunction]
+pub fn approx_distinct(expression: PyExpr) -> PyExpr {
+    functions_aggregate::expr_fn::approx_distinct::approx_distinct(expression.expr).into()
+}
+
+#[pyfunction]
+pub fn approx_median(expression: PyExpr, distinct: bool) -> PyResult<PyExpr> {
+    let expr = functions_aggregate::expr_fn::approx_median(expression.expr);
+    if distinct {
+        Ok(expr.distinct().build()?.into())
+    } else {
+        Ok(expr.into())
+    }
+}


It feels like these are a lot of work. Is it too difficult to instead update the aggregate_function! macro?

I agree.

One of the first functions I went to port was first_value, and I came away certain that we'd end up with some Builder API around the aggregate / window functions. Additionally, the functions currently expose different builder args.

So, I decided to go the explicit route instead of writing a macro_rules! just to change it next release.

(That's why I was excited to see you doing precisely that Builder API work upstream.)

timsaucer · 2024-07-26T12:46:53Z

python/datafusion/functions.py

+def bit_and(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the bitwise AND of the argument."""
-    args = [arg.expr for arg in args]
-    return Expr(f.bit_and(*args, distinct=distinct))
+    return Expr(f.bit_and(arg.expr, distinct=distinct))


-def bit_or(*args: Expr, distinct: bool = False) -> Expr:
+def bit_or(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the bitwise OR of the argument."""
-    args = [arg.expr for arg in args]
-    return Expr(f.bit_or(*args, distinct=distinct))
+    return Expr(f.bit_or(arg.expr, distinct=distinct))


-def bit_xor(*args: Expr, distinct: bool = False) -> Expr:
+def bit_xor(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the bitwise XOR of the argument."""
-    args = [arg.expr for arg in args]
-    return Expr(f.bit_xor(*args, distinct=distinct))
+    return Expr(f.bit_xor(arg.expr, distinct=distinct))


-def bool_and(*args: Expr, distinct: bool = False) -> Expr:
+def bool_and(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the boolean AND of the argument."""
-    args = [arg.expr for arg in args]
-    return Expr(f.bool_and(*args, distinct=distinct))
+    return Expr(f.bool_and(arg.expr, distinct=distinct))


-def bool_or(*args: Expr, distinct: bool = False) -> Expr:
+def bool_or(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the boolean OR of the argument."""
-    args = [arg.expr for arg in args]
-    return Expr(f.bool_or(*args, distinct=distinct))
+    return Expr(f.bool_or(arg.expr, distinct=distinct))


Thank you. I think these are on me - I should have realized they should only have one argument.

Nay - these wrappers were correct for v39.

The regr_* functions were incorrect, though. But that led me to discover that the tests for those are bypassing the wrappers: #778
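A minimal sketch of the wrapper shape discussed in this thread, and of a test that goes through the wrapper instead of bypassing it. The `Expr` class and the `_internal` namespace below are simplified stand-ins for the real wrapper class and the compiled module, so the returned tuples are purely illustrative.

```python
from __future__ import annotations


class Expr:
    """Simplified stand-in for the Python-side expression wrapper."""

    def __init__(self, expr: object) -> None:
        self.expr = expr


class _internal:
    """Hypothetical stand-in for the compiled (Rust) functions module."""

    @staticmethod
    def bit_and(expr: object, distinct: bool = False) -> tuple:
        # The real binding would return an expression object; a tuple is
        # enough here to observe what the wrapper passed along.
        return ("bit_and", expr, distinct)


def bit_and(arg: Expr, distinct: bool = False) -> Expr:
    """Computes the bitwise AND of the argument."""
    # A single expression is unwrapped and forwarded, matching the v40 shape.
    return Expr(_internal.bit_and(arg.expr, distinct=distinct))


def test_bit_and_goes_through_wrapper() -> None:
    # Calling the public wrapper (not `_internal`) is what catches a
    # mismatch between the Python signature and the binding.
    result = bit_and(Expr("col_a"), distinct=True)
    assert isinstance(result, Expr)
    assert result.expr == ("bit_and", "col_a", True)
```

A test that called `_internal.bit_and` directly would pass even if the wrapper's signature were wrong, which is the gap described in #778.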

timsaucer · 2024-07-26T12:48:44Z

src/functions.rs


 #[pyfunction]
-#[pyo3(signature = (*args, distinct = false, filter = None, order_by = None, null_treatment = None))]
+#[pyo3(signature = (expr, distinct = false, filter = None, order_by = None, null_treatment = None))]


With the wrappers now in place, I'm wondering if we should just remove all the pyo3 signatures just to reduce clutter in the repo. That shouldn't hold up this PR though. What do you think?

timsaucer · 2024-07-26T12:53:04Z

src/functions.rs

+// TODO: should we just expose this in python?
 /// Create a COUNT(1) aggregate expression
 #[pyfunction]
-fn count_star() -> PyResult<PyExpr> {
-    Ok(PyExpr {
-        expr: Expr::AggregateFunction(AggregateFunction {
-            func_def: datafusion_expr::expr::AggregateFunctionDefinition::BuiltIn(
-                aggregate_function::AggregateFunction::Count,
-            ),
-            args: vec![lit(1)],
-            distinct: false,
-            filter: None,
-            order_by: None,
-            null_treatment: None,
-        }),
-    })
+fn count_star() -> PyExpr {
+    functions_aggregate::expr_fn::count(lit(1)).into()
+}
+


Per the question:

I think it's worth adding some guidance about what to put in python vs what to put in rust for this repo. In my mind the things that should go into the python side are:

- Trivial aliases (this is a good example)

- Simple type conversion, like path -> string of the path or number to lit(number)

- More complex type conversion if it makes sense to do in python. For example, in named_struct, sending in a dictionary on the python side makes a lot more sense and isn't as simple to do via pyo3.

And then I'd lean towards everything else sitting on the rust side.

Should I add that to the functions.rs module docs and/or somewhere in the contributor guide?

I'd say the contributor guide since it's more general guidance and not specific to functions
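To make the proposed split concrete, here is a small sketch of the first two kinds of Python-side code: a trivial alias and a simple type conversion. The `Expr`, `lit`, and `count` definitions are stand-ins written for this example, and `as_path_str` is a hypothetical helper name; none of them are the project's actual implementations.

```python
from __future__ import annotations

import os
from pathlib import Path


class Expr:
    """Simplified stand-in for the Python-side expression wrapper."""

    def __init__(self, expr: object) -> None:
        self.expr = expr


def lit(value: object) -> Expr:
    """Stand-in for wrapping a Python value as a literal expression."""
    return Expr(("lit", value))


def count(arg: Expr, distinct: bool = False) -> Expr:
    """Stand-in for the wrapper around the Rust-side `count` binding."""
    return Expr(("count", arg.expr, distinct))


def count_star() -> Expr:
    """Trivial alias: COUNT(1) expressed through existing wrappers."""
    return count(lit(1))


def as_path_str(path: str | os.PathLike) -> str:
    """Simple type conversion: accept str or Path, hand a str to Rust."""
    return os.fspath(path)
```

Under this split, `count_star` needs no dedicated Rust function at all, and path-like arguments are normalized before they cross the pyo3 boundary.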

Michael-J-Ward · 2024-07-26T16:00:22Z

@timsaucer I've captured the follow-on issues and added your two to the tracking issue #776

I'd prefer to merge this large one in and follow up with smaller PRs for the remaining items.

Ref: apache#779

timsaucer · 2024-07-27T13:28:19Z

I've pushed this PR that just ensures we haven't missed any exports with the new wrappers. It might make sense to get it into the 40.0 release.

#782

Michael-J-Ward · 2024-07-27T17:13:05Z

An aside:

To me, the process of releasing a new datafusion-python version:

1. Upgrade the datafusion deps and migrate code so new code compiles
2. Integrate new upstream features that are available.
3. Any additional fixes / improvements we want to get into this release
4. Tag a commit as new version and do the release.

In my view, this PR is just step 1, but I feel like that's not clear enough.

And the "blast-radius" of the upgrade PR is such that I'd prefer to merge once CI is clean and then do the additional steps 2 and 3.

timsaucer · 2024-07-27T18:09:23Z

Thank you. That is very helpful!


andygrove

I planned on reviewing this today but ran out of time. I skimmed through this and the changes look reasonable, so I am approving on that basis. Thanks @Michael-J-Ward and thanks for the review @timsaucer.

Michael-J-Ward marked this pull request as draft July 24, 2024 19:32

Michael-J-Ward marked this pull request as ready for review July 24, 2024 23:20

Michael-J-Ward mentioned this pull request Jul 25, 2024

Bugfix: Calling count with None arguments #768

Merged

Michael-J-Ward marked this pull request as draft July 25, 2024 02:31

Michael-J-Ward added 22 commits July 25, 2024 09:52

chore: update datafusion deps

8d3a215

feat: impl ExecutionPlan::static_name() for DatasetExec

0179c6f

This required trait method was added upstream [0]; upstream recommends simply forwarding to `static_name`. [0]: apache/datafusion#10266

feat: update first_value and last_value wrappers.

61f5ea3

Upstream signatures were changed for the new `AggregateBuilder` API [0]. This simply gets the code to work. We should better incorporate that API into `datafusion-python`. [0] apache/datafusion#10560

migrate count to UDAF

86d1ad0

Builtin Count was removed upstream. TBD whether we want to re-implement `count_star` with new API. Ref: apache/datafusion#10893

migrate approx_percentile_cont, approx_distinct, and approx_median to…

f4a0828

… UDAF Ref: approx_distinct apache/datafusion#10851 Ref: approx_median apache/datafusion#10840 Ref: approx_percentile_cont and _with_weight apache/datafusion#10917

migrate avg to UDAF

3a277a8

Ref: apache/datafusion#10964

migrate corr to UDAF

98498e9

Ref: apache/datafusion#10884

migrate grouping to UDAF

f1717a2

Ref: apache/datafusion#10906

add alias mean for UDAF avg

c790454

migrate stddev to UDAF

d0018ea

Ref: apache/datafusion#10827

remove rust alias for stddev

d0ae202

The python wrapper now provides stddev_samp alias.

migrate var_pop to UDAF

40d9d3e

Ref: apache/datafusion#10836

migrate regr_* functions to UDAF

86d9d9b

Ref: apache/datafusion#10898

migrate bitwise functions to UDAF

8881e2a

The functions now take a single expression instead of a Vec<_>. Ref: apache/datafusion#10930

add missing variants for ScalarValue with todo

c754d82

fix typo in approx_percentile_cont

e3a4a7a

add distinct arg to count

9b2f63b

comment out failing test

6539c0c

`approx_percentile_cont` is now returning a DoubleArray instead of an IntArray. This may be a bug upstream; it requires further investigation.

update tests to expect lowercase sum in query plans

11e601f

This was changed upstream. Ref: apache/datafusion#10831

update ScalarType data_type map

77b24e3

add docs dependency pickleshare

3b54873

re-implement count_star

1d1cd84

Michael-J-Ward force-pushed the upgrade-40 branch from 7791cac to 1d1cd84 Compare July 25, 2024 14:56

Michael-J-Ward added 3 commits July 25, 2024 10:00

lint: ruff python lint

e6775a3

lint: rust cargo fmt

8ca3469

include name of window function in error for find_window_fn

a521310

Michael-J-Ward added 2 commits July 25, 2024 11:27

search default aggregate functions by both name and aliases

df2cbad

The alias list no longer includes the name of the function. Ref: apache/datafusion#10658

fix markdown in find_window_fn docs

a8a6c9d

Michael-J-Ward force-pushed the upgrade-40 branch from bc88926 to 774c056 Compare July 25, 2024 17:38

parameterize test_window_functions

a38aa99

`first_value` and `last_value` are currently failing and marked as xfail.

Michael-J-Ward force-pushed the upgrade-40 branch from 774c056 to a38aa99 Compare July 25, 2024 17:46

add test ids to test_simple_select tests marked xfail

b6eee28

Michael-J-Ward force-pushed the upgrade-40 branch from 710cdc6 to b6eee28 Compare July 25, 2024 17:47

Michael-J-Ward added 3 commits July 25, 2024 13:28

update find_window_fn to search built-ins first

39893fc

The behavior of `first_value` and `last_value` UDAFs currently does not match the built-in behavior. This allowed me to remove `marks=pytest.xfail` from the window tests.

improve first_call and last_call use of the builder API

22f372b

remove trailing todos

7211ba6

Michael-J-Ward marked this pull request as ready for review July 25, 2024 19:48

Michael-J-Ward changed the title ~~WIP: Upgrade Datafusion 40~~ Upgrade Datafusion 40 Jul 25, 2024

fix examples/substrait.py

a75861d

timsaucer reviewed Jul 26, 2024


Michael-J-Ward mentioned this pull request Jul 26, 2024

clarify separation between rust code and python wrappers #779

Closed

Michael-J-Ward added 4 commits July 26, 2024 11:06

chore: remove explicit aliases from functions.rs

0c57e59

Ref: apache#779

remove array_fn! aliases

eb905fb

remove alias rules for expr_fn_vec!

00779b7

remove alias rules from expr_fn! macro

dc2db41

timsaucer approved these changes Jul 26, 2024


Michael-J-Ward added 4 commits July 29, 2024 11:08

remove unnecessary pyo3 var-arg signatures in functions.rs

65ea065

remove pyo3 signatures that provided defaults for first_value and las…

2009741

…t_value

parametrize test_string_functions

e17ba64

test regr_ function wrappers

d293398

Closes apache#778

andygrove approved these changes Jul 31, 2024


andygrove merged commit f580155 into apache:main Jul 31, 2024
