Collapse sort into window expr and do sort within logical phase #571

jimexist · 2021-06-16T15:51:21Z

Which issue does this PR close?

Based on #564 so review that first

Closes #573
Closes #526

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

codecov-commenter · 2021-06-18T01:29:21Z

Codecov Report

Merging #571 (dd93f0e) into master (51e5445) will increase coverage by 0.07%.
The diff coverage is 89.65%.

@@            Coverage Diff             @@
##           master     #571      +/-   ##
==========================================
+ Coverage   76.02%   76.10%   +0.07%     
==========================================
  Files         156      156              
  Lines       27063    27145      +82     
==========================================
+ Hits        20575    20658      +83     
+ Misses       6488     6487       -1

Impacted Files	Coverage Δ
datafusion/src/physical_plan/mod.rs	`80.00% <78.57%> (+3.27%)`	⬆️
datafusion/src/physical_plan/windows.rs	`82.59% <79.41%> (+0.01%)`	⬆️
datafusion/src/physical_plan/planner.rs	`80.20% <92.59%> (+0.85%)`	⬆️
datafusion/src/execution/context.rs	`92.13% <100.00%> (+0.09%)`	⬆️
datafusion/src/logical_plan/expr.rs	`84.58% <100.00%> (+0.30%)`	⬆️
...afusion/src/physical_plan/expressions/nth_value.rs	`79.41% <100.00%> (+0.62%)`	⬆️
datafusion/src/sql/planner.rs	`84.72% <100.00%> (-0.04%)`	⬇️
datafusion/src/sql/utils.rs	`68.85% <100.00%> (ø)`
datafusion/tests/sql.rs	`99.29% <100.00%> (+<0.01%)`	⬆️
... and 2 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51e5445...dd93f0e. Read the comment docs.

jimexist · 2021-06-21T12:56:40Z

performance comparison:

Benchmarking window empty over, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 9.6s, enable flat sampling, or reduce sample count to 50.
window empty over, aggregate functions
                        time:   [1.8359 ms 1.8475 ms 1.8610 ms]
                        change: [-0.9132% +0.5180% +2.1212%] (p = 0.50 > 0.05)
                        No change in performance detected.
Found 11 outliers among 100 measurements (11.00%)
  7 (7.00%) high mild
  4 (4.00%) high severe

Benchmarking window empty over, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 7.0s, enable flat sampling, or reduce sample count to 50.
window empty over, built-in functions
                        time:   [1.3685 ms 1.3760 ms 1.3844 ms]
                        change: [-4.4221% -2.8960% -1.5378%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

window order by, aggregate functions
                        time:   [8.1463 ms 8.2384 ms 8.3348 ms]
                        change: [-4.8125% -3.2407% -1.6147%] (p = 0.00 < 0.05)
                        Performance has improved.

window order by, built-in functions
                        time:   [7.4629 ms 7.5195 ms 7.5799 ms]
                        change: [-0.3291% +0.5924% +1.5385%] (p = 0.23 > 0.05)
                        No change in performance detected.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

window partition by, u64_wide, aggregate functions
                        time:   [18.755 ms 18.945 ms 19.140 ms]
                        change: [+4.0084% +5.3579% +6.5762%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

window partition by, u64_narrow, aggregate functions
                        time:   [6.3792 ms 6.4266 ms 6.4811 ms]
                        change: [-2.6467% -1.6117% -0.4429%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe

window partition by, u64_wide, built-in functions
                        time:   [16.257 ms 16.400 ms 16.551 ms]
                        change: [-0.1608% +0.9364% +2.1068%] (p = 0.11 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

window partition by, u64_narrow, built-in functions
                        time:   [5.9251 ms 5.9624 ms 6.0024 ms]
                        change: [+0.8259% +1.7545% +2.7146%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

window partition and order by, u64_wide, aggregate functions
                        time:   [24.977 ms 25.136 ms 25.308 ms]
                        change: [-2.8450% -1.7267% -0.6014%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking window partition and order by, u64_narrow, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 5.4s, or reduce sample count to 90.
window partition and order by, u64_narrow, aggregate functions
                        time:   [54.278 ms 54.907 ms 55.598 ms]
                        change: [+0.8203% +2.1981% +3.7199%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  6 (6.00%) high mild
  7 (7.00%) high severe

window partition and order by, u64_wide, built-in functions
                        time:   [23.541 ms 23.740 ms 23.949 ms]
                        change: [-1.1173% +0.2109% +1.4568%] (p = 0.74 > 0.05)
                        No change in performance detected.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

window partition and order by, u64_narrow, built-in functions
                        time:   [15.821 ms 15.958 ms 16.103 ms]
                        change: [+0.3405% +1.5444% +2.6931%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

consider that the test data are generated per run, i would ignore 5%-ish variations.

jimexist · 2021-06-21T14:01:02Z

with 1million records and 8k batch size

window empty over, aggregate functions
                        time:   [31.516 ms 31.699 ms 31.875 ms]
                        change: [-2.6548% -1.7125% -0.8447%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

window empty over, built-in functions
                        time:   [24.184 ms 24.441 ms 24.736 ms]
                        change: [-5.3716% -3.9746% -2.5772%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  4 (4.00%) high severe

Benchmarking window order by, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 16.9s, or reduce sample count to 20.
window order by, aggregate functions
                        time:   [167.08 ms 168.43 ms 169.84 ms]
                        change: [-1.6618% -0.5790% +0.4650%] (p = 0.31 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe

Benchmarking window order by, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 15.5s, or reduce sample count to 30.
window order by, built-in functions
                        time:   [152.19 ms 153.62 ms 155.08 ms]
                        change: [-2.6093% -1.1311% +0.2267%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking window partition by, u64_wide, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 24.8s, or reduce sample count to 20.
window partition by, u64_wide, aggregate functions
                        time:   [239.96 ms 242.61 ms 245.37 ms]
                        change: [-5.2335% -3.7045% -2.2581%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking window partition by, u64_narrow, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 13.1s, or reduce sample count to 30.
window partition by, u64_narrow, aggregate functions
                        time:   [125.74 ms 127.00 ms 128.34 ms]
                        change: [-9.6106% -8.1814% -6.6467%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking window partition by, u64_wide, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 21.5s, or reduce sample count to 20.
window partition by, u64_wide, built-in functions
                        time:   [207.51 ms 208.79 ms 210.09 ms]
                        change: [-3.9621% -2.8513% -1.7668%] (p = 0.00 < 0.05)
                        Performance has improved.

Benchmarking window partition by, u64_narrow, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 11.2s, or reduce sample count to 40.
window partition by, u64_narrow, built-in functions
                        time:   [111.40 ms 112.44 ms 113.55 ms]
                        change: [-4.1476% -2.8162% -1.5589%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking window partition and order by, u64_wide, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 43.5s, or reduce sample count to 10.
window partition and order by, u64_wide, aggregate functions
                        time:   [420.60 ms 423.97 ms 427.48 ms]
                        change: [-2.0448% -0.9388% +0.1643%] (p = 0.10 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Benchmarking window partition and order by, u64_narrow, aggregate functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 62.5s, or reduce sample count to 10.
window partition and order by, u64_narrow, aggregate functions
                        time:   [617.40 ms 623.61 ms 630.15 ms]
                        change: [+1.3574% +2.6250% +3.9925%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

Benchmarking window partition and order by, u64_wide, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 38.7s, or reduce sample count to 10.
window partition and order by, u64_wide, built-in functions
                        time:   [382.85 ms 386.08 ms 389.30 ms]
                        change: [-0.7637% +0.3990% +1.5670%] (p = 0.51 > 0.05)
                        No change in performance detected.

Benchmarking window partition and order by, u64_narrow, built-in functions: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 37.8s, or reduce sample count to 10.
window partition and order by, u64_narrow, built-in functions
                        time:   [374.39 ms 377.18 ms 380.05 ms]
                        change: [-3.3808% -2.2769% -1.1041%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

jimexist · 2021-06-22T11:14:47Z

this will make way for #569

alamb · 2021-06-22T22:43:03Z

I am sorry I am behind on reviews in DataFusion -- I plan to work through the backlog tomorrow

datafusion/src/physical_plan/planner.rs

alamb · 2021-06-23T13:12:54Z

datafusion/src/physical_plan/planner.rs


+                // at this moment we are guaranteed by the logical planner
+                // to have all the window_expr to have equal sort key


Is the design that the planner will group all WindowExpr that have the same window (ORDER BY, PARTITION BY) into the same LogicalPlan:Window node? That sounds like a good plan to me 👍

It might (eventually) be worth validating that invariant here and asserting (or returning an Error` if the sort keys are not the same for multiple window exprs (mostly to catch potential bugs faster in the future)

i can add the assertion later. yes the invariant is real.

alamb · 2021-06-23T13:16:31Z

datafusion/src/sql/planner.rs

-        \n      WindowAggr: windowExpr=[[MIN(#orders.qty)]]\
-        \n        Sort: #orders.order_id DESC NULLS FIRST\
-        \n          TableScan: orders projection=None";
+        \n  WindowAggr: windowExpr=[[MAX(#orders.qty) ORDER BY [#orders.order_id ASC NULLS FIRST]]]\


Yeah this is cool to see two different WindowAggr nodes with different ORDER BY clauses. 👍

rust uses timsort, and it shall be quick (to reverse an already sorted vec).

ref: https://en.wikipedia.org/wiki/Timsort#Descending_runs

alamb · 2021-06-23T13:17:40Z

datafusion/src/logical_plan/expr.rs

@@ -1452,11 +1452,18 @@ impl fmt::Debug for Expr {
            }
            Expr::WindowFunction {
                fun,
-                ref args,
+                args,
+                partition_by,


It seems like the partition_by and order_by more logically belong on the LogicalPlan::Window node if they can be shared across Expr::WindowFunction

that's a separate construct not yet available in the sql parser.

see https://www.postgresql.org/docs/current/tutorial-window.html

When a query involves multiple window functions, it is possible to write out each one with a separate OVER clause, but this is duplicative and error-prone if the same windowing behavior is wanted for several functions. Instead, each windowing behavior can be named in a WINDOW clause and then referenced in OVER. For example:

SELECT sum(salary) OVER w, avg(salary) OVER w FROM empsalary WINDOW w AS (PARTITION BY depname ORDER BY salary DESC);

It seems like the partition_by and order_by more logically belong on the LogicalPlan::Window node if they can be shared across Expr::WindowFunction

for logically planned window function node it's not necessarily the case, because sort order is defined by the concatenations of partition_by and then order_by, but consider:

select max(c1) over (partition by c2 order by c3), max(c1) over (order by c2, c3) from test;

both window functions will have the same sort key of [c2 asc nulls first, c3 asc nulls first] but they mean different things, and they might be logically planned together (meaning same pre-sorting happens) but not the same evaluation will happen

I see -- this makes sense. Thank you for the clarification

Perhaps it is worth a comment (in a follow on PR) on the LogicalPlan window to this effect

alamb · 2021-06-23T13:17:58Z

Looks like a good step forward to me

alamb · 2021-06-23T15:35:55Z

datafusion/src/logical_plan/expr.rs

@@ -1452,11 +1452,18 @@ impl fmt::Debug for Expr {
            }
            Expr::WindowFunction {
                fun,
-                ref args,
+                args,
+                partition_by,


I see -- this makes sense. Thank you for the clarification

Perhaps it is worth a comment (in a follow on PR) on the LogicalPlan window to this effect

alamb · 2021-06-23T21:23:47Z

@jimexist there appears to be a build failure now:


   Compiling lz4 v1.23.2
   Compiling zstd v0.8.3+zstd.1.5.0
   Compiling datafusion v4.0.0-SNAPSHOT (/__w/arrow-datafusion/arrow-datafusion/datafusion)
error[E0282]: type annotations needed
   --> datafusion/src/sql/planner.rs:699:17
    |
699 |             let window_exprs = exprs.into_iter().cloned().collect();
    |                 ^^^^^^^^^^^^ consider giving `window_exprs` a type

error: aborting due to previous error

unless there are any other comments I'll plan to merge this tomorrow when the build issue gets fixed

jimexist · 2021-06-24T00:05:33Z

datafusion/src/physical_plan/planner.rs

+
+                let sort_keys = get_sort_keys(&window_expr[0]);
+                if window_expr.len() > 1 {
+                    debug_assert!(


@alamb added this

alamb · 2021-06-24T14:21:14Z

datafusion/src/physical_plan/planner.rs

+
+                let sort_keys = get_sort_keys(&window_expr[0]);
+                if window_expr.len() > 1 {
+                    debug_assert!(


alamb · 2021-06-24T14:21:31Z

Thanks again @jimexist

jimexist force-pushed the window-aggr-collapse-sort branch 3 times, most recently from 906b62d to f4341d4 Compare June 21, 2021 12:49

jimexist changed the title ~~WIP collapse sort into window expr and do sort within logical phase~~ Collapse sort into window expr and do sort within logical phase Jun 21, 2021

jimexist marked this pull request as ready for review June 21, 2021 12:50

jimexist force-pushed the window-aggr-collapse-sort branch 2 times, most recently from eef960a to 2f83560 Compare June 22, 2021 10:12

jimexist mentioned this pull request Jun 22, 2021

Use repartition in window functions to speed up #569

Merged

jimexist force-pushed the window-aggr-collapse-sort branch 2 times, most recently from 7845ac4 to e48fdd3 Compare June 23, 2021 00:47

houqp reviewed Jun 23, 2021

View reviewed changes

datafusion/src/physical_plan/planner.rs Outdated Show resolved Hide resolved

datafusion/src/physical_plan/planner.rs Outdated Show resolved Hide resolved

jimexist force-pushed the window-aggr-collapse-sort branch 2 times, most recently from ce6a559 to 7553b89 Compare June 23, 2021 07:42

alamb added the datafusion Changes in the datafusion crate label Jun 23, 2021

alamb approved these changes Jun 23, 2021

View reviewed changes

jimexist force-pushed the window-aggr-collapse-sort branch from 7553b89 to 5b44dce Compare June 23, 2021 15:34

alamb approved these changes Jun 23, 2021

View reviewed changes

implement window functions with partition by

20c19ad

jimexist force-pushed the window-aggr-collapse-sort branch from 5b44dce to 20c19ad Compare June 24, 2021 00:05

jimexist commented Jun 24, 2021

View reviewed changes

alamb approved these changes Jun 24, 2021

View reviewed changes

alamb merged commit 0d10dce into apache:master Jun 24, 2021

jimexist deleted the window-aggr-collapse-sort branch June 24, 2021 14:25

houqp added the performance Make DataFusion faster label Jul 31, 2021


		// at this moment we are guaranteed by the logical planner
		// to have all the window_expr to have equal sort key

Collapse sort into window expr and do sort within logical phase #571

Collapse sort into window expr and do sort within logical phase #571

Uh oh!

Conversation

jimexist commented Jun 16, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

codecov-commenter commented Jun 18, 2021

Codecov Report

Uh oh!

jimexist commented Jun 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jimexist commented Jun 21, 2021

Uh oh!

jimexist commented Jun 22, 2021

Uh oh!

alamb commented Jun 22, 2021

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jimexist Jun 23, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 23, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 23, 2021

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 24, 2021

Uh oh!

Uh oh!

jimexist commented Jun 16, 2021 •

edited

Loading

jimexist commented Jun 21, 2021 •

edited

Loading

jimexist Jun 23, 2021 •

edited

Loading