Cache common referred expression at the window input #9009

mustafasrepo · 2024-01-26T14:45:35Z

Which issue does this PR close?

Closes #.

Rationale for this change

PR #8960 retracted two of the window tests. This PR fixes those retracted tests and also adds a new feature: caching common expressions at the window input.

As an example, consider the following query:

SELECT c3,
    SUM(c9) OVER(ORDER BY c3+c4 DESC) as sum1,
    SUM(c9) OVER(ORDER BY c3+c4 ASC) as sum2
    FROM aggregate_test_100

which generates the following logical plan:

+   Projection: aggregate_test_100.c3, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum2
+   --Limit: skip=0, fetch=5
+   ----WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3 AS aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   ------WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3 AS aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   --------Projection: aggregate_test_100.c3 + aggregate_test_100.c4 AS aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3, aggregate_test_100.c3, aggregate_test_100.c9
+   ----------TableScan: aggregate_test_100 projection=[c3, c4, c9]

where the expression c3+c4 is computed once in the Projection directly above the TableScan, and its result (a column) is reused by the subsequent WindowAggr operators. This is done by the CommonSubexprEliminate rule.

However, for the following query

SELECT c3,
    SUM(c9) OVER(ORDER BY c3+c4 DESC, c9 DESC, c2 ASC) as sum1,
    SUM(c9) OVER(ORDER BY c3+c4 ASC, c9 ASC ) as sum2
    FROM aggregate_test_100

DataFusion generates the following plan:

+   Projection: aggregate_test_100.c3, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST, aggregate_test_100.c9 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum2
+   --WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST, aggregate_test_100.c9 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   ----Projection: aggregate_test_100.c3, aggregate_test_100.c4, aggregate_test_100.c9, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
+   ------WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   --------TableScan: aggregate_test_100 projection=[c2, c3, c4, c9]

where the result of c3+c4 is not cached in a Projection below the first WindowAggr. The reason is that each WindowAggr refers to c3+c4 only once, so the CommonSubexprEliminate rule does not consider extracting it to be helpful.
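The counting argument can be illustrated with a minimal sketch. This is not DataFusion's implementation: expressions are modelled as plain strings, and `count_occurrences` and `common_exprs` are hypothetical helpers introduced only for this example. It assumes an expression is a candidate for extraction only if it occurs more than once in the list being analyzed.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not a DataFusion API): counts how often each
/// expression, modelled here as a plain string, occurs in `exprs`.
fn count_occurrences<'a>(exprs: &[&'a str]) -> HashMap<&'a str, usize> {
    let mut counts = HashMap::new();
    for e in exprs {
        *counts.entry(*e).or_insert(0) += 1;
    }
    counts
}

/// Hypothetical helper: returns the expressions that occur more than once,
/// sorted so that the result is deterministic.
fn common_exprs<'a>(exprs: &[&'a str]) -> Vec<&'a str> {
    let mut common: Vec<&str> = count_occurrences(exprs)
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .map(|(e, _)| e)
        .collect();
    common.sort();
    common
}
```

When each window operator is analyzed on its own, the list contains `c3 + c4` only once, so `common_exprs(&["c3 + c4"])` is empty and nothing is extracted. The expression only crosses the threshold when the lists of both operators are counted together.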

What changes are included in this PR?

This PR fixes the problem above: the CommonSubexprEliminate rule now considers consecutive window operators together during common subexpression analysis. With this analysis, we can generate the following logical plan for the second query:

+   Projection: aggregate_test_100.c3, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum1, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST, aggregate_test_100.c9 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS sum2
+   --WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3 AS aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST, aggregate_test_100.c9 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 ASC NULLS LAST, aggregate_test_100.c9 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   ----Projection: aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3, aggregate_test_100.c3, aggregate_test_100.c9, SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
+   ------WindowAggr: windowExpr=[[SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3 AS aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(aggregate_test_100.c9) ORDER BY [aggregate_test_100.c3 + aggregate_test_100.c4 DESC NULLS FIRST, aggregate_test_100.c9 DESC NULLS FIRST, aggregate_test_100.c2 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
+   --------Projection: aggregate_test_100.c3 + aggregate_test_100.c4 AS aggregate_test_100.c3 + aggregate_test_100.c4aggregate_test_100.c4aggregate_test_100.c3, aggregate_test_100.c2, aggregate_test_100.c3, aggregate_test_100.c9
+   ----------TableScan: aggregate_test_100 projection=[c2, c3, c4, c9]

where the common computation is cached in a Projection below the first window operator.
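The idea of analyzing consecutive window operators together can be sketched as follows. This is a toy model under stated assumptions, not the code in common_subexpr_eliminate.rs: the `Plan` enum, `collect_consecutive_window_exprs`, `cacheable_exprs`, and `second_query_plan` are all hypothetical names, expressions are plain strings, and the window operators are assumed to be directly stacked at the point where the rule runs.

```rust
use std::collections::HashMap;

/// Toy logical plan (hypothetical): just enough structure to model a chain
/// of window operators.
enum Plan {
    /// A window operator with its non-trivial sub-expressions and its input.
    Window { exprs: Vec<String>, input: Box<Plan> },
    /// Any other operator; the walk stops here.
    Other,
}

/// Collects the expressions of this window operator and of every window
/// operator directly below it.
fn collect_consecutive_window_exprs(plan: &Plan, out: &mut Vec<String>) {
    if let Plan::Window { exprs, input } = plan {
        out.extend(exprs.iter().cloned());
        collect_consecutive_window_exprs(input, out);
    }
}

/// Returns the expressions referenced more than once across the whole chain,
/// sorted so that the result is deterministic.
fn cacheable_exprs(plan: &Plan) -> Vec<String> {
    let mut all = Vec::new();
    collect_consecutive_window_exprs(plan, &mut all);
    let mut counts: HashMap<String, usize> = HashMap::new();
    for e in all {
        *counts.entry(e).or_insert(0) += 1;
    }
    let mut common: Vec<String> = counts
        .into_iter()
        .filter(|(_, n)| *n > 1)
        .map(|(e, _)| e)
        .collect();
    common.sort();
    common
}

/// Models the two stacked window operators of the second query; each one
/// refers to `c3 + c4` exactly once.
fn second_query_plan() -> Plan {
    Plan::Window {
        exprs: vec!["c3 + c4".to_string()],
        input: Box::new(Plan::Window {
            exprs: vec!["c3 + c4".to_string()],
            input: Box::new(Plan::Other),
        }),
    }
}
```

In this model, `cacheable_exprs(&second_query_plan())` contains `c3 + c4`, because the expression is counted twice across the chain even though each operator refers to it only once.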

Are these changes tested?

Yes

Are there any user-facing changes?


alamb

Thank you for this contribution @mustafasrepo -- this PR looks good to me. Also, I found the description on this PR very clear and well written. Thank you very much 🙏

One thought I had was will there be a problem if there is a subquery that would end up with a nested WindowAggExec that could be incorrectly optimized away 🤔

Something like

SELECT c3,
    SUM(c9) OVER(ORDER BY c3+c4 ASC) as sum2,
    sum1
    FROM (
      SELECT c3, c4, c9,
      SUM(c9) OVER(ORDER BY c3+c4 DESC) as sum1
      FROM aggregate_test_100
    )

cc @waynexia and @haohuaijin

alamb · 2024-01-26T21:36:22Z

datafusion/optimizer/src/common_subexpr_eliminate.rs

-        let input_schema = Arc::clone(input.schema());
-        let arrays =
-            to_arrays(window_expr, input_schema, &mut expr_set, ExprMask::Normal)?;
+        // Get all window expressions inside the consecutive window operators.


Perhaps we can add a comment here about why this is recursively looking down into all window operations (e.g. because they all get the same input schema and append on some window functions, but the window functions can't refer to previous window functions).

I think perhaps you could reuse the (very nicely written) description from this PR which explains it very well

haohuaijin

Thanks @mustafasrepo and @alamb, looks great to me!

I also have the same question as @alamb.


mustafasrepo · 2024-01-29T08:14:51Z

I think that, in these cases, we will generate a sub-optimal plan, where a complex expression is calculated more than once by subsequent operators because it is not cached (this matches the previous behaviour). However, I don't think we will generate an invalid plan. I also added your example as a test case in this PR.

I think that, in a future PR, we can analyze the plan top-down to count how often each expression is referred to, so that referral counts are calculated more accurately across the whole plan.


alamb · 2024-01-29T12:25:01Z

"However, I don't think we will generate an invalid plan. I also added your example as a test case in this PR."

Thank you


Yes, I agree there is no need to optimize this case as part of this PR, and since it gives correct results, let's 🚀

mustafasrepo added 22 commits January 25, 2024 13:48

Initial commit

8a1705b

Update test

20d5aba

Minor changes

ee34597

Tmp

001f1d5

Retract some changes

b6379d2

Add lias to window

6b89835

Fix name change issue

f019ce5

Minor changes

c39d59c

Minor changes

5d35279

Un comment new rule

bff5989

Open up new rules

d3a8e9b

Minor changes

ebf6a0c

Change test

c12af19

remove prints

660aaa0

Update slt tests

e1d0126

Remove leftover code

1e5ffd4

Resolve linter errors

727abab

Minor changes

a2cfbf0

Merge branch 'apache_main' into feature/window_expr_refer

c5f567c


Remove group window rule

386fc25

Remove unnecessary changes

132771c

Minor changes

ea6cfac

github-actions bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) labels Jan 26, 2024

alamb mentioned this pull request Jan 26, 2024

DataFusion weekly project plan (Andrew Lamb) - Jan 22, 2024 #8933

Closed

9 tasks

alamb approved these changes Jan 26, 2024


haohuaijin approved these changes Jan 27, 2024


mustafasrepo and others added 3 commits January 29, 2024 10:33

Update datafusion/optimizer/src/common_subexpr_eliminate.rs

db697d0

Co-authored-by: Huaijin <haohuaijin@gmail.com>

Merge branch 'apache_main' into feature/window_expr_refer

f203c0e


Update comment, add new test

13f532c

Merge branch 'apache_main' into feature/window_expr_refer

d4a81d7


alamb approved these changes Jan 29, 2024


alamb merged commit a57e270 into apache:main Jan 29, 2024
