-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update code comment for the cases of regularized RANGE frame and add tests for ORDER BY cases with RANGE frame #8410
Conversation
// 1. start bound is UNBOUNDED or CURRENT ROW | ||
// 2. end bound is CURRENT ROW or UNBOUNDED. | ||
// In these cases, we regularize the RANGE frame to be equivalent to a ROWS | ||
// frame with the UNBOUNDED bounds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that actually Spark doesn't allow empty ORDER BY clause for RANGE frame in any cases. I updated this according to the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So DataFusion is following Postgres behavior on this.
postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR: RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
^
postgres=# select a, rank() over (partition by a RANGE between UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk from (select 1 a) q;
a | rnk
---+-----
1 | 1
(1 row)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
postgres=# select a, rank() over (partition by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR: RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a RANGE between 1 PRECED...
^
postgres=# select a, rank() over (partition by a order by a RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
a | rnk
---+-----
1 | 1
(1 row)
postgres=# select a, rank() over (partition by a order by a, a + 1 RANGE between 1 PRECEDING AND 2 FOLLOWING) rnk from (select 1 a) q;
ERROR: RANGE with offset PRECEDING/FOLLOWING requires exactly one ORDER BY column
LINE 1: select a, rank() over (partition by a order by a, a + 1 RANG...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, actually Postgres behavior allows empty ORDER BY clause, one column or multiple column ORDER BY clause with RANGE frame if start bound/end bound is UNBOUNDED or CURRENT ROW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In short, DataFusion currently allows various ORDER BY
clause cases (empty, 1 or more column) with RANGE frame, similar to Postgres.
But for empty ORDER BY
clause case, the window function result is different to Postgres.
See added tests.
// If an ORDER BY clause is absent, the frame is equivalent to a ROWS | ||
// frame with the UNBOUNDED bounds. | ||
// If an ORDER BY clause is present but has more than one column, the | ||
// frame is unchanged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, based on Postgres behavior, it doesn't change the RANGE frame to ROWS frame (see the added tests in sqllogictest which return different results than Postgres).
Based on Postgres doc, without ORDER BY
, Postgres treats all rows of the partition as peers of the current row (https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-WINDOW-FUNCTIONS).
# TODO: this is different to Postgres which returns [1, 1] for `rnk`. | ||
query I | ||
select rank() over (order by 1 RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) rnk | ||
from (select 1 a union select 2 a) q ORDER BY rnk | ||
---- | ||
1 | ||
2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a fix for the issue locally. But this PR focuses on doc and test change, so I will submit the fix after this is merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch and great job! Thank you @viirya.
This PR has been very carefully described and explained
Thanks @jackwener |
…tests for ORDER BY cases with RANGE frame (apache#8410) * fix: RANGE frame can be regularized to ROWS frame only if empty ORDER BY clause * Fix flaky test * Update test comment * Add code comment * Update
Which issue does this PR close?
Closes #.
Rationale for this change
We have regularized RANGE window frame to ROWS frame if ORDER BY clause is empty under special cases (based on the comment), but the code actually ignores non-empty ORDER BY clause which implicitly allows more than one ORDER BY columns. This case is not reflected in the code comment. This updates code comment to make it more clear.
This also adds a few tests in
sqllogictest
for variousORDER BY
cases with RANGE frame. By comparing their results with Postgres, there are a few TODOs added for the difference.What changes are included in this PR?
Updated code comment and added tests.
Are these changes tested?
Added tests to
sqllogictest
.Are there any user-facing changes?