Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses #8386

Closed
haohuaijin opened this issue Dec 1, 2023 · 7 comments · Fixed by #8559
Closed

[BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses #8386

haohuaijin opened this issue Dec 1, 2023 · 7 comments · Fixed by #8559
Assignees
Labels
bug Something isn't working

Comments

@haohuaijin
Copy link
Contributor

Describe the bug

During work on #8371, @comphead proposed that we can test partition by null and order by null

I test this and find partition by null and order by null will fail in all windows function(maybe, I'm not test all)

❯ CREATE TABLE t1 (a int) AS VALUES (1), (2), (3);
0 rows in set. Query took 0.018 seconds.

❯ select rank() over (partition by null) from t1;
thread 'tokio-runtime-worker' panicked at /home/hhj/datafusion/datafusion/physical-plan/src/repartition/mod.rs:208:79:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("must either specify a row count or at least one column")
Execution error: Expects PARTITION BY expression to be ordered

❯ select rank() over (order by null) from t1;
type_coercion
caused by
Internal error: Cannot run range queries on datatype: Null.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

❯ select sum(a) over (order by null) from t1;
type_coercion
caused by
Internal error: Cannot run range queries on datatype: Null.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

❯ select sum(a) over (partition by null) from t1;
Internal error: All partition by columns should have an ordering.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

To Reproduce

No response

Expected behavior

correct work like duckdb

Additional context

No response

@haohuaijin haohuaijin added the bug Something isn't working label Dec 1, 2023
@alamb
Copy link
Contributor

alamb commented Dec 1, 2023

This is not yet fixed (I had hoped it would be in #8371)

❯  CREATE TABLE t1 (a int) AS VALUES (1), (2), (3);
0 rows in set. Query took 0.008 seconds.

❯ select rank() over (order by null) from t1;
type_coercion
caused by
Internal error: Cannot run range queries on datatype: Null.
This was likely caused by a bug in DataFusion's code and we would welcome that you file an bug report in our issue tracker

@comphead
Copy link
Contributor

comphead commented Dec 1, 2023

The problem is a little bit wider: not only null problem, DF cannot handle scalars at all in window frame

❯ select a, rank() over (partition by 1 order by 1) from (select 1 a union all select 2 a);
Execution error: Sort operation is not applicable to scalar value 1

For PG this query is valid. I'll take a look

@comphead comphead self-assigned this Dec 1, 2023
@comphead comphead changed the title windows function don't support partition by null and order by null and cause error [BUG]: windows frame don't support scalars in ORDER BY/PARTITION BY clauses Dec 1, 2023
@comphead
Copy link
Contributor

Moving discussion from PR to the ticket

@mustafasrepo can I have your input please. The reason is the query fails

select rank() over (order by 1) rnk from (select 1 a union all select 2 a) x
Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

The reason for that is optimize_projections removes the unused fields and this breaks the consistency. The question for you, what do you think is expected behavior for such rare case?

PS. If add a column name to projection or to ORDER BY it will expectedly work.

@mustafasrepo
Copy link
Contributor

Moving discussion from PR to the ticket

@mustafasrepo can I have your input please. The reason is the query fails

select rank() over (order by 1) rnk from (select 1 a union all select 2 a) x
Arrow error: Invalid argument error: number of columns(2) must match number of fields(1) in schema

The reason for that is optimize_projections removes the unused fields and this breaks the consistency. The question for you, what do you think is expected behavior for such rare case?

PS. If add a column name to projection or to ORDER BY it will expectedly work.

I expect this query to work. I understood the root cause of this problem (explained in the PR body). This PR solves this issue. Thanks @comphead for discovering this bug.

@mustafasrepo
Copy link
Contributor

I want to ask a question regarding the scalars inside PARTITION BY AND ORDER BY clauses. Maybe someone familiar with the standard can answer this.

Is there any difference between results of RANK() OVER() AND RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>). If there is no difference maybe we can rewrite expression RANK() OVER(PARTITION BY <literal>, ORDER BY <literal>) into RANK() OVER().

@alamb
Copy link
Contributor

alamb commented Dec 15, 2023

Is there any difference between results of RANK() OVER() AND RANK() OVER(PARTITION BY , ORDER BY ). If there is no difference maybe we can rewrite expression RANK() OVER(PARTITION BY , ORDER BY ) into RANK() OVER().

I do not know of any difference in these queries

Rewriting the query sounds like an elegant solution to me.

I suggest we do the rewrite in the analyzer pass (rather than SQL planner) so such queries can also be run via the DataFrame API

@comphead
Copy link
Contributor

Thank you guys, great work to close multiple issues related to this corner case, I created another PR for PARTITION BY which has been failed before, and there likely will be followup for RANK function which somehow works not correctly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
4 participants