GROUP-BY prioritizes input columns in case of ambiguity #9228
# The column name referenced by HAVING is ambiguous
query I
SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
The generation of the HAVING expression uses the same schema as GROUP BY.
The changes in this PR will also keep the result of this statement consistent with PostgreSQL.
Thank you @jonahgao -- this change makes sense to me
I believe we have seen issues related to this in InfluxDB as well.
I verified that this PR makes DataFusion consistent with PostgreSQL:
postgres=# CREATE TABLE t(a BIGINT);
CREATE TABLE
postgres=# INSERT INTO t VALUES (1), (2), (3);
INSERT 0 3
postgres=# SELECT 0 as "t.a" FROM t GROUP BY t.a;
t.a
-----
0
0
0
(3 rows)
postgres=# SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
t.a
-----
(0 rows)
postgres=#
I also verified on main that DataFusion takes the values from the select list:
andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ datafusion-cli
DataFusion CLI v35.0.0
❯ CREATE TABLE t(a BIGINT) AS VALUES(1), (2), (3);
0 rows in set. Query took 0.011 seconds.
❯ SELECT 0 as "t.a" FROM t GROUP BY t.a;
+-----+
| t.a |
+-----+
| 0 |
+-----+
1 row in set. Query took 0.007 seconds.
❯ SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
+-----+
| t.a |
+-----+
| 0 |
+-----+
1 row in set. Query took 0.004 seconds.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thank you @alamb for reviewing and for the suggestions.
Thanks again @jonahgao
Which issue does this PR close?
Closes #9162.
Rationale for this change
When a column referenced by GROUP BY exists both in the select list and in the input, the one from the input should be given priority. In issue 9162, there are two references with the same name: one is the unqualified select-list alias "t.a", and the other is the qualified input column t.a (column a of table t).
This is the practice of many databases, including PostgreSQL, Oracle, MySQL, and DuckDB.
In the PostgreSQL documentation, there is an explanation about it.
What changes are included in this PR?
Prioritize searching the schema of the base plan when generating GROUP BY expressions.
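The lookup order described above can be sketched roughly as follows. This is an illustrative Python model only, not DataFusion's actual Rust implementation: the function names, the `(qualifier, name)` tuples standing in for the base plan's schema, and the plain list of select-list aliases are all hypothetical.

```python
from __future__ import annotations


def qualified_name(qualifier: str | None, name: str) -> str:
    """Render a column the way it would appear in a query, e.g. 't.a'."""
    return f"{qualifier}.{name}" if qualifier else name


def resolve_group_by(
    reference: str,
    input_columns: list[tuple[str | None, str]],
    select_aliases: list[str],
) -> tuple[str, str]:
    """Resolve a GROUP BY reference, preferring input columns over aliases.

    Hypothetical model: `input_columns` plays the role of the base plan's
    schema and `select_aliases` the output names of the select list.
    """
    # 1. Search the input (base plan) schema first.
    for qualifier, name in input_columns:
        if reference in (name, qualified_name(qualifier, name)):
            return ("input", qualified_name(qualifier, name))
    # 2. Fall back to select-list aliases only if the input has no match.
    for alias in select_aliases:
        if reference == alias:
            return ("select_list", alias)
    raise LookupError(f"column {reference!r} not found")
```

With the table from issue 9162, `resolve_group_by("t.a", [("t", "a")], ["t.a"])` picks the input column even though an alias with the same name exists, while a reference that matches only an alias still resolves to the select list.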
Are these changes tested?
Yes
Are there any user-facing changes?
No