GROUP-BY prioritizes input columns in case of ambiguity #9228

jonahgao · 2024-02-14T16:57:50Z

Which issue does this PR close?

Closes #9162.

Rationale for this change

When a column referenced by GROUP BY exists both in the select list and in the input, the one from the input should take priority. In issue #9162 there are two candidates with the same name: an unqualified "t.a" (the select-list alias) and a qualified t.a (column a of table t).

This is the practice of many databases, including PostgreSQL, Oracle, MySQL, and DuckDB.
The PostgreSQL documentation explains it:

An expression used inside a grouping_element can be an input column name, or the name or ordinal number of an output column (SELECT list item), or an arbitrary expression formed from input-column values. In case of ambiguity, a GROUP BY name will be interpreted as an input-column name rather than an output column name.

What changes are included in this PR?

Prioritize searching the schema of the base plan when generating GROUP BY expressions.
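
To illustrate the lookup order described above, here is a minimal sketch. It is not the code in this PR: the `Resolved` enum and the `resolve_group_by` function are hypothetical names, and columns are modeled as flat strings, whereas the real planner distinguishes qualified from unqualified names. It only shows the idea of consulting the input (base plan) schema first and falling back to select-list aliases.

```rust
/// Where a GROUP BY name ended up pointing. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Resolved {
    /// Index of a column in the input (base plan) schema.
    InputColumn(usize),
    /// Index of an item in the SELECT list, matched by its alias.
    SelectItem(usize),
}

/// Resolve `name` against the input schema first. Only when no input column
/// matches do we look at the SELECT-list aliases. Returns `None` if neither
/// side knows the name.
fn resolve_group_by(
    name: &str,
    input_columns: &[&str],
    select_aliases: &[&str],
) -> Option<Resolved> {
    input_columns
        .iter()
        .position(|c| *c == name)
        .map(Resolved::InputColumn)
        .or_else(|| {
            select_aliases
                .iter()
                .position(|a| *a == name)
                .map(Resolved::SelectItem)
        })
}
```

Under these assumptions, for `SELECT 0 AS "t.a" FROM t GROUP BY t.a` both the input column and the alias are spelled `t.a`, and `resolve_group_by("t.a", &["t.a"], &["t.a"])` picks the input column. An alias that has no counterpart in the input still resolves to the select-list item.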

Are these changes tested?

Yes

Are there any user-facing changes?

No

jonahgao · 2024-02-14T17:02:08Z

datafusion/sqllogictest/test_files/aggregate.slt

+
+# The column name referenced by HAVING is ambiguous
+query I
+SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;


The HAVING expression is generated against the same schema as GROUP BY,
so the changes in this PR also make the result of this statement consistent with PostgreSQL.

alamb

Thank you @jonahgao -- this change makes sense to me

I believe we have seen issues related to this in InfluxDB as well.

I verified that this PR makes DataFusion consistent with PostgreSQL:

postgres=# CREATE TABLE t(a BIGINT);
CREATE TABLE
postgres=# INSERT INTO t  VALUES (1), (2), (3);
INSERT 0 3
postgres=# SELECT 0 as "t.a" FROM t GROUP BY t.a;
 t.a
-----
   0
   0
   0
(3 rows)

postgres=# SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
 t.a
-----
(0 rows)

postgres=#

I also verified that on main, DataFusion takes the values from the select list:

andrewlamb@Andrews-MacBook-Pro:~/Software/arrow-datafusion$ datafusion-cli
DataFusion CLI v35.0.0
❯ CREATE TABLE t(a BIGINT) AS VALUES(1), (2), (3);
0 rows in set. Query took 0.011 seconds.

❯ SELECT 0 as "t.a" FROM t GROUP BY t.a;
+-----+
| t.a |
+-----+
| 0   |
+-----+
1 row in set. Query took 0.007 seconds.

❯ SELECT 0 AS "t.a" FROM t HAVING MAX(t.a) = 0;
+-----+
| t.a |
+-----+
| 0   |
+-----+
1 row in set. Query took 0.004 seconds.

jonahgao · 2024-02-15T14:56:06Z

Thank you @alamb for reviewing and for the suggestions.

alamb · 2024-02-16T11:25:16Z

Thanks again @jonahgao

GROUP-BY prioritizes input columns in case of ambiguity

e904217

github-actions bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Feb 14, 2024

alamb approved these changes Feb 15, 2024

jonahgao and others added 2 commits February 15, 2024 22:52

Update datafusion/sqllogictest/test_files/aggregate.slt

ce5d2a9

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

Update datafusion/sqllogictest/test_files/aggregate.slt

8b3c353

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>

alamb merged commit 40353fe into apache:main Feb 16, 2024
23 checks passed

jonahgao deleted the group-by branch February 17, 2024 04:20

jonahgao mentioned this pull request Feb 19, 2024

Add test to verify issue #9161 #9265

Merged