[SPARK-4322][SQL] Enables struct fields as sub expressions of grouping fields #3248
Test build #23314 has started for PR 3248 at commit
Test build #23314 has finished for PR 3248 at commit
Test FAILed.
Test build #23365 has started for PR 3248 at commit
Only aliases around |
Test build #23366 has started for PR 3248 at commit
Test build #23365 has finished for PR 3248 at commit
Test PASSed.
Test build #23366 has finished for PR 3248 at commit
Test PASSed.
!isValidAggregateExpression(e.transform {
// Should trim aliases around `GetField`s. These aliases are introduced while
// resolving struct field accesses, because `GetField` is not a `NamedExpression`.
// (Should we just turn `GetField` into a `NamedExpression`?)
An earlier version of catalyst actually did something similar to having GetField be a named expression. However, this complicated a lot of things. In particular, it made recursive schemas impossible, and it is a little tricky to have non-leaf nodes that have to have expression ids.
That said, I'm not opposed to this idea in general, but I think it might be a fair amount of work to realize. We should possibly evaluate a few options, as the current situation also seems less than ideal.
I'd propose we merge this now to fix the bug and investigate after 1.2.
Thanks for working on this! I went ahead and merged to master and 1.2, but let's revisit the idea of what to do about |
…g fields

While resolving struct fields, the resulting `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for the following SQL query:

```sql
SELECT a.b + 1 FROM t GROUP BY a.b + 1
```

the grouping expression is

```scala
Add(GetField(a, "b"), Literal(1, IntegerType))
```

while the aggregation expression is

```scala
Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))
```

This mismatch makes the above SQL query fail during both the analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions.

Author: Cheng Lian <lian@databricks.com>

Closes #3248 from liancheng/spark-4322 and squashes the following commits:

23a46ea [Cheng Lian] Code simplification
dd20a79 [Cheng Lian] Should only trim aliases around `GetField`s
7f46532 [Cheng Lian] Enables struct fields as sub expressions of grouping fields

(cherry picked from commit 0c7b66b)
Signed-off-by: Michael Armbrust <michael@databricks.com>
…42.7.4 and `mssql` to 12.8.1.jre11

### What changes were proposed in this pull request?

This PR aims to upgrade `h2` to 2.3.232, `postgresql` to 42.7.4 and `mssql` to 12.8.1.jre11.

### Why are the changes needed?

1. For `h2`, there are some issues fixed in version 2.3.232 (full release notes: https://www.h2database.com/html/changelog.html):
   - [Issue #3945](h2database/h2database#3945): Column not found in correlated subquery, when referencing outer column from LEFT JOIN .. ON clause
   - [Issue #4097](h2database/h2database#4097): StackOverflowException when using multiple SELECT statements in one query (2.3.230)
   - [Issue #3982](h2database/h2database#3982): Potential issue when using ROUND
   - [Issue #3894](h2database/h2database#3894): Race condition causing stale data in query last result cache
   - [Issue #4075](h2database/h2database#4075): infinite loop in compact
   - [Issue #4091](h2database/h2database#4091): Wrong case with linked table to postgresql
   - [Issue #4088](h2database/h2database#4088): BadGrammarException when the same alias is used within two different CTEs
2. For `postgresql`, there are some issues fixed and improvements in version 42.7.4 (full release notes: https://jdbc.postgresql.org/changelogs/2024-08-22-42.7.4-release/):
   - fix: PgInterval ignores case for represented interval string [PR #3344](pgjdbc/pgjdbc#3344)
   - perf: Avoid extra copies when receiving int4 and int2 in PGStream [PR #3295](pgjdbc/pgjdbc#3295)
   - fix: Add support for Infinity::numeric values in ResultSet.getObject [PR #3304](pgjdbc/pgjdbc#3304)
   - fix: Ensure order of results for getDouble [PR #3301](pgjdbc/pgjdbc#3301)
   - perf: Replace BufferedOutputStream with unsynchronized PgBufferedOutputStream, allow configuring different Java and SO_SNDBUF buffer sizes [PR #3248](pgjdbc/pgjdbc#3248)
   - fix: Fix SSL tests [PR #3260](pgjdbc/pgjdbc#3260)
   - fix: Support bytea in preferQueryMode=simple [PR #3243](pgjdbc/pgjdbc#3243)
   - fix: Fix [Issue #3234](pgjdbc/pgjdbc#3234) - Return -1 as update count for stored procedure calls [PR #3235](pgjdbc/pgjdbc#3235)
   - fix: Fix [Issue #3224](pgjdbc/pgjdbc#3224) - conversion for TIME ‘24:00’ to LocalTime breaks in binary-mode [PR #3225](pgjdbc/pgjdbc#3225)
3. For `mssql`, there are some issues fixed in 12.8.1.jre11 (full release notes: https://github.com/microsoft/mssql-jdbc/releases/tag/v12.8.1):
   - Adjusted DESTINATION_COL_METADATA_LOCK, in SQLServerBulkCopy, so that is properly released in all cases [PR #2492](microsoft/mssql-jdbc#2492)
   - Reverted "Execute Stored Procedures Directly" feature, as well as subsequent changes related to the feature [PR #2493](microsoft/mssql-jdbc#2493)
   - Changed driver behavior to allow prepared statement objects to be reused, preventing a "multiple queries are not allowed" error [PR #2494](microsoft/mssql-jdbc#2494)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #47810 from wayneguow/ug_h2.

Authored-by: Wei Guo <guow93@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
While resolving struct fields, the resulting `GetField` expression is wrapped with an `Alias` to make it a named expression. Assume `a` is a struct instance with a field `b`, then `"a.b"` will be resolved as `Alias(GetField(a, "b"), "b")`. Thus, for the SQL query `SELECT a.b + 1 FROM t GROUP BY a.b + 1`, the grouping expression is `Add(GetField(a, "b"), Literal(1, IntegerType))` while the aggregation expression is `Add(Alias(GetField(a, "b"), "b"), Literal(1, IntegerType))`.
This mismatch makes the above SQL query fail during both the analysis and execution phases. This PR fixes this issue by removing the alias when substituting aggregation expressions.
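To make the mismatch concrete, here is a minimal, self-contained sketch of the idea. It is not Catalyst code: the `Expr` hierarchy, the `transform` helper, and `trimGetFieldAliases` are hypothetical stand-ins written for illustration, and `Literal` carries only an `Int` instead of a value plus a data type. It shows that the aliased and unaliased trees compare as different, and that stripping aliases sitting directly around a `GetField` makes them equal again.

```scala
// Toy expression tree. These case classes are hypothetical stand-ins,
// NOT the real Catalyst classes of the same names.
sealed trait Expr {
  // Bottom-up rewrite: rebuild children first, then apply the rule
  // to the rebuilt node if the rule is defined for it.
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val rebuilt: Expr = this match {
      case Add(l, r)      => Add(l.transform(rule), r.transform(rule))
      case Alias(c, n)    => Alias(c.transform(rule), n)
      case GetField(c, f) => GetField(c.transform(rule), f)
      case leaf           => leaf
    }
    rule.applyOrElse(rebuilt, (e: Expr) => e)
  }
}
case class Attr(name: String) extends Expr
case class Literal(value: Int) extends Expr
case class GetField(child: Expr, field: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object TrimAliasDemo {
  // Only aliases directly around a GetField are removed;
  // any other alias is left untouched.
  def trimGetFieldAliases(e: Expr): Expr = e.transform {
    case Alias(g: GetField, _) => g
  }

  def main(args: Array[String]): Unit = {
    val a = Attr("a")
    // GROUP BY a.b + 1
    val grouping = Add(GetField(a, "b"), Literal(1))
    // SELECT a.b + 1, as it looks after struct field resolution
    val aggregation = Add(Alias(GetField(a, "b"), "b"), Literal(1))

    // Structural equality fails because of the extra Alias node.
    assert(grouping != aggregation)
    // After trimming, the two trees are structurally equal.
    assert(trimGetFieldAliases(aggregation) == grouping)
    println("aliases trimmed; expressions match")
  }
}
```

In this sketch the rule matches only `Alias(g: GetField, _)`, mirroring the "should only trim aliases around `GetField`s" commit: an alias the user wrote around some other expression would survive the rewrite.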