[SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias #32239

cloud-fan · 2021-04-19T16:42:11Z

What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1 to fix a backward compatibility issue caused by #28490.

The query below has different output schemas in 3.0 and 3.1:

sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))

In 3.0 the output column name is col1; in 3.1 it is s.col1. This breaks existing queries.

In #28490, we changed the logic for resolving aggregate expressions. The input nested column s.col1 becomes UnresolvedAlias(s.col1, None). In ResolveReferences, the logic used to resolve s.col1 directly to s.col1 AS col1, but after #28490 we enter the code path with trimAlias = true and !isTopLevel, so the alias is removed, leaving s.col1, which ResolveAliases then resolves as s.col1 AS s.col1.

#31758 happens to fix this issue because we no longer wrap UnresolvedAttribute with UnresolvedAlias in RelationalGroupedDataset.
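
The three behaviors described above can be sketched with a toy model. This is not Spark's analyzer code: the case classes, `resolve`, `assignAlias`, and `outputName` are hypothetical names invented for illustration, and the pre-#28490 behavior is approximated by simply passing `trimAlias = false`. It only mirrors the naming outcome described in this PR.

```scala
object AliasModel {
  // Toy expression tree; these types are illustrative and are not Spark's classes.
  sealed trait Expr
  case class UnresolvedAttr(nameParts: Seq[String]) extends Expr
  case class UnresolvedAlias(child: Expr) extends Expr
  case class GetField(struct: String, field: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // Stand-in for reference resolution: a nested reference such as s.col1 becomes a
  // field extraction aliased by the field name. With trimAlias set, that alias is
  // kept only when the reference is the top-level expression.
  def resolve(e: Expr, trimAlias: Boolean, isTopLevel: Boolean = true): Expr = e match {
    case UnresolvedAttr(Seq(struct, field)) =>
      val extracted = GetField(struct, field)
      if (trimAlias && !isTopLevel) extracted else Alias(extracted, field)
    case UnresolvedAlias(child) =>
      // Anything under the wrapper is no longer top-level.
      UnresolvedAlias(resolve(child, trimAlias, isTopLevel = false))
    case other => other
  }

  // Stand-in for alias assignment: an already-named child is unwrapped as-is,
  // while an unnamed field extraction gets a name generated from its full path.
  def assignAlias(e: Expr): Expr = e match {
    case UnresolvedAlias(a: Alias) => a
    case UnresolvedAlias(g @ GetField(s, f)) => Alias(g, s"$s.$f")
    case other => other
  }

  // Output column name after both steps.
  def outputName(e: Expr, trimAlias: Boolean): String =
    assignAlias(resolve(e, trimAlias)) match {
      case Alias(_, name) => name
      case other => other.toString
    }
}
```

In this model, the wrapped attribute with `trimAlias = false` yields `col1` (the 3.0 behavior), the wrapped attribute with `trimAlias = true` yields `s.col1` (the 3.1 regression), and the unwrapped attribute with `trimAlias = true` yields `col1` again, which corresponds to no longer creating the UnresolvedAlias wrapper.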

Why are the changes needed?

Fix an unexpected change in the query output schema.

Does this PR introduce any user-facing change?

Yes, as explained above.

How was this patch tested?

Updated test.

SparkQA · 2021-04-19T17:57:07Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42170/

SparkQA · 2021-04-19T17:57:08Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42170/

SparkQA · 2021-04-19T19:36:36Z

Test build #137640 has finished for PR 32239 at commit 55d95d7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-20T10:53:22Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42209/

SparkQA · 2021-04-20T10:53:24Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42209/

SparkQA · 2021-04-20T14:00:05Z

Test build #137681 has finished for PR 32239 at commit 63cac4d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2021-04-20T14:35:41Z

cc @AngersZhuuuu @maropu @viirya

maropu · 2021-04-21T00:13:12Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

          Row("C", null, 3) :: Row("C", "{\"i\": 1}", 3) :: Nil)
+
+      assert(spark.table("t").groupBy($"c.json_string").count().schema.fieldNames ===
+        Seq("json_string", "count"))


nit: It's safer for branch-3.0 to have this test, too, I think.

…ate UnresolvedAlias

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>

maropu · 2021-04-21T00:14:38Z

Merged to branch-3.1. Thank you.

viirya

lgtm too


github-actions bot added the SQL label Apr 19, 2021

cloud-fan force-pushed the bug branch from e751425 to 55d95d7 Compare April 19, 2021 16:58

Always remove unnecessary Alias in Analyzer.resolveExpressionTopDown

63cac4d

cloud-fan force-pushed the bug branch from 55d95d7 to 63cac4d Compare April 20, 2021 09:27

cloud-fan changed the title ~~[SPARK-34639][SQL][3.1] Always remove unnecessary Alias in Analyzer.resolveExpressionTopDown~~ [SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias Apr 20, 2021

maropu reviewed Apr 21, 2021


maropu approved these changes Apr 21, 2021


maropu closed this Apr 21, 2021

viirya reviewed Apr 21, 2021
