Skip to content

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Apr 19, 2021

What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490

The query below has different output schemas in 3.0 and 3.1

sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))

In 3.0 the output column name is col1, in 3.1 it's s.col1. This breaks existing queries.

In #28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column s.col1 will become UnresolvedAlias(s.col1, None). In ResolveReference, the logic used to directly resolve s.col to s.col1 AS col1 but after #28490 we enter the code path with trimAlias = true and !isTopLevel, so the alias is removed and resulting in s.col1, which will then be resolved in ResolveAliases as s.col1 AS s.col1

#31758 happens to fix this issue because we no longer wrap UnresolvedAttribute with UnresolvedAlias in RelationalGroupedDataset.

Why are the changes needed?

Fix an unexpected query output schema change

Does this PR introduce any user-facing change?

Yes as explained above.

How was this patch tested?

updated test

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42170/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42170/

@SparkQA
Copy link

SparkQA commented Apr 19, 2021

Test build #137640 has finished for PR 32239 at commit 55d95d7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan cloud-fan changed the title [SPARK-34639][SQL][3.1] Always remove unnecessary Alias in Analyzer.resolveExpressionTopDown [SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias Apr 20, 2021
@SparkQA
Copy link

SparkQA commented Apr 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42209/

@SparkQA
Copy link

SparkQA commented Apr 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42209/

@SparkQA
Copy link

SparkQA commented Apr 20, 2021

Test build #137681 has finished for PR 32239 at commit 63cac4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor Author

cc @AngersZhuuuu @maropu @viirya

Row("C", null, 3) :: Row("C", "{\"i\": 1}", 3) :: Nil)

assert(spark.table("t").groupBy($"c.json_string").count().schema.fieldNames ===
Seq("json_string", "count"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's safer for branch-3.0 to have this test, too, I think.

maropu pushed a commit that referenced this pull request Apr 21, 2021
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In #28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
@maropu
Copy link
Member

maropu commented Apr 21, 2021

Merged to branch-3.1. Thank you.

@maropu maropu closed this Apr 21, 2021
Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm too

flyrain pushed a commit to flyrain/spark that referenced this pull request Sep 21, 2021
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes apache#32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
fishcus pushed a commit to fishcus/spark that referenced this pull request Jan 12, 2022
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports apache#31758 to 3.1, to fix a backward compatibility issue caused by apache#28490

The query below has different output schemas in 3.0 and 3.1
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```

In 3.0 the output column name is `col1`, in 3.1  it's `s.col1`. This breaks existing queries.

In apache#28490 , we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` will become `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to directly resolve `s.col` to `s.col1 AS col1` but after apache#28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed and resulting in `s.col1`, which will then be resolved in `ResolveAliases` as `s.col1 AS s.col1`

apache#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change

### Does this PR introduce _any_ user-facing change?

Yes as explained above.

### How was this patch tested?

updated test

Closes apache#32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants