[SPARK-34639][SQL][3.1] RelationalGroupedDataset.alias should not create UnresolvedAlias #32239
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137640 has finished for PR 32239 at commit
Kubernetes integration test starting
Kubernetes integration test status failure
Test build #137681 has finished for PR 32239 at commit
      Row("C", null, 3) :: Row("C", "{\"i\": 1}", 3) :: Nil)

    assert(spark.table("t").groupBy($"c.json_string").count().schema.fieldNames ===
      Seq("json_string", "count"))
nit: It's safer for branch-3.0 to have this test, too, I think.
…ate UnresolvedAlias

### What changes were proposed in this pull request?

This PR partially backports #31758 to 3.1, to fix a backward compatibility issue caused by #28490.

The query below has different output schemas in 3.0 and 3.1:
```
sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))
```
In 3.0 the output column name is `col1`; in 3.1 it's `s.col1`. This breaks existing queries.

In #28490, we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` becomes `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to resolve `s.col1` directly to `s.col1 AS col1`, but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed, resulting in `s.col1`, which is then resolved in `ResolveAliases` as `s.col1 AS s.col1`.

#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.

### Why are the changes needed?

Fix an unexpected query output schema change.

### Does this PR introduce _any_ user-facing change?

Yes, as explained above.

### How was this patch tested?

Updated test.

Closes #32239 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
Merged to branch-3.1. Thank you.
viirya left a comment
lgtm too
What changes were proposed in this pull request?
This PR partially backports #31758 to 3.1 to fix a backward compatibility issue caused by #28490.
The query below has different output schemas in 3.0 and 3.1: `sql("select struct(1, 2) as s").groupBy(col("s.col1")).agg(first("s"))`
In 3.0 the output column name is `col1`; in 3.1 it's `s.col1`. This breaks existing queries.
In #28490, we changed the logic of resolving aggregate expressions. What happened is that the input nested column `s.col1` becomes `UnresolvedAlias(s.col1, None)`. In `ResolveReference`, the logic used to resolve `s.col1` directly to `s.col1 AS col1`, but after #28490 we enter the code path with `trimAlias = true and !isTopLevel`, so the alias is removed, resulting in `s.col1`, which is then resolved in `ResolveAliases` as `s.col1 AS s.col1`.
#31758 happens to fix this issue because we no longer wrap `UnresolvedAttribute` with `UnresolvedAlias` in `RelationalGroupedDataset`.
Why are the changes needed?
Fix an unexpected query output schema change.
Does this PR introduce any user-facing change?
Yes, as explained above.
How was this patch tested?
Updated test.
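The naming difference described under "What changes were proposed" can be illustrated with a toy model in plain Scala. This is only a sketch under stated assumptions: the case classes and the two functions borrow names from Catalyst for readability, but they are simplified stand-ins written for this illustration, not Spark's actual classes or analyzer rules. The `isTopLevel` flag stands in for the `trimAlias = true and !isTopLevel` code path mentioned above.

```scala
// Toy expression tree. These are hypothetical stand-ins, NOT Spark's Catalyst classes.
sealed trait Expr
case class UnresolvedAttribute(nameParts: Seq[String]) extends Expr
case class UnresolvedAlias(child: Expr) extends Expr
case class GetStructField(struct: String, field: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr

object AliasModel {
  // Simplified model of reference resolution: a nested column `struct.field`
  // resolves to `struct.field AS field`. When the attribute is not at the top
  // level (here: wrapped in UnresolvedAlias), the alias is trimmed again.
  def resolveReferences(e: Expr, isTopLevel: Boolean = true): Expr = e match {
    case UnresolvedAttribute(Seq(struct, field)) =>
      val resolved = Alias(GetStructField(struct, field), field)
      if (isTopLevel) resolved else resolved.child
    case UnresolvedAlias(child) =>
      UnresolvedAlias(resolveReferences(child, isTopLevel = false))
    case other => other
  }

  // Simplified model of alias resolution: an UnresolvedAlias whose child has
  // no alias gets a name derived from the child's printed form.
  def resolveAliases(e: Expr): Expr = e match {
    case UnresolvedAlias(a: Alias) => a
    case UnresolvedAlias(child)    => Alias(child, prettyName(child))
    case other                     => other
  }

  def prettyName(e: Expr): String = e match {
    case GetStructField(struct, field) => s"$struct.$field"
    case UnresolvedAttribute(parts)    => parts.mkString(".")
    case Alias(_, name)                => name
    case UnresolvedAlias(child)        => prettyName(child)
  }

  // Output column name after both modeled resolution steps.
  def outputName(e: Expr): String =
    prettyName(resolveAliases(resolveReferences(e)))
}
```

In this model, `AliasModel.outputName(UnresolvedAlias(UnresolvedAttribute(Seq("s", "col1"))))` yields `s.col1`, matching the 3.1 behavior described above, while `AliasModel.outputName(UnresolvedAttribute(Seq("s", "col1")))` yields `col1`, matching the 3.0 behavior that the backport restores by no longer wrapping the attribute.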