[SPARK-9634][SPARK-9323][SQL] cleanup unnecessary Aliases in LogicalPlan at the end of analysis #7957
|
cc @JoshRosen , this answers your last question at https://issues.apache.org/jira/browse/SPARK-9323 |
|
Test build #39852 has finished for PR 7957 at commit
|
|
Seems master is broken? |
|
Test build #1353 has finished for PR 7957 at commit
|
|
retest this please. |
|
I think this works, but is there a reason to not just strip all |
|
Test build #39861 has finished for PR 7957 at commit
|
|
Test build #232 has finished for PR 7957 at commit
|
|
Hi @marmbrus , we strip |
|
Why not this after analysis?
plan transformAllExpressions {
  case UnresolvedAlias(child) => child
  case Alias(child, name) => Alias(child transform { case Alias(c, _) => c }, name)
} |
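The rule suggested above can be tried out on a toy expression tree. This is a minimal sketch, assuming simplified stand-in types (`Attr`, `Add`, `Alias`, `UnresolvedAlias`) and hypothetical helper names (`stripAliases`, `cleanTopLevel`); it does not use Spark's Catalyst classes or its `transform` API, only the same idea of keeping the outermost `Alias` and dropping the ones beneath it.

```scala
object AliasCleanup {
  // Toy expression tree. These are illustrative stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class Alias(child: Expr, name: String) extends Expr
  case class UnresolvedAlias(child: Expr) extends Expr

  // Remove every Alias / UnresolvedAlias wrapper found anywhere inside `e`.
  def stripAliases(e: Expr): Expr = e match {
    case Alias(c, _)        => stripAliases(c)
    case UnresolvedAlias(c) => stripAliases(c)
    case Add(l, r)          => Add(stripAliases(l), stripAliases(r))
    case leaf: Attr         => leaf
  }

  // Keep a top-level Alias (it names the output column) and clean everything below it.
  def cleanTopLevel(e: Expr): Expr = e match {
    case Alias(c, name) => Alias(stripAliases(c), name)
    case other          => stripAliases(other)
  }
}
```

For example, `cleanTopLevel(Alias(Add(Alias(Attr("a"), "x"), UnresolvedAlias(Attr("b"))), "sum"))` keeps only the outer alias named `sum`.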
|
Test build #40120 has finished for PR 7957 at commit
|
Is this duplicated with ResolveAliases?
Sorry, I made a mistake here. We should just remove the UnresolvedAlias, as we already do in Analyzer.ResolveReferences (we call trimUnresolvedAlias there).
The key problem is that we have two code paths to resolve UnresolvedAttribute: one is Analyzer.ResolveReferences (last case) and the other is DataFrame.resolve. I think we need to make them consistent, i.e. call trimUnresolvedAlias in DataFrame.resolve or abstract this logic for SQL and DataFrame.
|
Thanks for continuing to work on this :) |
|
cc @marmbrus , I think this solution is better :) |
|
Test build #40480 has finished for PR 7957 at commit
|
|
Test build #40487 has finished for PR 7957 at commit
|
|
retest this please. |
|
Test build #40581 has finished for PR 7957 at commit
|
|
Test build #1483 has finished for PR 7957 at commit
|
|
retest this please. |
|
Test build #40634 has finished for PR 7957 at commit
|
test CreateStructUnsafe too?
|
LGTM |
|
Test build #1496 has finished for PR 7957 at commit
|
cc @rxin , in MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling withColumn, and withColumn then aliases this column again. Here I added a new parameter to allow the user to set metadata.
Can you do this in a different PR? Also we should do it without using Option and default arguments so that it works well in Java.
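To illustrate the Java-compatibility point, here is a minimal sketch using two plain overloads instead of an `Option` parameter with a default value. `Frame`, `Column`, `Metadata`, and `EmptyMetadata` are hypothetical stand-ins invented for this example; they are not the DataFrame API, and the real `withColumn` signature differs.

```scala
object WithColumnOverloads {
  // Hypothetical stand-ins; only the shape of the method signatures matters here.
  final case class Metadata(entries: Map[String, String])
  final case class Column(name: String, metadata: Metadata)

  val EmptyMetadata: Metadata = Metadata(Map.empty)

  final class Frame(val columns: Vector[Column]) {
    // Overload 1: no metadata. Java callers simply call withColumn("a").
    def withColumn(name: String): Frame = withColumn(name, EmptyMetadata)

    // Overload 2: explicit metadata. No scala.Option and no default argument,
    // so Java callers never have to build an Option or call a synthetic default getter.
    def withColumn(name: String, metadata: Metadata): Frame =
      new Frame(columns.filterNot(_.name == name) :+ Column(name, metadata))
  }
}
```

Both overloads are ordinary methods from Java's point of view, which is the property the review comment asks for.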
|
Test build #40726 has finished for PR 7957 at commit
|
|
Test build #40761 has finished for PR 7957 at commit
|
|
retest this please. |
|
Test build #40764 has finished for PR 7957 at commit
|
Can you remove everything except for this and the related tests? I'd like to pull this into the release branch without new features.
I have opened #8159 to improve withColumn, but left the code here to see if we can pass the tests.
This PR did 2 things:
- use `Alias` instead of `UnresolvedAlias` when resolving nested columns in `LogicalPlan.resolve`
- clean unnecessary aliases at the end of analysis
If we only do 1, some tests will fail, as we need to trim aliases in the middle of a getField chain. If we only do 2, it can't fix any bugs. So I put them together here.
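Why the two changes depend on each other can be seen on a toy model. This is a sketch under stated assumptions: `Attr`, `GetField`, and `Alias` are simplified stand-ins, and `resolveNested`, `trimAll`, and `trimNonTopLevel` are hypothetical helpers, not the code in this PR. Resolving `a.b.c` one field at a time wraps each step in an `Alias`, which leaves an alias in the middle of the chain until the cleanup pass removes it.

```scala
object NestedResolve {
  // Toy expression tree; illustrative stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class GetField(child: Expr, field: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr

  // Change 1 (sketch): a resolved nested column is wrapped in an Alias named after the field.
  def resolveNested(base: Expr, field: String): Expr =
    Alias(GetField(base, field), field)

  // Remove every Alias inside `e`.
  def trimAll(e: Expr): Expr = e match {
    case Alias(c, _)    => trimAll(c)
    case GetField(c, f) => GetField(trimAll(c), f)
    case a: Attr        => a
  }

  // Change 2 (sketch): keep only the outermost Alias, which carries the output name.
  def trimNonTopLevel(e: Expr): Expr = e match {
    case Alias(c, name) => Alias(trimAll(c), name)
    case other          => trimAll(other)
  }
}
```

Here `resolveNested(resolveNested(Attr("a"), "b"), "c")` contains an inner alias `b` between the two `GetField` nodes, and `trimNonTopLevel` removes it while keeping the outer alias `c`.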
I've opened #8215, which is basically your patch without the mllib changes.
Not for this release, but this makes me think that we are abusing aliases. I would rather that resolved expressions past the analyzer move the names out of the subexpressions and into the CreateStruct expression itself.
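One way to picture that suggestion is a struct node that stores its field names directly instead of reading them from child aliases. The sketch below is hypothetical: the toy `CreateStruct` and `CreateNamedStruct` types and the `hoistNames` helper are invented for illustration, and the `colN` fallback name for unnamed children is an assumption.

```scala
object NamedStructSketch {
  // Toy expression tree; illustrative stand-ins, not Catalyst classes.
  sealed trait Expr
  case class Attr(name: String) extends Expr
  case class Alias(child: Expr, name: String) extends Expr
  // Field names are implied by aliases on the children.
  case class CreateStruct(children: Seq[Expr]) extends Expr
  // Field names are stored on the struct expression itself.
  case class CreateNamedStruct(fields: Seq[(String, Expr)]) extends Expr

  // Move names out of the subexpressions and into the struct node.
  def hoistNames(s: CreateStruct): CreateNamedStruct =
    CreateNamedStruct(s.children.zipWithIndex.map {
      case (Alias(child, name), _) => name -> child
      case (a @ Attr(name), _)     => name -> a
      case (other, i)              => s"col${i + 1}" -> other // fallback name is an assumption
    })
}
```

With this shape, no `Alias` has to survive inside the struct after analysis, because the names no longer depend on it.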
|
closing in favor of #8215 |
…lPlan at the end of analysis Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolve attribute in LogicalPlan, as this alias will be trimmed if it's unnecessary. Based on #7957 without the changes to mllib, but instead maintaining earlier behavior when using `withColumn` on expressions that already have metadata. Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #8215 from marmbrus/pr/7957. (cherry picked from commit ec29f20) Signed-off-by: Reynold Xin <rxin@databricks.com>
Also alias the ExtractValue instead of wrapping it with UnresolvedAlias when resolving attributes in LogicalPlan, as this alias will be trimmed if it's unnecessary.
This also fixes https://issues.apache.org/jira/browse/SPARK-9323, as DataFrame.resolve won't return an unresolved expression now.