-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-14191][SQL] Remove invalid Expand operator constraints #11995
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #54301 has finished for PR 11995 at commit
|
| child: LogicalPlan) extends UnaryNode { | ||
|
|
||
| child: LogicalPlan, | ||
| constraintsBase: Seq[Expression]) extends UnaryNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This parameter is not documented.
Also, this operator is kind of violating the style of logical operators. The goal here is for query nodes to be correct by construction (i.e. figure out things like constraints by inspecting themselves and their children).
It would be great if we could move constraint pruning logic into Expand.
|
Test build #54495 has finished for PR 11995 at commit
|
|
@marmbrus I've updated this and tests passed. Please take a look. Thanks! |
| child: LogicalPlan) extends UnaryNode { | ||
|
|
||
| child: LogicalPlan, | ||
| groupByAttrs: Seq[Attribute]) extends UnaryNode { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This still feels kind of hacky to me and I think it comes down to the fact that this operator is poorly designed. Since we use this operator for more than just grouping, it seems kind of odd to add a new parameter called groupByAttrs to it. Also, we still have the property that its not always going to be correct by construction. If a user of this class incorrectly sets the groupByAttrs everything will still work, but it will just lie about its constraints.
I think the root problem is that projections and output are defined separately in the constructor. Everywhere else in logical plans, you either have an AttributeReference or, if you are producing a new value, you have an Alias. When you follow this pattern, the constraint system just works.
However, in Expand we have logic that decides to replace a column with null (which should be a new AttributeReference), but instead we impersonate the original value.
Until we come up with a principled solution, maybe we should just set validConstraints to be empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing the problem out. Agreed after re-thinking about it. As separating projections and output causes the problem. How about we get the output from projections?
As there are more than one projection, we can just get the output from the first projection and verify its consistency with other projections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I do think it would be good if it just took a Seq[Seq[NamedExpression]] (or at least I can't come up with anything better). I'd still consider breaking this into two PRs. Simple fix for now that just removes invalid constraints and a refactoring that add back in valid ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let me remove the constraints first.
|
Test build #54601 has finished for PR 11995 at commit
|
|
ping @marmbrus |
| child = a.copy(aggregateExpressions = a.aggregateExpressions.filter(p.references.contains))) | ||
| case a @ Project(_, e @ Expand(_, _, grandChild)) if (e.outputSet -- a.references).nonEmpty => | ||
| case a @ Project(_, e @ Expand(_, _, grandChild)) | ||
| if (e.outputSet -- a.references).nonEmpty => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is best to avoid spurious changes because it pollutes git blame. I can revert this while merging this time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Got it.
|
Thanks, merging to master. |
| Statistics(sizeInBytes = sizeInBytes) | ||
| } | ||
|
|
||
| override protected def validConstraints: Set[Expression] = Set.empty[Expression] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also going to add a comment here to explain why this is empty.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-14191
Expandoperator now uses its child plan's constraints as its valid constraints (i.e., the base of constraints). This is not correct becauseExpandwill set its group by attributes to null values. So the nullability of these attributes should be true.E.g., for an
Expandoperator like:The
Projectoperator has the constraintsIsNotNull('a),IsNotNull('b)andIsNotNull('c). But theExpandshould not haveIsNotNull('a)in its constraints.This PR is the first step for this issue and remove invalid constraints of
Expandoperator.How was this patch tested?
A test is added to
ConstraintPropagationSuite.