[SPARK-41162][SQL][3.1] Fix anti- and semi-join for self-join with aggregations #39411
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Backport #39131 to branch-3.1.
Rule
PushDownLeftSemiAntiJoinshould not push an anti-join below anAggregatewhen the join condition references an attribute that exists in its right plan and its left plan's child. This usually happens when the anti-join / semi-join is a self-join whileDeduplicateRelationscannot deduplicate those attributes (in this example due to the projection ofvaluetoid).This behaviour already exists for
ProjectandUnion, butAggregatelacks this safety guard.Why are the changes needed?
Without this change, the optimizer creates an incorrect plan.
This example fails with
distinct()(an aggregation), and succeeds withoutdistinct(), but both queries are identical:With
distinct(), rulePushDownLeftSemiAntiJoincreates a join condition(value#907 + 1) = value#907, which can never be true. This effectively removes the anti-join.Before this PR:
The anti-join is fully removed from the plan.
This is caused by
PushDownLeftSemiAntiJoinadding join condition(value#907 + 1) = value#907, which is wrong as becauseid#910in(id#910 + 1) AS id#912exists in the right child of the join as well as in the left grandchild:The right child of the join and in the left grandchild would become the children of the pushed-down join, which creates an invalid join condition.
After this PR:
Join condition
(id#910 + 1) AS id#912is understood to become ambiguous as both sides of the prospect join containid#910. Hence, the join is not pushed down. The rule is then not applied any more.The final plan contains the anti-join:
Does this PR introduce any user-facing change?
It fixes correctness.
How was this patch tested?
Unit tests in
DataFrameJoinSuiteandLeftSemiAntiJoinPushDownSuite.