[SPARK-21274][SQL] Implement INTERSECT ALL clause #21886
|
Typo in the PR description: |
typo here vcol1_cnt > vcol1_cnt -> vcol1_cnt > vcol2_cnt.
Do we need to have vcol1_cnt and vcol2_cnt here? I think the replicate_row above only takes min_count as input.
@viirya Thanks !! No we don't. In the actual code, we don't project these columns out. I will fix the doc.
Better to add "resolves columns by position (not by name)"?
@viirya OK
python/pyspark/sql/dataframe.py
Outdated
typo: bothe.
frame -> dataframe.
@viirya Will fix.
Based on the implementation, I think this should be:
SELECT true as vcol1, null as vcol2, c1 FROM ut1
UNION ALL
SELECT null as vcol1, true as vcol2, c1 FROM ut2
@viirya OK :-)
|
Test build #93611 has finished for PR 21886 at commit
|
|
@gatorsmile I see this failure in other PRs as well. Is this introduced by some recent changes ? |
|
retest this please. |
|
Test build #93617 has finished for PR 21886 at commit
|
|
Test build #93629 has finished for PR 21886 at commit
|
|
Test build #93642 has finished for PR 21886 at commit
|
|
retest this please |
|
Test build #93651 has finished for PR 21886 at commit
|
7268736 to
bfe7030
Compare
|
Test build #93677 has finished for PR 21886 at commit
|
|
cc @dilipbiswal Could you resolve the conflicts? I will start the review after the rebase. |
bfe7030 to
67b15ee
Compare
|
@gatorsmile Rebased. |
|
Test build #93708 has finished for PR 21886 at commit
|
| * @since 2.4.0 | ||
| */ | ||
| def intersectAll(other: Dataset[T]): Dataset[T] = withSetOperator { | ||
| Intersect(planWithBarrier, other.planWithBarrier, isAll = true) |
could you use logicalPlan?
@gatorsmile Sure.. how about exceptAll that was checked in today ?
yes. Please do it too.
| -- !query 8 | ||
| SELECT c1 FROM tab1 | ||
| INTERSECT ALL | ||
| SELECT c1, c2 FROM tab2 |
use k and v
|
|
||
| -- !query 1 | ||
| CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES | ||
| (1, 2), |
Also add another duplicate row for (1, 2).
| CREATE TEMPORARY VIEW tab1 AS SELECT * FROM VALUES | ||
| (1, 2), | ||
| (1, 2), | ||
| (1, 3), |
also add another duplicate row (1, 3)
| -- !query 1 | ||
| CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES | ||
| (1, 2), | ||
| (2, 3) |
add one more row (3, 4)
|
The code looks good to me. Let us improve the test cases. |
|
@gatorsmile Thank you.. I will make the changes. |
|
Test build #93725 has finished for PR 21886 at commit
|
| 1 2 | ||
| 2 3 | ||
| NULL NULL | ||
| NULL NULL |
This misses one row (1, 3). Could you investigate the cause?
@gatorsmile Thank you. I just went over my notes. The output differs because Spark gives the same precedence to all the set operators: they are evaluated in the order they appear in the query, from left to right. Per the standard, however, INTERSECT should have higher precedence than UNION and EXCEPT. We have the same problem in our current support of EXCEPT (DISTINCT) and INTERSECT (DISTINCT). I am fixing the test now by adding parentheses around the query block to force a certain order of evaluation. I have opened https://issues.apache.org/jira/browse/SPARK-24966 to work on fixing the precedence in our grammar.
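To make the precedence point concrete, here is a small sketch in plain Python (not Spark) that models tables as multisets with `collections.Counter`. The tables `t1`, `t2`, `t3` and their rows are hypothetical and are not the tables from the test file; the query being modeled is `t1 UNION ALL t2 INTERSECT ALL t3`.

```python
from collections import Counter


def union_all(a, b):
    # UNION ALL keeps every occurrence from both sides.
    return a + b


def intersect_all(a, b):
    # INTERSECT ALL keeps each row min(count in a, count in b) times.
    return a & b


# Hypothetical single-row tables, used for illustration only.
t1 = Counter([(1, 3)])
t2 = Counter([(1, 2)])
t3 = Counter([(1, 2)])

# Equal precedence, evaluated left to right: (t1 UNION ALL t2) INTERSECT ALL t3
left_to_right = intersect_all(union_all(t1, t2), t3)

# INTERSECT binds tighter, per the standard: t1 UNION ALL (t2 INTERSECT ALL t3)
standard = union_all(t1, intersect_all(t2, t3))

print(sorted(left_to_right.elements()))  # [(1, 2)]
print(sorted(standard.elements()))       # [(1, 2), (1, 3)]
```

In this toy case the left-to-right evaluation drops the row (1, 3), which is the same kind of discrepancy raised in the review comment above. Adding parentheses to the query pins down one of the two groupings explicitly.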
|
LGTM pending Jenkins |
| case class Intersect( | ||
| left: LogicalPlan, | ||
| right: LogicalPlan, | ||
| isAll: Boolean = false) extends SetOperation(left, right) { |
not a big deal at all but this has three spaces ..
| "logical intersect operator should have been replaced by semi-join in the optimizer") | ||
| case logical.Intersect(left, right, true) => | ||
| throw new IllegalStateException( | ||
| "logical intersect operator should have been replaced by union, aggregate" + |
nit: looks like we need a space here: aggregate" -> aggregate "
|
Test build #93760 has finished for PR 21886 at commit
|
|
@dilipbiswal Please address the style issues in your other PRs. |
|
Thanks! Merged to master. |
|
@dilipbiswal The merged PR does not pick up your last commit. |
|
@gatorsmile Ok Sean.. I will correct in next PR. Thank you very very much. |
|
Test build #93769 has finished for PR 21886 at commit
|
What changes were proposed in this pull request?
Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to Link for the design.
Input Query
Rewritten Query
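The query text for these two headings did not survive on this page. As a rough illustration only, the steps mentioned in the review comments (tagging each side with `vcol1` / `vcol2` in a UNION ALL, counting them as `vcol1_cnt` / `vcol2_cnt`, taking `min_count`, and replicating each row that many times) can be simulated in plain Python. This is a sketch of the idea, not the Spark implementation; the function name `intersect_all` and the sample rows are assumptions.

```python
from collections import defaultdict


def intersect_all(left, right):
    """Simulate INTERSECT ALL over two lists of row tuples (compared by position)."""
    # Step 1: UNION ALL of both inputs, tagged with marker columns (vcol1, vcol2).
    tagged = [(True, None, row) for row in left]
    tagged += [(None, True, row) for row in right]

    # Step 2: GROUP BY row, counting the non-null markers from each side.
    counts = defaultdict(lambda: [0, 0])  # row -> [vcol1_cnt, vcol2_cnt]
    for vcol1, vcol2, row in tagged:
        if vcol1 is not None:
            counts[row][0] += 1
        if vcol2 is not None:
            counts[row][1] += 1

    # Step 3: min_count is the smaller of the two counts.
    # Step 4: emit each row min_count times (stand-in for replicate_row).
    # A row missing from one side has min_count 0 and is emitted zero times.
    result = []
    for row, (vcol1_cnt, vcol2_cnt) in counts.items():
        min_count = vcol2_cnt if vcol1_cnt > vcol2_cnt else vcol1_cnt
        result.extend([row] * min_count)
    return result


# Hypothetical sample data, for illustration only.
tab1 = [(1, 2), (1, 2), (1, 3)]
tab2 = [(1, 2), (1, 2), (1, 2), (2, 3)]
print(sorted(intersect_all(tab1, tab2)))  # [(1, 2), (1, 2)]
```

Here (1, 2) appears twice on the left and three times on the right, so it is returned twice; (1, 3) and (2, 3) each appear on only one side and are dropped.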
How was this patch tested?
Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite