Skip to content

Conversation

@dilipbiswal
Copy link
Contributor

@dilipbiswal dilipbiswal commented Jul 26, 2018

What changes were proposed in this pull request?

Implements INTERSECT ALL clause through query rewrites using existing operators in Spark. Please refer to Link for the design.

Input Query

SELECT c1 FROM ut1 INTERSECT ALL SELECT c1 FROM ut2

Rewritten Query

   SELECT c1
    FROM (
         SELECT replicate_row(min_count, c1)
         FROM (
              SELECT c1,
                     IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt) AS min_count
              FROM (
                   SELECT   c1, count(vcol1) as vcol1_cnt, count(vcol2) as vcol2_cnt
                   FROM (
                        SELECT c1, true as vcol1, null as vcol2 FROM ut1
                        UNION ALL
                        SELECT c1, null as vcol1, true as vcol2 FROM ut2
                        ) AS union_all
                   GROUP BY c1
                   HAVING vcol1_cnt >= 1 AND vcol2_cnt >= 1
                  )
              )
          )

How was this patch tested?

Added test cases in SQLQueryTestSuite, DataFrameSuite, SetOperationSuite

@viirya
Copy link
Member

viirya commented Jul 26, 2018

Typo in the PR description: IF (vcol1_cnt > vcol1_cnt, vcol2_cnt, vcol1_cnt) -> IF (vcol1_cnt > vcol2_cnt, vcol2_cnt, vcol1_cnt).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo here vcol1_cnt > vcol1_cnt -> vcol1_cnt > vcol2_cnt.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to have vcol1_cnt and vcol2_cnt here? I think above replicate_row only takes min_count input.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Thanks !! No we don't. In the actual code, we don't project these columns out. I will fix the doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to add resolves columns by position (not by name)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya OK

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: bothe.

frame -> dataframe.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya Will fix.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the implementation, I think this should be:

SELECT true as vcol1, null as vcol2, c1 FROM ut1
UNION ALL
SELECT null as vcol1, true as vcol2, c1 FROM ut2

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya OK :-)

@SparkQA
Copy link

SparkQA commented Jul 26, 2018

Test build #93611 has finished for PR 21886 at commit 1039e47.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Intersect(

@dilipbiswal
Copy link
Contributor Author

@gatorsmile I see this failure in other PRs as well. Is this introduced by some recent changes ?

@viirya
Copy link
Member

viirya commented Jul 26, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jul 26, 2018

Test build #93617 has finished for PR 21886 at commit 6392469.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93629 has finished for PR 21886 at commit 6392469.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93642 has finished for PR 21886 at commit 7268736.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93651 has finished for PR 21886 at commit 7268736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal dilipbiswal force-pushed the dkb_intersect_all_final branch from 7268736 to bfe7030 Compare July 27, 2018 16:34
@SparkQA
Copy link

SparkQA commented Jul 27, 2018

Test build #93677 has finished for PR 21886 at commit bfe7030.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @dilipbiswal Could you resolve the conflicts? I will start the review after the rebase.

@dilipbiswal dilipbiswal force-pushed the dkb_intersect_all_final branch from bfe7030 to 67b15ee Compare July 28, 2018 01:06
@dilipbiswal
Copy link
Contributor Author

@gatorsmile Rebased.

@SparkQA
Copy link

SparkQA commented Jul 28, 2018

Test build #93708 has finished for PR 21886 at commit 67b15ee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 2.4.0
*/
def intersectAll(other: Dataset[T]): Dataset[T] = withSetOperator {
Intersect(planWithBarrier, other.planWithBarrier, isAll = true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you use logicalPlan?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Sure.. how about exceptAll that was checked in today ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. Please do it too.

-- !query 8
SELECT c1 FROM tab1
INTERSECT ALL
SELECT c1, c2 FROM tab2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use k and v


-- !query 1
CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
(1, 2),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add another duplicate rows for (1, 2);

CREATE TEMPORARY VIEW tab1 AS SELECT * FROM VALUES
(1, 2),
(1, 2),
(1, 3),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also add another duplicate row (1, 3)

-- !query 1
CREATE TEMPORARY VIEW tab2 AS SELECT * FROM VALUES
(1, 2),
(2, 3)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add one more row (3, 4)

@gatorsmile
Copy link
Member

The code looks good to me. Let us improve the test cases.

@dilipbiswal
Copy link
Contributor Author

@gatorsmile Thank you.. I will make the changes.

@SparkQA
Copy link

SparkQA commented Jul 28, 2018

Test build #93725 has finished for PR 21886 at commit 8ba4b71.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

1 2
2 3
NULL NULL
NULL NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This misses one row (1, 3). Could you investigate the cause?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile Thank you.. I just went over my notes. The reason for the difference in output is because in Spark we give the same precedence to to all the set operators. The operators are basically evaluated in the order they appear in the query from left to right. But per standard, INTERSECT should have higher precedence over UNION and EXCEPT. We do have this problem in our current support of EXCEPT (DISTINCT) and INTERSECT (DISTINCT). I am fixing the test now to add parenthesize around the query block to force certain order of evaluation. I have opened https://issues.apache.org/jira/browse/SPARK-24966 to work in fixing the precedence in our grammer.

@gatorsmile
Copy link
Member

LGTM pending Jenkins

case class Intersect(
left: LogicalPlan,
right: LogicalPlan,
isAll: Boolean = false) extends SetOperation(left, right) {
Copy link
Member

@HyukjinKwon HyukjinKwon Jul 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not a big deal at all but this has three spaces ..

"logical intersect operator should have been replaced by semi-join in the optimizer")
case logical.Intersect(left, right, true) =>
throw new IllegalStateException(
"logical intersect operator should have been replaced by union, aggregate" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: looks we need a space for aggregate" -> aggregate "

@SparkQA
Copy link

SparkQA commented Jul 30, 2018

Test build #93760 has finished for PR 21886 at commit 5d5461a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

@dilipbiswal Please address the style issues in your other PRs.

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@gatorsmile
Copy link
Member

@dilipbiswal The merged PR does not pick up your last commit.

@dilipbiswal
Copy link
Contributor Author

dilipbiswal commented Jul 30, 2018

@gatorsmile Ok Sean.. I will correct in next PR. Thank you very very much.

@asfgit asfgit closed this in 65a4bc1 Jul 30, 2018
@SparkQA
Copy link

SparkQA commented Jul 30, 2018

Test build #93769 has finished for PR 21886 at commit 89d03af.

  • This patch fails due to an unknown error code, -9.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants