[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #29291

beliefer · 2020-07-29T14:31:10Z

What changes were proposed in this pull request?

This PR is related to #26656.
#26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR extends that support so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
For example:

select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
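As a rough illustration of what these queries are expected to compute, here is a minimal sketch in plain Scala collections, not Spark code. The `Student` case class and the sample rows are hypothetical. It assumes the usual SQL reading of the clause: the FILTER predicate is applied to the rows first, DISTINCT is then applied to the surviving values, and only then is the aggregate evaluated.

```scala
// Hypothetical row type and data, for illustration only (not taken from the PR).
final case class Student(id: Int, classId: Int, sex: String)

object FilterDistinctSketch {
  val students: Seq[Student] = Seq(
    Student(1, 1, "man"),
    Student(1, 1, "man"),   // duplicate id: contributes once under DISTINCT
    Student(2, 1, "woman"), // dropped by the filter
    Student(3, 2, "man")
  )

  // Models: select sum(distinct id) filter (where sex = 'man') from student
  // Note: SQL SUM over no rows yields NULL, whereas Seq.sum yields 0 here.
  def sumDistinctIdOfMen(rows: Seq[Student]): Int =
    rows.filter(_.sex == "man").map(_.id).distinct.sum

  // Models the same aggregate with "group by class_id".
  def sumDistinctIdOfMenByClass(rows: Seq[Student]): Map[Int, Int] =
    rows.groupBy(_.classId).map { case (classId, group) =>
      classId -> sumDistinctIdOfMen(group)
    }
}
```

With the sample rows above, the ungrouped aggregate sums the ids 1 and 3, and the grouped variant evaluates the same expression once per `classId`.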

Why are the changes needed?

Spark SQL currently supports the FILTER clause only on aggregate expressions without DISTINCT.
This PR allows the FILTER clause to be used together with DISTINCT.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Existing and new unit tests.

SparkQA · 2020-07-29T15:47:43Z

Test build #126778 has finished for PR 29291 at commit 4ba808b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-30T06:33:44Z

Test build #126797 has finished for PR 29291 at commit 145a9dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-30T08:56:34Z

cc @cloud-fan

cloud-fan · 2020-07-30T10:54:22Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+    // group without filter clause.
    // This check can produce false-positives, e.g., SUM(DISTINCT a) & COUNT(DISTINCT a).
-    distinctAggs.size > 1
+    distinctAggs.size > 1 || (distinctAggs.size == 1 && aggExpressions.exists(_.filter.isDefined))


We can remove distinctAggs.size == 1, as it's implied by distinctAggs.size > 1 || ...

If distinctAggs.size == 0 and aggExpressions.exists(_.filter.isDefined), we do not need this rewrite.
A normal aggregate with a filter can be handled by the physical plan.

Shouldn't it be distinctAggs.exists(_.filter.isDefined)?

Shall we match https://github.com/apache/spark/pull/29291/files#diff-29e82df7487a97f879691c1b525709aeR231 ?

OK. I got it.
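To make the difference between the two conditions in this thread concrete, here is a minimal sketch in plain Scala. The `Agg` case class is a hypothetical, simplified stand-in for Catalyst's AggregateExpression that models only `isDistinct` and `filter`. The function names `proposed` and `suggested` are also made up for illustration.

```scala
// Hypothetical, simplified stand-in for Catalyst's AggregateExpression:
// only the two properties the check inspects are modelled.
final case class Agg(name: String, isDistinct: Boolean, filter: Option[String] = None)

object RewriteCheckSketch {
  // Condition as first proposed in the diff.
  def proposed(aggExpressions: Seq[Agg]): Boolean = {
    val distinctAggs = aggExpressions.filter(_.isDistinct)
    distinctAggs.size > 1 ||
      (distinctAggs.size == 1 && aggExpressions.exists(_.filter.isDefined))
  }

  // Variant suggested in review: only a filter on a DISTINCT aggregate triggers the rewrite.
  def suggested(aggExpressions: Seq[Agg]): Boolean = {
    val distinctAggs = aggExpressions.filter(_.isDistinct)
    distinctAggs.size > 1 || distinctAggs.exists(_.filter.isDefined)
  }

  // One unfiltered DISTINCT aggregate next to a filtered non-DISTINCT one:
  // this is the case where the two conditions disagree.
  val mixed: Seq[Agg] = Seq(
    Agg("count(id)", isDistinct = false, filter = Some("class_id = 1")),
    Agg("sum(distinct id)", isDistinct = true)
  )
}
```

Under this simplified model, `proposed(mixed)` is true while `suggested(mixed)` is false, which matches the point made above that a normal aggregate with a filter does not need the rewrite.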

cloud-fan · 2020-07-30T10:59:20Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+      // Setup all the filters in distinct aggregate.
+      val distinctAggExprs = aggExpressions
+        .filter(e => e.isDistinct && e.children.exists(!_.foldable))
+      val distinctAggFilterAttrMap = distinctAggExprs.collect {


nit: val (distinctAggFilters, distinctAggFilterAttrs, maxCond) = distinctAggExprs.collect(...).unzip3

But I want

val distinctAggFilterAttrLookup = distinctAggFilterAttrMap.map { tuple3 => tuple3._1 -> tuple3._3.toAttribute }.toMap

This is equivalent to distinctAggFilters.zip(maxCond.map(_.toAttribute)).toMap
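The equivalence claimed here can be checked on plain Scala collections. The sketch below is only an illustration: strings stand in for Catalyst expressions, and `toUpperCase` is a made-up stand-in for `toAttribute`. The names `triples`, `viaTuples` and `viaZip` are hypothetical.

```scala
object Unzip3Sketch {
  // Hypothetical stand-ins for (filter expression, filter attribute, max condition).
  val triples: Seq[(String, String, String)] = Seq(
    ("sex = 'man'", "attr_1", "max_1"),
    ("class_id = 1", "attr_2", "max_2")
  )

  // Lookup built directly from the sequence of triples, as in the author's version.
  val viaTuples: Map[String, String] =
    triples.map { t => t._1 -> t._3.toUpperCase }.toMap

  // Reviewer's version: split the triples with unzip3, then zip the two needed columns.
  val (filters, attrs, maxConds) = triples.unzip3
  val viaZip: Map[String, String] =
    filters.zip(maxConds.map(_.toUpperCase)).toMap
}
```

Both maps pair each filter with the transformed third element, so the `unzip3` form does not lose the lookup the author wanted.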

SparkQA · 2020-07-30T23:35:40Z

Test build #126810 has finished for PR 29291 at commit 7362dfb.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-31T02:10:08Z

retest this please

cloud-fan · 2020-07-31T03:26:25Z

Can you rebase or merge with the master branch to pick up the GitHub Actions fix? Jenkins is quite unstable right now, so we may need to rely on GitHub Actions.

beliefer · 2020-07-31T03:54:14Z

OK

SparkQA · 2020-07-31T07:05:03Z

Test build #126843 has finished for PR 29291 at commit 7362dfb.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-31T07:28:08Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

+      val (distinctAggFilters, distinctAggFilterAttrs, maxConds) = distinctAggExprs.collect {
+        case AggregateExpression(_, _, _, filter, _) if filter.isDefined =>
+          val (e, attr) = expressionAttributePair(filter.get)
+          val aggregateExp = AggregateExpression(Max(attr), Partial, false)


nit: Max(attr).toAggregateExpression(distinct = false)

cloud-fan · 2020-07-31T07:39:22Z

sql/core/src/test/resources/sql-tests/inputs/group-by-filter.sql

+SELECT COUNT(DISTINCT id), COUNT(DISTINCT id) FILTER (WHERE date_format(hiredate, "yyyy-MM-dd HH:mm:ss") = "2001-01-01 00:00:00") FROM emp;
+SELECT COUNT(DISTINCT id) FILTER (WHERE hiredate = to_timestamp("2001-01-01 00:00:00")), COUNT(DISTINCT id) FILTER (WHERE hiredate = to_date('2001-01-01 00:00:00')) FROM emp;
+SELECT SUM(salary), COUNT(DISTINCT id), COUNT(DISTINCT id) FILTER (WHERE hiredate = date "2001-01-01") FROM emp;
+SELECT COUNT(DISTINCT 1) FILTER (WHERE a = 1) FROM testData;


Can we also test COUNT(DISTINCT id) FILTER (WHERE true) and COUNT(DISTINCT id) FILTER (WHERE false)?
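For reference, the two constant-filter cases should behave like the edge cases in the sketch below. This is plain Scala, not Spark, and the `ids` data and the `countDistinct` helper are hypothetical. It assumes the filter is applied before DISTINCT.

```scala
object ConstantFilterSketch {
  // Hypothetical id column with duplicates.
  val ids: Seq[Int] = Seq(1, 1, 2, 3, 3)

  // Models COUNT(DISTINCT id) FILTER (WHERE <predicate>).
  def countDistinct(values: Seq[Int], keep: Int => Boolean): Int =
    values.filter(keep).distinct.size

  // FILTER (WHERE true) keeps every row, so it matches plain COUNT(DISTINCT id).
  val withTrue: Int = countDistinct(ids, _ => true)

  // FILTER (WHERE false) drops every row, so the count is 0.
  val withFalse: Int = countDistinct(ids, _ => false)
}
```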

SparkQA · 2020-07-31T11:46:28Z

Test build #126857 has finished for PR 29291 at commit fbb051b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-31T11:55:13Z

Test build #126866 has finished for PR 29291 at commit 9939ea7.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-31T17:03:42Z

Test build #126886 has finished for PR 29291 at commit abafc20.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-03T07:05:01Z

Test build #126953 has finished for PR 29291 at commit 39583dd.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-08-03T07:05:02Z

Test build #126956 has finished for PR 29291 at commit 883973b.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-03T07:47:30Z

retest this please

SparkQA · 2020-08-03T12:13:20Z

Test build #126968 has finished for PR 29291 at commit 883973b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-03T12:16:58Z

retest this please

SparkQA · 2020-08-03T16:00:49Z

Test build #126980 has finished for PR 29291 at commit 883973b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-03T16:04:59Z

retest this please

SparkQA · 2020-08-03T21:17:59Z

Test build #126994 has finished for PR 29291 at commit 883973b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-08-04T04:41:13Z

thanks, merging to master!

beliefer · 2020-08-04T04:55:45Z

@cloud-fan Thanks for your review and good idea.

beliefer and others added 13 commits June 19, 2020 10:36

Reuse completeNextStageWithFetchFailure

4a6f903

Merge remote-tracking branch 'upstream/master'

96456e2

Merge remote-tracking branch 'upstream/master'

4314005

Merge remote-tracking branch 'upstream/master'

d6af4a7

Merge remote-tracking branch 'upstream/master'

f69094f

Merge remote-tracking branch 'upstream/master'

b86a42d

Merge branch 'master' of github.com:beliefer/spark

2ac5159

Merge remote-tracking branch 'upstream/master'

9021d6c

Merge branch 'master' of github.com:beliefer/spark

74a2ef4

Support single distinct group with filter.

199aa6f

Support distinct agg with filter

a73f11e

Supplement doc and comment.

72e95f1

Add test case and regenerate golden files.

8e82e83

probot-autolabeler bot added the SQL label Jul 29, 2020

Add test case and regenerate golden files.

4ba808b

Optimize code

145a9dd

Update doc

0fcf643

cloud-fan reviewed Jul 30, 2020

beliefer added 2 commits July 30, 2020 22:35

Optimize code.

92a37a9

Optimize code.

7362dfb

beliefer added 2 commits July 31, 2020 11:52

Merge remote-tracking branch 'upstream/master'

9828158

Merge branch 'master' into support-distinct-with-filter

fbb051b

cloud-fan reviewed Jul 31, 2020

Add tests case like distinct 1

9939ea7

cloud-fan reviewed Jul 31, 2020

cloud-fan reviewed Jul 31, 2020

beliefer added 2 commits July 31, 2020 19:07

Optimize code

2dc6f32

Optimize code

abafc20

cloud-fan approved these changes Jul 31, 2020

beliefer added 2 commits August 3, 2020 13:16

Optimize code

39583dd

Optimize code

883973b

cloud-fan closed this in 1597d8f Aug 4, 2020
