[SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT #29135

beliefer · 2020-07-16T15:20:24Z

What changes were proposed in this pull request?

This PR is related to #26656.
#26656 only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR extends that feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
For example:

select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;

Note:
In #26656, we use AggregationIterator to evaluate the filter conditions of aggregate expressions. This is good because the filter can be evaluated locally in the first aggregate.
But AggregationIterator only supports a single DISTINCT aggregate with a FILTER clause.
So this PR uses Project to project the filter clause as a newly generated attribute (e.g. _gen_attr_0) and ensures that the filter is still evaluated locally.
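
The equivalence this rewrite relies on can be illustrated outside Spark. The following is a minimal plain-Python sketch, not Spark code: the sample rows and helper names are made up for illustration, and None stands in for SQL NULL. It compares evaluating sum(distinct id) filter (where sex = 'man') directly with first projecting if(sex = 'man', id, null) as a generated column and then running an unfiltered distinct sum that skips NULLs.

```python
# Hypothetical sample data standing in for the `student` table.
students = [
    {"id": 1, "sex": "man", "class_id": 1},
    {"id": 1, "sex": "man", "class_id": 1},  # duplicate id, must count once
    {"id": 2, "sex": "woman", "class_id": 1},
    {"id": 3, "sex": "man", "class_id": 2},
]


def sum_distinct_with_filter(rows):
    """sum(distinct id) filter (where sex = 'man'), evaluated directly."""
    kept = {row["id"] for row in rows if row["sex"] == "man"}
    return sum(kept) if kept else None  # SQL SUM over no rows is NULL


def project_filter(rows):
    """Project the filter into a generated column: if(sex = 'man', id, null)."""
    return [
        {"_gen_attr_0": row["id"] if row["sex"] == "man" else None}
        for row in rows
    ]


def sum_distinct_ignoring_nulls(rows, column):
    """sum(distinct column) with no filter; NULL inputs are skipped as in SQL."""
    kept = {row[column] for row in rows if row[column] is not None}
    return sum(kept) if kept else None


direct = sum_distinct_with_filter(students)
rewritten = sum_distinct_ignoring_nulls(project_filter(students), "_gen_attr_0")
print(direct, rewritten)  # 4 4
```

Both paths agree because rows that fail the filter become NULL in the generated column, and SQL aggregates ignore NULL inputs.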

Why are the changes needed?

Spark SQL only supports the FILTER clause on aggregate expressions without DISTINCT.
This PR allows the FILTER clause and DISTINCT to be used together in the same aggregate expression.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing and new UTs.

SparkQA · 2020-07-16T17:57:22Z

Test build #125985 has finished for PR 29135 at commit 4f9c7e6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-17T04:06:42Z

Test build #126016 has finished for PR 29135 at commit fefbce0.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-17T08:47:59Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ProjectFilterInAggregates.scala

+  *                         if ('id > 1) 'cat2 else null,
+ *                          cast('value as bigint),
+ *                          if ('key = "a") cast('value as bigint) else null]
+ *        output = ['key, '_gen_attr_1, '_gen_attr_2, '_gen_attr_3, '_gen_attr_4])


cat1 is not related to the filter, why do we change its name to _gen_attr_1?

For convenience and unification, we always alias the column, even if there is no filter.
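
A minimal sketch of the uniform aliasing described in this reply, in plain Python rather than Catalyst code. The AggInput type, the project_aggregate_inputs helper, and the string form of the expressions are hypothetical; only the _gen_attr_N naming and the if (...) ... else null shape come from the diff comment above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AggInput:
    """Hypothetical description of one aggregate input: its child expression
    and an optional FILTER predicate, both kept as plain strings."""
    child: str
    filter: Optional[str] = None


def project_aggregate_inputs(inputs: List[AggInput]) -> List[Tuple[str, str]]:
    """Return (alias, projected expression) pairs.

    Every input gets a `_gen_attr_N` alias, whether or not it has a filter,
    so later steps can treat all aggregate inputs the same way.
    """
    projected = []
    for index, agg in enumerate(inputs, start=1):
        if agg.filter is None:
            expr = agg.child  # no filter: pass the child through unchanged
        else:
            expr = f"if ({agg.filter}) {agg.child} else null"
        projected.append((f"_gen_attr_{index}", expr))
    return projected


projection = project_aggregate_inputs([
    AggInput("cat1"),                    # unfiltered, still aliased
    AggInput("cat2", filter="id > 1"),   # filter folded into the projection
])
for alias, expr in projection:
    print(f"{expr} AS {alias}")
```

Aliasing the unfiltered input as well costs one extra name but means nothing downstream has to branch on whether a filter was present.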

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

cloud-fan · 2020-07-17T08:57:50Z

sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala

+      // aggregate function. For example,
+      // 1.for AVG(DISTINCT value) GROUP BY key, the grouping expressions will be [key, value].
+      // 2.for AVG (DISTINCT value) Filter (WHERE age > 20) GROUP BY key, the grouping expression
+      // will be [key, value, age].


AVG (DISTINCT value) Filter (WHERE age > 20) will be rewritten as AVG (DISTINCT _gen_attr) Filter (WHERE _gen_attr is not null). So should we group by key and _gen_attr here?

Thanks for the reminder.
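
To make the grouping question concrete, here is a rough plain-Python sketch of evaluating a distinct aggregate in two steps: first group by the grouping key plus the distinct column to remove duplicates, then aggregate per key. It assumes the filter has already been projected into a _gen_attr column as the review comment suggests, with None for SQL NULL. The sample rows and the function name are invented for illustration, and this is not the code in AggUtils.scala.

```python
from collections import defaultdict

# Hypothetical projected rows: (key, _gen_attr), where _gen_attr is
# `if (age > 20) value else null` computed by an earlier projection.
rows = [
    ("a", 10), ("a", 10), ("a", 30), ("a", None),
    ("b", 5), ("b", None),
    ("c", None),
]


def avg_distinct_by_key(projected_rows):
    # Step 1: group by (key, _gen_attr) so each distinct pair appears once.
    distinct_pairs = set(projected_rows)

    # Step 2: group by key alone and average the non-null distinct values.
    values_by_key = defaultdict(list)
    for key, gen_attr in distinct_pairs:
        values_by_key[key]  # touch the key so groups with only NULLs appear
        if gen_attr is not None:
            values_by_key[key].append(gen_attr)

    # AVG over no non-null values is NULL.
    return {
        key: (sum(values) / len(values) if values else None)
        for key, values in values_by_key.items()
    }


result = avg_distinct_by_key(rows)
print(sorted(result.items()))  # [('a', 20.0), ('b', 5.0), ('c', None)]
```

Grouping on (key, _gen_attr) is enough here because the filter column no longer appears on its own: its effect is already encoded in which _gen_attr values are NULL.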

SparkQA · 2020-07-17T09:54:50Z

Test build #126036 has finished for PR 29135 at commit 202a454.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-17T11:18:07Z

Test build #126050 has finished for PR 29135 at commit a2c842e.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-17T15:37:50Z

Test build #126052 has finished for PR 29135 at commit 7127744.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-17T15:59:06Z

retest this please

SparkQA · 2020-07-17T20:47:29Z

Test build #126059 has finished for PR 29135 at commit 7127744.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-18T01:31:22Z

retest this please

SparkQA · 2020-07-18T06:18:27Z

Test build #126082 has finished for PR 29135 at commit 7127744.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-18T06:50:24Z

Test build #126084 has finished for PR 29135 at commit 2253499.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-18T07:05:02Z

Test build #126088 has finished for PR 29135 at commit 7159582.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-07-18T07:05:02Z

Test build #126089 has finished for PR 29135 at commit ba7c3a4.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-18T08:13:36Z

retest this please

SparkQA · 2020-07-18T12:40:10Z

Test build #126100 has finished for PR 29135 at commit ba7c3a4.

This patch fails PySpark pip packaging tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer · 2020-07-18T12:41:45Z

retest this please

SparkQA · 2020-07-18T17:28:08Z

Test build #126109 has finished for PR 29135 at commit ba7c3a4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

beliefer and others added 7 commits June 19, 2020 10:36

Reuse completeNextStageWithFetchFailure

4a6f903

Merge remote-tracking branch 'upstream/master'

96456e2

Merge remote-tracking branch 'upstream/master'

4314005

Merge remote-tracking branch 'upstream/master'

d6af4a7

Merge remote-tracking branch 'upstream/master'

f69094f

add new rule to project filter

5427485

idempotence and regenerate golden files.

4f9c7e6

probot-autolabeler bot added the SQL label Jul 16, 2020

generate attr use local index.

fefbce0

generate attr use local index.

202a454

cloud-fan reviewed Jul 17, 2020

...alyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala

cloud-fan reviewed Jul 17, 2020

beliefer added 2 commits July 17, 2020 18:39

Update comment and regenerate golden file.

a2c842e

regenerate golden file.

7127744

beliefer added 4 commits July 18, 2020 09:58

Replace old attr to new attr.

98e97e8

Revert comments.

2253499

Update comments.

7159582

Update comments.

ba7c3a4

beliefer closed this Jul 29, 2020
