
Conversation

@beliefer
Contributor

@beliefer beliefer commented Nov 7, 2019

What changes were proposed in this pull request?

The filter predicate for aggregate expressions is standard ANSI SQL:

<aggregate function> ::=
COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
| <general set function> [ <filter clause> ]
| <binary set function> [ <filter clause> ]
| <ordered set function> [ <filter clause> ]
| <array aggregate function> [ <filter clause> ]
| <row pattern count function> [ <filter clause> ]

Several mainstream databases support this syntax.
PostgreSQL:
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
For example:

SELECT
  year,
  count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
  countries
GROUP BY
  year

SELECT
  year,
  code,
  gdp_per_capita,
  count(*) 
    FILTER (WHERE gdp_per_capita >= 40000) 
    OVER   (PARTITION BY year)
FROM
  countries

jOOQ:
https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/

Notice:
This PR only supports the filter predicate without codegen. I will create another PR to support codegen.

Below are some demonstrations of this PR in my production environment.

spark-sql> desc gja_test_partition;
key     string  NULL
value   string  NULL
other   string  NULL
col2    int     NULL
# Partition Information
# col_name      data_type       comment
col2    int     NULL
Time taken: 0.79 s
spark-sql> select * from gja_test_partition;
a       A       ao      1
b       B       bo      1
c       C       co      1
d       D       do      1
e       E       eo      2
g       G       go      2
h       H       ho      2
j       J       jo      2
f       F       fo      3
k       K       ko      3
l       L       lo      4
i       I       io      4
Time taken: 1.75 s
spark-sql> select count(key), sum(col2) from gja_test_partition;
12      26
Time taken: 1.848 s
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8       14
Time taken: 2.847 s
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12      8       26      14
Time taken: 1.787 s
spark-sql> desc student;
id      int     NULL
name    string  NULL
sex     string  NULL
class_id        int     NULL
Time taken: 0.206 s
spark-sql> select * from student;
1       张三    man     1
2       李四    man     1
3       王五    man     2
4       赵六    man     2
5       钱小花  woman   1
6       赵九红  woman   2
7       郭丽丽  woman   2
Time taken: 0.786 s
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1       3       8
2       4       20
Time taken: 18.783 s
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1       2       5
2       2       13
Time taken: 3.887 s

Why are the changes needed?

Add a new SQL feature.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs and new UTs.

@beliefer
Contributor Author

beliefer commented Nov 7, 2019

cc @gatorsmile

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113365 has finished for PR 26420 at commit d521be1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Nov 7, 2019

Retest this please.

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113370 has finished for PR 26420 at commit d521be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @cloud-fan @rednaxelafx @maropu

!aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate])
// ImperativeAggregate and filter predicate are not supported right now
!(aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate]) ||
aggregateExpressions.exists(_.filter.isDefined))
Member
@maropu maropu Nov 8, 2019

Can't we support this in codegen mode? Is it technically hard? If we support this filter expression in aggregates, I personally think it would be better to support the codegen mode first.

Contributor Author

Thanks for your review. This is a big change, so I want to split it into two PRs to keep things simple. Another reason is that I have spent a lot of time on other work.

@maropu maropu changed the title [SPARK-27986] Support ANSI SQL filter predicate for aggregate expression. [SPARK-27986][SQL] Support ANSI SQL filter predicate for aggregate expression. Nov 8, 2019
@maropu
Member

maropu commented Nov 8, 2019

I just looked over the current implementation and I have one question: have you checked whether we could just transform these filter expressions into existing Spark expressions/operators, instead of the current approach (hard-coding it in each aggregate operator)?

<tr><td>FALSE</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>FETCH</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>FIELDS</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
<tr><td>FILTER</td><td>reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
Member

FILTER is reserved in SQL:2011. Also, could you please update TableIdentifierParserSuite, too?

Contributor Author

@maropu Thanks for the reminder. I will add it.

}
}
case u @ UnresolvedFunction(funcId, children, isDistinct) =>
case u @ UnresolvedFunction(funcId, children, isDistinct, filter) =>
Member

Could you throw an analysis exception if a filter is given for non-aggregate functions?

Contributor Author

@maropu Thanks for the reminder. I will add it.

@cloud-fan
Contributor

This is a nice feature! I'd like to know how it's implemented. Seems like we can't transform it into another logical form that we support, and we need to adjust our backend engine.

@maropu
Member

maropu commented Nov 8, 2019

I just thought of a super simple case like this:

postgres=# select * from t;
 k | v1 | v2 
---+----+----
 1 |  2 |  3
 1 |  4 |  5
 1 |    |  9
 2 |  3 |   
 2 |  5 |  8
(5 rows)

postgres=# select k, sum(v1) filter (where v1 > 2), avg(v2) filter (where v2 < 6) from t group by k;
 k | sum |        avg         
---+-----+--------------------
 2 |   8 |                   
 1 |   4 | 4.0000000000000000
(2 rows)

The query above might be transformed into...

scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").show()
+---+-------+-------+
|  k|sum(v1)|avg(v2)|
+---+-------+-------+
|  1|      4|    4.0|
|  2|      8|   null|
+---+-------+-------+

@cloud-fan
Contributor

@maropu looks like a good idea, but we need to make sure the aggregate function ignores nulls. It may not work for count(*) filter (where a > 1).
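
For illustration (a hypothetical sketch, not code from this PR): count(*) counts rows regardless of null columns, so pushing the predicate into an if(..., null) projection would still count the filtered-out rows; the predicate would instead have to become a nullable argument of count itself. Assuming a table t with an int column a:

scala> sql("select count(*) from (select if(a > 1, a, null) a from t)").show()  // wrong: still counts every row
scala> sql("select count(if(a > 1, true, null)) from t").show()                 // matches FILTER semantics, since count(expr) skips nulls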

@maropu
Member

maropu commented Nov 8, 2019

Ah, I see... nice suggestion. I need more time to think about how to implement it.

@beliefer
Contributor Author

beliefer commented Nov 9, 2019

@maropu
You said:
"I just looked over the current implementation and I have one question: have you checked whether we could just transform these filter expressions into existing Spark expressions/operators, instead of the current approach (hard-coding it in each aggregate operator)?"
I don't quite understand. Do you mean we should convert the filter expression into part of the aggregate expression, or part of the aggregate function?

@maropu
Member

maropu commented Nov 10, 2019

I meant we might not need to modify the physical plans for aggregates (e.g., HashAggregateExec). Instead, in the analyzer phase, we might be able to transform filter expressions into projections as shown above (Aggregate with Filter => Project + Aggregate):

// For the query "select k, sum(v1) filter (where v1 > 2), avg(v2) filter (where v2 < 6) from t group by k"
scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").show()
+---+-------+-------+
|  k|sum(v1)|avg(v2)|
+---+-------+-------+
|  1|      4|    4.0|
|  2|      8|   null|
+---+-------+-------+

scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").explain(true)
== Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('sum('v1), None), unresolvedalias('avg('v2), None)]
+- 'SubqueryAlias `__auto_generated_subquery_name`
   +- 'Project ['k, 'if(('v1 > 2), 'v1, null) AS v1#190, 'if(('v2 < 6), 'v2, null) AS v2#191]
      +- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
k: int, sum(v1): bigint, avg(v2): double
Aggregate [k#127], [k#127, sum(cast(v1#190 as bigint)) AS sum(v1)#194L, avg(cast(v2#191 as bigint)) AS avg(v2)#195]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- Project [k#127, if ((v1#128 > 2)) v1#128 else cast(null as int) AS v1#190, if ((v2#129 < 6)) v2#129 else cast(null as int) AS v2#191]
      +- SubqueryAlias `default`.`t`
         +- Relation[k#127,v1#128,v2#129] parquet

== Optimized Logical Plan ==
Aggregate [k#127], [k#127, sum(cast(v1#190 as bigint)) AS sum(v1)#194L, avg(cast(v2#191 as bigint)) AS avg(v2)#195]
+- Project [k#127, if ((v1#128 > 2)) v1#128 else null AS v1#190, if ((v2#129 < 6)) v2#129 else null AS v2#191]
   +- Relation[k#127,v1#128,v2#129] parquet

== Physical Plan ==
*(2) HashAggregate(keys=[k#127], functions=[sum(cast(v1#190 as bigint)), avg(cast(v2#191 as bigint))], output=[k#127, sum(v1)#194L, avg(v2)#195])
+- Exchange hashpartitioning(k#127, 200), true, [id=#320]
   +- *(1) HashAggregate(keys=[k#127], functions=[partial_sum(cast(v1#190 as bigint)), partial_avg(cast(v2#191 as bigint))], output=[k#127, sum#202L, sum#203, count#204L])
      +- *(1) Project [k#127, if ((v1#128 > 2)) v1#128 else null AS v1#190, if ((v2#129 < 6)) v2#129 else null AS v2#191]
         +- *(1) ColumnarToRow
            +- FileScan parquet default.t[k#127,v1#128,v2#129] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v1:int,v2:int>

@beliefer
Contributor Author

beliefer commented Nov 10, 2019

@maropu This is a good idea. As @cloud-fan said, there is the question of how to treat count. Besides count, there are other cases, such as percentile, percentile_approx, approx_count_distinct, and so on. We can't simply rewrite the filtered-out values as null.

@rednaxelafx
Contributor

I'd like to propose a solution for the codegen part that'll augment this PR. The overall direction this PR is taking sounds good to me, although I haven't reviewed the full details yet (would like to do that some time this week).

I'll prepare a separate PR for demo purposes to show how it'll augment the codegen part. It's actually fairly easy and could also serve as a bit of code clean up for a lot of the declarative aggregate functions.

The tl;dr is that I'd like to have explicit support for the user-specified filter clause in the infrastructure, instead of solely relying on a rewrite.
A lot of aggregate functions are null-skipping by nature, e.g. count(), sum(), avg(), etc. But that's not a property common to ALL possible aggregate functions, and some of them have interesting semantics, like first()/last(), where you can configure whether to include nulls in the result or skip them and only take the non-null values.
Having explicit support for the filter clause in the infrastructure ensures that we can properly support this feature without relying on a logical rewrite that might work for most aggregate functions while leaving a handful of exceptional cases to be implemented in really ugly ways.
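
As a toy model of what explicit infrastructure support could mean (a self-contained sketch with illustrative names like ToyAgg and SumAgg, not the PR's actual code): the FILTER predicate is a uniform gate the framework applies before each update, so the buffer is left untouched when the predicate fails and each function keeps its own null semantics.

trait ToyAgg[B] {
  def zero: B
  def update(buffer: B, row: Option[Int]): B
}

object SumAgg extends ToyAgg[Long] {
  val zero = 0L
  // null-skipping by nature: a None (SQL NULL) row leaves the buffer unchanged
  def update(b: Long, row: Option[Int]): Long = row.fold(b)(b + _)
}

// The framework applies the filter predicate uniformly, before update.
def aggregate[B](agg: ToyAgg[B], pred: Option[Int] => Boolean, rows: Seq[Option[Int]]): B =
  rows.foldLeft(agg.zero)((b, r) => if (pred(r)) agg.update(b, r) else b)

aggregate(SumAgg, _.exists(_ > 2), Seq(Some(2), Some(4), None, Some(5)))  // 9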

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113566 has finished for PR 26420 at commit f32ac4d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113584 has finished for PR 26420 at commit 9ea4736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

This is a nice feature! I'd like to know how it's implemented. Seems like we can't transform it into another logical form that we support, and we need to adjust our backend engine.

I think so.

result
case UnresolvedExtractValue(child, fieldExpr) if child.resolved =>
ExtractValue(child, fieldExpr, resolver)
case f @ UnresolvedFunction(_, children, _, filter) if filter.isDefined =>
Member
@maropu maropu Nov 14, 2019

like this?

        case f @ UnresolvedFunction(_, children, _, Some(filter)) =>
          val newFilter = filter.mapChildren(resolveExpressionTopDown(_, q))
          val newChildren = children.map(resolveExpressionTopDown(_, q))
          f.copy(children = newChildren, filter = Some(newFilter))

Contributor Author

OK. I will try it like this.

override lazy val references: AttributeSet = {
mode match {
case Partial | Complete => aggregateFunction.references
case Partial | Complete if filter == None => aggregateFunction.references
Member

I think we don't need this match.

Contributor Author

Yeah, you're right.

case (ae: DeclarativeAggregate, expression) =>
expression.mode match {
val filterExpressions = expressions.map(_.filter)
val notExistsFilter = !filterExpressions.exists(_ != None)
Member

notExistsFilter is predicates.isEmpty?

Contributor Author

Good suggestion!

expression.mode match {
val filterExpressions = expressions.map(_.filter)
val notExistsFilter = !filterExpressions.exists(_ != None)
var isFinalOrMerge = false
Member

Can you use val for isFinalOrMerge like this?

 val isFinalOrMerge = functions.exists(....)

Contributor Author

isFinalOrMerge is related to expressions.

Member

If you want to check if they have PartialMerge or Final;

      val isFinalOrMerge = expressions.map(_.mode)
        .collect { case PartialMerge | Final => true }.nonEmpty

Member

Then, please move this variable to line 223.

Contributor Author

good idea

(currentBuffer: InternalRow, row: InternalRow) => {
// Process all expression-based aggregate functions.
updateProjection.target(currentBuffer)(joinedRow(currentBuffer, row))
if (notExistsFilter || isFinalOrMerge) {
Member

like this?

if (notExistsFilter || isFinalOrMerge) {
  (currentBuffer: InternalRow, row: InternalRow) => {...}
} else {
  (currentBuffer: InternalRow, row: InternalRow) => {...}
}

Contributor Author

Good idea!

case Partial | Complete =>
ae.filter.foreach { filterExpr =>
val filterAttrs = filterExpr.references.toSeq
val predicate = newPredicate(filterExpr, child.output ++ filterAttrs)
Member

genInterpretedPredicate instead of newPredicate?

btw, in the interpreted mode, doExecute() of FilterExec seems to currently use generated code from newPredicate for evaluating predicates. Any reason we can't turn off codegen there via CODEGEN_FACTORY_MODE? @viirya @cloud-fan

val predicate = newPredicate(condition, child.output)

Member

The background of adding CODEGEN_FACTORY_MODE is to have a config for tests only. It makes it easier for us to test the interpreted and codegen paths separately.

For non-test usage, I think we always go codegen first and fall back to interpreted if codegen fails.
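
A paraphrased sketch of that dispatch (simplified, standalone names; not the exact Spark source):

import scala.util.control.NonFatal

// Codegen-first with interpreted fallback; the factory-mode switch exists so
// tests can force either path deterministically.
def create[I, O](in: I, codegen: I => O, interpreted: I => O, mode: String): O = mode match {
  case "CODEGEN_ONLY" => codegen(in)
  case "NO_CODEGEN"   => interpreted(in)
  case _              => try codegen(in) catch { case NonFatal(_) => interpreted(in) }
}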

override def supportCodegen: Boolean = {
// ImperativeAggregate is not supported right now
!aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate])
// ImperativeAggregate and filter predicate are not supported right now
Member

Can you file a new JIRA for the codegen support of filters in aggregates? Then, put the JIRA ID here.

Contributor Author

No problem! But let's wait for @rednaxelafx.

// so return an empty kvIterator.
Iterator.empty
} else {
val filterPredicates = new HashMap[Int, GenPredicate]
Member

nit: mutable.HashMap

Contributor Author

OK.

aggregateAttributes: Seq[Attribute],
initialInputBufferOffset: Int,
resultExpressions: Seq[NamedExpression],
predicates: HashMap[Int, GenPredicate],
Member

Map instead of HashMap?

Contributor Author

OK.

Iterator.empty
} else {
val filterPredicates = new mutable.HashMap[Int, GenPredicate]
aggregateExpressions.zipWithIndex.foreach{
Member

nit: the format foreach{ => foreach {

Contributor Author

OK

@maropu
Member

maropu commented Nov 14, 2019

I did quick reviews and left some comments, so could you check my comments above?
Also, can you check the query below?

//PgSQL
postgres=# select * from t;
 k | v1 | v2 
---+----+----
 1 |  1 |  1
 2 |  2 |  2
(2 rows)

postgres=# select k, sum(v1) filter (where v1 > (select 1)), avg(v2) from t group by k;
 k | sum |          avg           
---+-----+------------------------
 2 |   2 |     2.0000000000000000
 1 |     | 1.00000000000000000000
(2 rows)

// This PR
scala> sql("select k, sum(v1) filter (where v1 > (select 1)), avg(v2) from t group by k").show()
19/11/14 16:48:15 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 207)
java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.ScalarSubquery; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2043)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream

I personally think we need to think more about the exact BNF grammar for aggregate filters that we will support in this PR.

ExtractValue(child, fieldExpr, resolver)
case f @ UnresolvedFunction(_, children, _, filter) if filter.isDefined =>
val newChildren = children.map(resolveExpressionTopDown(_, q))
val newFilter = filter.map{ expr => expr.mapChildren(resolveExpressionTopDown(_, q))}
Member

nit: format .map{ => .map {

Contributor Author

After the modifications above, this problem no longer exists.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113872 has finished for PR 26420 at commit 4dcd0d3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113874 has finished for PR 26420 at commit 060d3d4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113877 has finished for PR 26420 at commit 4443883.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113878 has finished for PR 26420 at commit 8beff8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

@maropu The filter predicate now supports subqueries.

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114147 has finished for PR 26420 at commit b677268.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Nov 20, 2019
### What changes were proposed in this pull request?

This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`.
Modifications are listed below;
 - Move `Predicate` from `o.a.s.sql.catalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sql.catalyst.expressions.predicates.scala`
 - To resolve the name conflict, rename `o.a.s.sql.catalyst.expressions.codegen.Predicate` to `o.a.s.sql.catalyst.expressions.BasePredicate`
 - Extend `CodeGeneratorWithInterpretedFallback` for `BasePredicate`

This comes from the cloud-fan suggestion: #26420 (comment)

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26604 from maropu/RefactorPredicate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Nov 20, 2019

Test build #114161 has finished for PR 26420 at commit 4c644ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114188 has finished for PR 26420 at commit 255650a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SpecificPredicate extends $
  • case class ArraySort(
  • case class TypeOf(child: Expression) extends UnaryExpression
  • abstract class BasePredicate
  • case class RenameTableStatement(
  • case class AlterNamespaceSetLocationStatement(
  • case class RenameTable(
  • case class LocalShuffleReaderExec(
  • case class RenameTableExec(

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114200 has finished for PR 26420 at commit 518aa4f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

Retest this please

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114220 has finished for PR 26420 at commit 518aa4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

@cloud-fan @maropu Could you continue to review this PR?

name: FunctionIdentifier,
children: Seq[Expression],
isDistinct: Boolean)
inputs: Seq[Expression],
Member

nit: inputs -> arguments

Contributor Author

OK

assert (aggregateExpressions.get.size == 1)
checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil)
}

Member

I think we need more exhaustive tests for the aggregate filter support. Could you add tests in SQLQueryTestSuite, e.g., `input/group-by-filter.sql`?

Contributor Author

OK. I will add this in a new PR.

case Partial | Complete =>
case Partial | Complete if filterExpressions(i).isDefined =>
(buffer: InternalRow, row: InternalRow) =>
if (predicates(i).eval(row)) { ae.update(buffer, row) }
Member

Why did you use two variables, predicates and filterExpressions, for the filter?

Contributor Author

OK. I will use predicates only.

if (predicates.isEmpty || isFinalOrMerge) {
(currentBuffer: InternalRow, row: InternalRow) => {
updateProjection.target(currentBuffer)(joinedRow(currentBuffer, row))
processImperative(currentBuffer, row)
Member

I'm a bit worried that this closure can cause some performance overhead when processing regular non-filter aggregate functions. cc: @cloud-fan

expression.mode match {
val filterExpressions = expressions.map(_.filter)
var isFinalOrMerge = false
val mergeExpressions = functions.zipWithIndex.collect {
Member
@maropu maropu Nov 23, 2019

Why did you change functions.zip(expressions).flatMap to functions.zipWithIndex.collect here?

Contributor Author
@beliefer beliefer Nov 23, 2019

Lines 248 and 250 use the index, so I made this change.

case Partial | Complete if filterExpressions(i).isDefined =>
(buffer: InternalRow, row: InternalRow) =>
if (predicates(i).eval(row)) { ae.update(buffer, row) }
case Partial | Complete if filterExpressions(i).isEmpty =>
Member
@maropu maropu Nov 23, 2019

nit: like this?

            case Partial | Complete =>
              if (predicateOptions(i).isDefined) {
                (buffer: InternalRow, row: InternalRow) =>
                  if (predicateOptions(i).get.eval(row)) { ae.update(buffer, row) }
              } else {
                (buffer: InternalRow, row: InternalRow) => ae.update(buffer, row)
              }

Contributor Author

I have improved it in another way.

filterPredicates
}

protected val predicates: mutable.Map[Int, BasePredicate] =
Member
@maropu maropu Nov 23, 2019

I think we don't need this variable outside generateProcessRow, so can you move this variable inside it like this?

  // Initializing functions used to process a row.
  protected def generateProcessRow(
      expressions: Seq[AggregateExpression],
      functions: Seq[AggregateFunction],
      inputAttributes: Seq[Attribute]): (InternalRow, InternalRow) => Unit = {
    val joinedRow = new JoinedRow
    if (expressions.nonEmpty) {
      // Initialize predicates for aggregate functions if necessary
      val predicateOptions = expressions.map {
        case AggregateExpression(_, mode, _, Some(filter), _) =>
          mode match {
            case Partial | Complete =>
              val filterAttrs = filter.references.toSeq
              val predicate = Predicate.create(filter, inputAttributes ++ filterAttrs)
              predicate.initialize(partIndex)
              Some(predicate)
            case _ =>
              None
          }
        case _ =>
          None
      }
      ....

// 2-3. Filter the data row using filter predicate filterC. If the filter predicate
// filterC is met, then calculate using aggregate expression exprC.
(currentBuffer: InternalRow, row: InternalRow) => {
val dynamicMergeExpressions = new mutable.ArrayBuffer[Expression]
Member
@maropu maropu Nov 23, 2019

Can you move the predicate processing for expression-based agg functions outside this row-by-row loop? The current code can cause heavy overhead when processing rows.
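
A minimal sketch of that hoisting, with toy types standing in for aggregate buffers and rows (hypothetical, not the PR's code): the per-function update closures are built once, and the per-row loop only applies them.

// Built once, outside the per-row loop: one update closure per aggregate,
// with the optional filter predicate already folded in.
val updateFns: Seq[(Long, Int) => Long] =
  Seq[Option[Int => Boolean]](Some(_ > 2), None).map {
    case Some(pred) => (buf: Long, row: Int) => if (pred(row)) buf + row else buf
    case None       => (buf: Long, row: Int) => buf + row
  }

// Per-row processing just applies the precomputed closures.
val result = Seq(1, 3, 5).foldLeft(Seq(0L, 0L)) { (bufs, row) =>
  bufs.zip(updateFns).map { case (b, f) => f(b, row) }
}
// result: List(8, 9) -- the filtered sum skips 1; the unfiltered sum includes it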

Contributor Author

I need some time to think about it.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114314 has finished for PR 26420 at commit 244adc6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114313 has finished for PR 26420 at commit 6e2e2b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114315 has finished for PR 26420 at commit ba14173.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil)
}

test("SPARK-27986: support filter clause for aggregate function with hash") {
Member

we don't need the prefix now: apache/spark-website#231

Contributor Author

OK. I will remove the prefix in a new PR.

@maropu
Member

maropu commented Nov 23, 2019

Can you check the BNF grammar for <filter clause> defined in the ANSI/SQL standard?
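
For reference, SQL:2011 defines the clause as:

<filter clause> ::= FILTER <left paren> WHERE <search condition> <right paren>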

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114312 has finished for PR 26420 at commit 92330b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer beliefer closed this Nov 24, 2019
@maropu
Member

maropu commented Nov 25, 2019

@beliefer Why did you close this?

@beliefer
Contributor Author

beliefer commented Nov 25, 2019

@maropu I made a mistake and lost the branch information. Thank you; I will create another fork branch and another PR.
I have restored this work at #26656.
