
Conversation

@beliefer
Contributor

@beliefer beliefer commented Nov 7, 2019

What changes were proposed in this pull request?

The filter predicate for aggregate expressions is standard ANSI SQL:

<aggregate function> ::=
COUNT <left paren> <asterisk> <right paren> [ <filter clause> ]
| <general set function> [ <filter clause> ]
| <binary set function> [ <filter clause> ]
| <ordered set function> [ <filter clause> ]
| <array aggregate function> [ <filter clause> ]
| <row pattern count function> [ <filter clause> ]

Several mainstream databases support this syntax.
PostgreSQL:
https://www.postgresql.org/docs/current/sql-expressions.html#SYNTAX-AGGREGATES
For example:

SELECT
  year,
  count(*) FILTER (WHERE gdp_per_capita >= 40000)
FROM
  countries
GROUP BY
  year

SELECT
  year,
  code,
  gdp_per_capita,
  count(*) 
    FILTER (WHERE gdp_per_capita >= 40000) 
    OVER   (PARTITION BY year)
FROM
  countries

jOOQ:
https://blog.jooq.org/2014/12/30/the-awesome-postgresql-9-4-sql2003-filter-clause-for-aggregate-functions/

Notice:
This PR only supports the filter predicate without codegen. I will create another PR to support codegen.

Below are some demonstrations of this PR in my production environment.

spark-sql> desc gja_test_partition;
key     string  NULL
value   string  NULL
other   string  NULL
col2    int     NULL
# Partition Information
# col_name      data_type       comment
col2    int     NULL
Time taken: 0.79 s
spark-sql> select * from gja_test_partition;
a       A       ao      1
b       B       bo      1
c       C       co      1
d       D       do      1
e       E       eo      2
g       G       go      2
h       H       ho      2
j       J       jo      2
f       F       fo      3
k       K       ko      3
l       L       lo      4
i       I       io      4
Time taken: 1.75 s
spark-sql> select count(key), sum(col2) from gja_test_partition;
12      26
Time taken: 1.848 s
spark-sql> select count(key) filter (where col2 > 1) from gja_test_partition;
8
Time taken: 2.926 s
spark-sql> select sum(col2) filter (where col2 > 2) from gja_test_partition;
14
Time taken: 2.087 s
spark-sql> select count(key) filter (where col2 > 1), sum(col2) filter (where col2 > 2) from gja_test_partition;
8       14
Time taken: 2.847 s
spark-sql> select count(key), count(key) filter (where col2 > 1), sum(col2), sum(col2) filter (where col2 > 2) from gja_test_partition;
12      8       26      14
Time taken: 1.787 s
spark-sql> desc student;
id      int     NULL
name    string  NULL
sex     string  NULL
class_id        int     NULL
Time taken: 0.206 s
spark-sql> select * from student;
1       张三    man     1
2       李四    man     1
3       王五    man     2
4       赵六    man     2
5       钱小花  woman   1
6       赵九红  woman   2
7       郭丽丽  woman   2
Time taken: 0.786 s
spark-sql> select class_id, count(id), sum(id) from student group by class_id;
1       3       8
2       4       20
Time taken: 18.783 s
spark-sql> select class_id, count(id) filter (where sex = 'man'), sum(id) filter (where sex = 'woman') from student group by class_id;
1       2       5
2       2       13
Time taken: 3.887 s

Why are the changes needed?

Add a new SQL feature.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs and new UTs.

@beliefer
Contributor Author

beliefer commented Nov 7, 2019

cc @gatorsmile

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113365 has finished for PR 26420 at commit d521be1.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

beliefer commented Nov 7, 2019

Retest this please.

@SparkQA

SparkQA commented Nov 7, 2019

Test build #113370 has finished for PR 26420 at commit d521be1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @cloud-fan @rednaxelafx @maropu

!aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate])
// ImperativeAggregate and filter predicate are not supported right now
!(aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate]) ||
aggregateExpressions.exists(_.filter.isDefined))
Member
@maropu maropu Nov 8, 2019

Can't we support this in codegen mode? Is it technically hard? If we support this filter expression in aggregates, I personally think it would be better to support the codegen mode first.

Contributor Author

Thanks for your review. This is a big change, so I want to split it into two PRs to keep things simple. Another reason is that I have spent a lot of time on other work.

@maropu maropu changed the title [SPARK-27986] Support ANSI SQL filter predicate for aggregate expression. [SPARK-27986][SQL] Support ANSI SQL filter predicate for aggregate expression. Nov 8, 2019
@maropu
Member

maropu commented Nov 8, 2019

I just looked over the current implementation and I have one question: have you checked whether we could just transform these filter expressions into existing Spark expressions/operators, instead of the current approach (hard-coding it in each aggregate operator)?

<tr><td>FALSE</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>FETCH</td><td>reserved</td><td>non-reserved</td><td>reserved</td></tr>
<tr><td>FIELDS</td><td>non-reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
<tr><td>FILTER</td><td>reserved</td><td>non-reserved</td><td>non-reserved</td></tr>
Member

FILTER is reserved in SQL:2011. Also, could you please update TableIdentifierParserSuite, too?

Contributor Author

@maropu Thanks for the reminder. I will add it.

}
}
case u @ UnresolvedFunction(funcId, children, isDistinct) =>
case u @ UnresolvedFunction(funcId, children, isDistinct, filter) =>
Member

Could you throw an analysis exception if a filter is given for non-aggregate functions?

Contributor Author

@maropu Thanks for the reminder. I will add it.

@cloud-fan
Contributor

This is a nice feature! I'd like to know how it's implemented. Seems like we can't transform it into another logical form that we support, and we need to adjust our backend engine.

@maropu
Member

maropu commented Nov 8, 2019

I just thought of a super simple case like this:

postgres=# select * from t;
 k | v1 | v2 
---+----+----
 1 |  2 |  3
 1 |  4 |  5
 1 |    |  9
 2 |  3 |   
 2 |  5 |  8
(5 rows)

postgres=# select k, sum(v1) filter (where v1 > 2), avg(v2) filter (where v2 < 6) from t group by k;
 k | sum |        avg         
---+-----+--------------------
 2 |   8 |                   
 1 |   4 | 4.0000000000000000
(2 rows)

The query above might be transformed into...

scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").show()
+---+-------+-------+
|  k|sum(v1)|avg(v2)|
+---+-------+-------+
|  1|      4|    4.0|
|  2|      8|   null|
+---+-------+-------+

@cloud-fan
Contributor

@maropu looks like a good idea, but we need to make sure the aggregate function ignores nulls. It may not work for count(*) filter (where a > 1).
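
For illustration (a hypothetical sketch, not code from this PR): count(*) counts rows regardless of null columns, so pushing the predicate into an if(..., null) projection would still count the filtered-out rows; the predicate would instead have to become a nullable argument of count itself. Assuming a table t with an int column a:

scala> sql("select count(*) from (select if(a > 1, a, null) a from t)").show()  // wrong: still counts every row
scala> sql("select count(if(a > 1, true, null)) from t").show()                 // matches FILTER semantics, since count(expr) skips nulls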

@maropu
Member

maropu commented Nov 8, 2019

Ah, I see... nice suggestion. I need more time to think about how to implement it.

@beliefer
Contributor Author

beliefer commented Nov 9, 2019

@maropu
You said:
"I just looked over the current implementation and I have one question: have you checked whether we could just transform these filter expressions into existing Spark expressions/operators, instead of the current approach (hard-coding it in each aggregate operator)?"
I don't quite understand. Do you mean we should convert the filter expression into part of the aggregate expression, or part of the aggregate function?

@maropu
Member

maropu commented Nov 10, 2019

I meant we might not need to modify the physical plans for aggregates (e.g., HashAggregateExec). Instead, in the analyzer phase, we might be able to transform filter expressions into projections as shown above (Aggregate with Filter => Project + Aggregate):

// For the query "select k, sum(v1) filter (where v1 > 2), avg(v2) filter (where v2 < 6) from t group by k"
scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").show()
+---+-------+-------+
|  k|sum(v1)|avg(v2)|
+---+-------+-------+
|  1|      4|    4.0|
|  2|      8|   null|
+---+-------+-------+

scala> sql("select k, sum(v1), avg(v2) from (select k, if(v1 > 2, v1, null) v1, if(v2 < 6, v2, null) v2 from t) group by k").explain(true)
== Parsed Logical Plan ==
'Aggregate ['k], ['k, unresolvedalias('sum('v1), None), unresolvedalias('avg('v2), None)]
+- 'SubqueryAlias `__auto_generated_subquery_name`
   +- 'Project ['k, 'if(('v1 > 2), 'v1, null) AS v1#190, 'if(('v2 < 6), 'v2, null) AS v2#191]
      +- 'UnresolvedRelation [t]

== Analyzed Logical Plan ==
k: int, sum(v1): bigint, avg(v2): double
Aggregate [k#127], [k#127, sum(cast(v1#190 as bigint)) AS sum(v1)#194L, avg(cast(v2#191 as bigint)) AS avg(v2)#195]
+- SubqueryAlias `__auto_generated_subquery_name`
   +- Project [k#127, if ((v1#128 > 2)) v1#128 else cast(null as int) AS v1#190, if ((v2#129 < 6)) v2#129 else cast(null as int) AS v2#191]
      +- SubqueryAlias `default`.`t`
         +- Relation[k#127,v1#128,v2#129] parquet

== Optimized Logical Plan ==
Aggregate [k#127], [k#127, sum(cast(v1#190 as bigint)) AS sum(v1)#194L, avg(cast(v2#191 as bigint)) AS avg(v2)#195]
+- Project [k#127, if ((v1#128 > 2)) v1#128 else null AS v1#190, if ((v2#129 < 6)) v2#129 else null AS v2#191]
   +- Relation[k#127,v1#128,v2#129] parquet

== Physical Plan ==
*(2) HashAggregate(keys=[k#127], functions=[sum(cast(v1#190 as bigint)), avg(cast(v2#191 as bigint))], output=[k#127, sum(v1)#194L, avg(v2)#195])
+- Exchange hashpartitioning(k#127, 200), true, [id=#320]
   +- *(1) HashAggregate(keys=[k#127], functions=[partial_sum(cast(v1#190 as bigint)), partial_avg(cast(v2#191 as bigint))], output=[k#127, sum#202L, sum#203, count#204L])
      +- *(1) Project [k#127, if ((v1#128 > 2)) v1#128 else null AS v1#190, if ((v2#129 < 6)) v2#129 else null AS v2#191]
         +- *(1) ColumnarToRow
            +- FileScan parquet default.t[k#127,v1#128,v2#129] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/t], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<k:int,v1:int,v2:int>

@beliefer
Contributor Author

beliefer commented Nov 10, 2019

@maropu This is a good idea. As @cloud-fan said, there is the question of how to treat count. Besides count, there are other cases, such as percentile, percentile_approx, approx_count_distinct, and so on. We can't simply rewrite the filtered-out values as null.

@rednaxelafx
Contributor

I'd like to propose a solution for the codegen part that'll augment this PR. The overall direction this PR is taking sounds good to me, although I haven't reviewed the full details yet (would like to do that some time this week).

I'll prepare a separate PR for demo purposes to show how it'll augment the codegen part. It's actually fairly easy and could also serve as a bit of code clean up for a lot of the declarative aggregate functions.

The tl;dr is that I'd like to have explicit support for the user-specified filter clause in the infrastructure, instead of solely relying on a rewrite.
A lot of aggregate functions are null-skipping by nature, e.g. count(), sum(), avg(), etc. But that's not a property common to ALL possible aggregate functions, and some of them have interesting semantics, like first()/last(), where you can configure whether to include nulls in the result or skip them and only take the non-null values.
Having explicit support for the filter clause in the infrastructure ensures that we can properly support this feature without relying on a logical rewrite that might work for most aggregate functions while leaving a handful of exceptional cases to be implemented in really ugly ways.
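
As a toy model of what explicit infrastructure support could mean (a self-contained sketch with illustrative names like ToyAgg and SumAgg, not the PR's actual code): the FILTER predicate is a uniform gate the framework applies before each update, so the buffer is left untouched when the predicate fails and each function keeps its own null semantics.

trait ToyAgg[B] {
  def zero: B
  def update(buffer: B, row: Option[Int]): B
}

object SumAgg extends ToyAgg[Long] {
  val zero = 0L
  // null-skipping by nature: a None (SQL NULL) row leaves the buffer unchanged
  def update(b: Long, row: Option[Int]): Long = row.fold(b)(b + _)
}

// The framework applies the filter predicate uniformly, before update.
def aggregate[B](agg: ToyAgg[B], pred: Option[Int] => Boolean, rows: Seq[Option[Int]]): B =
  rows.foldLeft(agg.zero)((b, r) => if (pred(r)) agg.update(b, r) else b)

aggregate(SumAgg, _.exists(_ > 2), Seq(Some(2), Some(4), None, Some(5)))  // 9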

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113566 has finished for PR 26420 at commit f32ac4d.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 11, 2019

Test build #113584 has finished for PR 26420 at commit 9ea4736.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

This is a nice feature! I'd like to know how it's implemented. Seems like we can't transform it into another logical form that we support, and we need to adjust our backend engine.

I think so.

result
case UnresolvedExtractValue(child, fieldExpr) if child.resolved =>
ExtractValue(child, fieldExpr, resolver)
case f @ UnresolvedFunction(_, children, _, filter) if filter.isDefined =>
Member
@maropu maropu Nov 14, 2019

like this?

        case f @ UnresolvedFunction(_, children, _, Some(filter)) =>
          val newFilter = filter.mapChildren(resolveExpressionTopDown(_, q))
          val newChildren = children.map(resolveExpressionTopDown(_, q))
          f.copy(children = newChildren, filter = Some(newFilter))

Contributor Author

OK. I will try it like this.

override lazy val references: AttributeSet = {
mode match {
case Partial | Complete => aggregateFunction.references
case Partial | Complete if filter == None => aggregateFunction.references
Member

I think we don't need this match.

Contributor Author

Yeah, you're right.

case (ae: DeclarativeAggregate, expression) =>
expression.mode match {
val filterExpressions = expressions.map(_.filter)
val notExistsFilter = !filterExpressions.exists(_ != None)
Member

notExistsFilter is predicates.isEmpty?

Contributor Author

Good suggestion!

expression.mode match {
val filterExpressions = expressions.map(_.filter)
val notExistsFilter = !filterExpressions.exists(_ != None)
var isFinalOrMerge = false
Member

Can you use val for isFinalOrMerge like this?

 val isFinalOrMerge = functions.exists(....)

Contributor Author

isFinalOrMerge is related to expressions.

Member

If you want to check if they have PartialMerge or Final;

      val isFinalOrMerge = expressions.map(_.mode)
        .collect { case PartialMerge | Final => true }.nonEmpty

Member

Then, please move this variable to line 223.

Contributor Author

good idea

(currentBuffer: InternalRow, row: InternalRow) => {
// Process all expression-based aggregate functions.
updateProjection.target(currentBuffer)(joinedRow(currentBuffer, row))
if (notExistsFilter || isFinalOrMerge) {
Member

like this?

if (notExistsFilter || isFinalOrMerge) {
  (currentBuffer: InternalRow, row: InternalRow) => {...}
} else {
  (currentBuffer: InternalRow, row: InternalRow) => {...}
}

Contributor Author

Good idea!

case Partial | Complete =>
ae.filter.foreach { filterExpr =>
val filterAttrs = filterExpr.references.toSeq
val predicate = newPredicate(filterExpr, child.output ++ filterAttrs)
Member

genInterpretedPredicate instead of newPredicate?

btw, in the interpreted mode, doExecute() of FilterExec seems to currently use generated code from newPredicate for evaluating predicates. Any reason we can't turn off codegen there via CODEGEN_FACTORY_MODE? @viirya @cloud-fan

val predicate = newPredicate(condition, child.output)

Member

The background of adding CODEGEN_FACTORY_MODE is to have a config for tests only. It makes it easier for us to test the interpreted and codegen paths separately.

For non-test usage, I think we always go codegen first and fall back to interpreted if codegen fails.
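
A paraphrased sketch of that dispatch (simplified, standalone names; not the exact Spark source):

import scala.util.control.NonFatal

// Codegen-first with interpreted fallback; the factory-mode switch exists so
// tests can force either path deterministically.
def create[I, O](in: I, codegen: I => O, interpreted: I => O, mode: String): O = mode match {
  case "CODEGEN_ONLY" => codegen(in)
  case "NO_CODEGEN"   => interpreted(in)
  case _              => try codegen(in) catch { case NonFatal(_) => interpreted(in) }
}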

override def supportCodegen: Boolean = {
// ImperativeAggregate is not supported right now
!aggregateExpressions.exists(_.aggregateFunction.isInstanceOf[ImperativeAggregate])
// ImperativeAggregate and filter predicate are not supported right now
Member

Can you file a new JIRA for the codegen support of filters in aggregates? Then, put the JIRA ID here.

Contributor Author

No problem! But let's wait for @rednaxelafx.

// so return an empty kvIterator.
Iterator.empty
} else {
val filterPredicates = new HashMap[Int, GenPredicate]
Member

nit: mutable.HashMap

Contributor Author

OK.

aggregateAttributes: Seq[Attribute],
initialInputBufferOffset: Int,
resultExpressions: Seq[NamedExpression],
predicates: HashMap[Int, GenPredicate],
Member

Map instead of HashMap?

Contributor Author

OK.

Iterator.empty
} else {
val filterPredicates = new mutable.HashMap[Int, GenPredicate]
aggregateExpressions.zipWithIndex.foreach{
Member

nit: the format foreach{ => foreach {

Contributor Author

OK

@maropu
Member

maropu commented Nov 14, 2019

I did quick reviews and left some comments, so could you check my comments above?
Also, can you check the query below?

//PgSQL
postgres=# select * from t;
 k | v1 | v2 
---+----+----
 1 |  1 |  1
 2 |  2 |  2
(2 rows)

postgres=# select k, sum(v1) filter (where v1 > (select 1)), avg(v2) from t group by k;
 k | sum |          avg           
---+-----+------------------------
 2 |   2 |     2.0000000000000000
 1 |     | 1.00000000000000000000
(2 rows)

// This PR
scala> sql("select k, sum(v1) filter (where v1 > (select 1)), avg(v2) from t group by k").show()
19/11/14 16:48:15 ERROR Executor: Exception in task 1.0 in stage 13.0 (TID 207)
java.io.InvalidClassException: org.apache.spark.sql.catalyst.expressions.ScalarSubquery; no valid constructor
	at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
	at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2043)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1573)
	at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2287)
	at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2211)
	at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2069)
	at java.io.ObjectInputStream.readObject0(ObjectInputStream

I personally think we need to think more about the exact BNF grammar for aggregate filters that we will support in this PR.

ExtractValue(child, fieldExpr, resolver)
case f @ UnresolvedFunction(_, children, _, filter) if filter.isDefined =>
val newChildren = children.map(resolveExpressionTopDown(_, q))
val newFilter = filter.map{ expr => expr.mapChildren(resolveExpressionTopDown(_, q))}
Member

nit: format .map{ => .map {

Contributor Author

After the modifications above, this problem no longer exists.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113872 has finished for PR 26420 at commit 4dcd0d3.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113874 has finished for PR 26420 at commit 060d3d4.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113877 has finished for PR 26420 at commit 4443883.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 15, 2019

Test build #113878 has finished for PR 26420 at commit 8beff8a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

@maropu The filter predicate now supports subqueries.

@SparkQA

SparkQA commented Nov 20, 2019

Test build #114147 has finished for PR 26420 at commit b677268.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cloud-fan pushed a commit that referenced this pull request Nov 20, 2019
### What changes were proposed in this pull request?

This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`.
Modifications are listed below;
 - Move `Predicate` from `o.a.s.sql.catalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sql.catalyst.expressions.predicates.scala`
 - To resolve the name conflict, rename `o.a.s.sql.catalyst.expressions.codegen.Predicate` to `o.a.s.sql.catalyst.expressions.BasePredicate`
 - Extend `CodeGeneratorWithInterpretedFallback` for `BasePredicate`

This comes from the cloud-fan suggestion: #26420 (comment)

### Why are the changes needed?

For better code/test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #26604 from maropu/RefactorPredicate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@SparkQA

SparkQA commented Nov 20, 2019

Test build #114161 has finished for PR 26420 at commit 4c644ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114188 has finished for PR 26420 at commit 255650a.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class SpecificPredicate extends $
  • case class ArraySort(
  • case class TypeOf(child: Expression) extends UnaryExpression
  • abstract class BasePredicate
  • case class RenameTableStatement(
  • case class AlterNamespaceSetLocationStatement(
  • case class RenameTable(
  • case class LocalShuffleReaderExec(
  • case class RenameTableExec(

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114200 has finished for PR 26420 at commit 518aa4f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

Retest this please

@SparkQA

SparkQA commented Nov 21, 2019

Test build #114220 has finished for PR 26420 at commit 518aa4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer
Contributor Author

@cloud-fan @maropu Could you continue to review this PR?

name: FunctionIdentifier,
children: Seq[Expression],
isDistinct: Boolean)
inputs: Seq[Expression],
Member

nit: inputs -> arguments

Contributor Author

OK

assert (aggregateExpressions.get.size == 1)
checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil)
}

Member

I think we need more exhaustive tests for the aggregate filter support. Could you add tests in SQLQueryTestSuite, e.g., `input/group-by-filter.sql`?

Contributor Author

OK. I will add this in a new PR.

case Partial | Complete =>
case Partial | Complete if filterExpressions(i).isDefined =>
(buffer: InternalRow, row: InternalRow) =>
if (predicates(i).eval(row)) { ae.update(buffer, row) }
Member

Why did you use two variables, predicates and filterExpressions, for the filter?

Contributor Author

OK. I will use predicates only.

if (predicates.isEmpty || isFinalOrMerge) {
(currentBuffer: InternalRow, row: InternalRow) => {
updateProjection.target(currentBuffer)(joinedRow(currentBuffer, row))
processImperative(currentBuffer, row)
Member

I'm a bit worried that this closure can cause some performance overhead when processing regular non-filter aggregate functions. cc: @cloud-fan

expression.mode match {
val filterExpressions = expressions.map(_.filter)
var isFinalOrMerge = false
val mergeExpressions = functions.zipWithIndex.collect {
Member
@maropu maropu Nov 23, 2019

Why did you change functions.zip(expressions).flatMap to functions.zipWithIndex.collect here?

Contributor Author
@beliefer beliefer Nov 23, 2019

Lines 248 and 250 use the index, so I made this change.

case Partial | Complete if filterExpressions(i).isDefined =>
(buffer: InternalRow, row: InternalRow) =>
if (predicates(i).eval(row)) { ae.update(buffer, row) }
case Partial | Complete if filterExpressions(i).isEmpty =>
Member
@maropu maropu Nov 23, 2019

nit: like this?

            case Partial | Complete =>
              if (predicateOptions(i).isDefined) {
                (buffer: InternalRow, row: InternalRow) =>
                  if (predicateOptions(i).get.eval(row)) { ae.update(buffer, row) }
              } else {
                (buffer: InternalRow, row: InternalRow) => ae.update(buffer, row)
              }

Contributor Author

I have improved it in another way.

filterPredicates
}

protected val predicates: mutable.Map[Int, BasePredicate] =
Member
@maropu maropu Nov 23, 2019

I think we don't need this variable outside generateProcessRow, so can you move this variable inside it like this?

  // Initializing functions used to process a row.
  protected def generateProcessRow(
      expressions: Seq[AggregateExpression],
      functions: Seq[AggregateFunction],
      inputAttributes: Seq[Attribute]): (InternalRow, InternalRow) => Unit = {
    val joinedRow = new JoinedRow
    if (expressions.nonEmpty) {
      // Initialize predicates for aggregate functions if necessary
      val predicateOptions = expressions.map {
        case AggregateExpression(_, mode, _, Some(filter), _) =>
          mode match {
            case Partial | Complete =>
              val filterAttrs = filter.references.toSeq
              val predicate = Predicate.create(filter, inputAttributes ++ filterAttrs)
              predicate.initialize(partIndex)
              Some(predicate)
            case _ =>
              None
          }
        case _ =>
          None
      }
      ....

// 2-3. Filter the data row using filter predicate filterC. If the filter predicate
// filterC is met, then calculate using aggregate expression exprC.
(currentBuffer: InternalRow, row: InternalRow) => {
val dynamicMergeExpressions = new mutable.ArrayBuffer[Expression]
Member
@maropu maropu Nov 23, 2019

Can you move the predicate processing for expression-based agg functions outside this row-by-row loop? The current code can cause heavy overhead when processing rows.
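
A minimal sketch of that hoisting, with toy types standing in for aggregate buffers and rows (hypothetical, not the PR's code): the per-function update closures are built once, and the per-row loop only applies them.

// Built once, outside the per-row loop: one update closure per aggregate,
// with the optional filter predicate already folded in.
val updateFns: Seq[(Long, Int) => Long] =
  Seq[Option[Int => Boolean]](Some(_ > 2), None).map {
    case Some(pred) => (buf: Long, row: Int) => if (pred(row)) buf + row else buf
    case None       => (buf: Long, row: Int) => buf + row
  }

// Per-row processing just applies the precomputed closures.
val result = Seq(1, 3, 5).foldLeft(Seq(0L, 0L)) { (bufs, row) =>
  bufs.zip(updateFns).map { case (b, f) => f(b, row) }
}
// result: List(8, 9) -- the filtered sum skips 1; the unfiltered sum includes it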

Contributor Author

I need some time to think about it.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114314 has finished for PR 26420 at commit 244adc6.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114313 has finished for PR 26420 at commit 6e2e2b7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114315 has finished for PR 26420 at commit ba14173.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

checkAnswer(df, Row(1, 3, 4) :: Row(2, 3, 4) :: Row(3, 3, 4) :: Nil)
}

test("SPARK-27986: support filter clause for aggregate function with hash") {
Member

we don't need the prefix now: apache/spark-website#231

Contributor Author

OK. I will remove the prefix in a new PR.

@maropu
Member

maropu commented Nov 23, 2019

Can you check the BNF grammar for <filter clause> defined in the ANSI/SQL standard?
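
For reference, SQL:2011 defines the clause as:

<filter clause> ::= FILTER <left paren> WHERE <search condition> <right paren>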

@SparkQA

SparkQA commented Nov 23, 2019

Test build #114312 has finished for PR 26420 at commit 92330b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@beliefer beliefer closed this Nov 24, 2019
@maropu
Member

maropu commented Nov 25, 2019

@beliefer Why did you close this?

@beliefer
Contributor Author

beliefer commented Nov 25, 2019

@maropu I made a mistake and lost the branch information. Thank you; I will create another fork branch and another PR.
I have restored this work at #26656.
