[SPARK-22285] [SQL] Change implementation of ApproxCountDistinctForIntervals to TypedImperativeAggregate #19506

wzhfy · 2017-10-16T09:01:15Z

What changes were proposed in this pull request?

ApproxCountDistinctForIntervals is currently implemented as an ImperativeAggregate. Its number of aggBufferAttributes equals the total number of words across the hllppHelper array, and each hllppHelper uses 52 words at the default relativeSD.

Since this aggregate function is used in equi-height histogram generation, and a histogram usually has hundreds of buckets, the number of aggBufferAttributes can easily reach tens of thousands or more.
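To make the buffer-size arithmetic concrete, here is a rough sketch of where the 52 words could come from and how the total grows with the number of endpoints. The precision formula, the 6-bit register size, the default relativeSD of 0.05, and the "one helper per interval, so endpoints minus one" rule are assumptions based on the usual HyperLogLog++ layout, not details stated in this PR description. The object and method names are hypothetical.

```scala
object HllppBufferSize {
  // Assumed layout: 6-bit registers packed into 64-bit words.
  val WordSize = 64
  val RegisterSize = 6
  val RegistersPerWord: Int = WordSize / RegisterSize // 10 registers per word

  // Words needed by one HLL++ helper for a given relative standard deviation.
  def numWords(relativeSD: Double): Int = {
    // Assumed precision formula: p = ceil(2 * log2(1.106 / relativeSD)).
    val p = math.ceil(2.0 * math.log(1.106 / relativeSD) / math.log(2.0)).toInt
    val m = 1 << p // number of registers
    m / RegistersPerWord + 1
  }

  // Assumption: N endpoints define N - 1 intervals, with one helper per interval.
  def totalWords(numEndpoints: Int, relativeSD: Double): Int =
    (numEndpoints - 1) * numWords(relativeSD)
}
```

Under these assumptions, `HllppBufferSize.numWords(0.05)` gives 52, and 500 endpoints give 499 * 52 = 25948 buffer words, which is the order of magnitude described above.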

This leads to a huge generated method in codegen and causes the following error:

org.codehaus.janino.JaninoRuntimeException: Code of method "apply(Lorg/apache/spark/sql/catalyst/InternalRow;)Lorg/apache/spark/sql/catalyst/expressions/UnsafeRow;" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection" grows beyond 64 KB.

In addition, huge generated methods cause a performance regression.

In this PR, we change its implementation to TypedImperativeAggregate. After the fix, ApproxCountDistinctForIntervals can handle thousands of endpoints without throwing a codegen error, and the running time in a test case with 500 endpoints improves from 20 sec to 2 sec.

How was this patch tested?

Tested with an added test case and existing tests.

wzhfy · 2017-10-16T09:01:53Z

cc @cloud-fan

wzhfy · 2017-10-16T10:48:25Z

test this please

SparkQA · 2017-10-16T14:54:19Z

Test build #82798 has finished for PR 19506 at commit 652b301.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext

SparkQA · 2017-10-16T14:58:14Z

Test build #82796 has finished for PR 19506 at commit 652b301.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ApproxCountDistinctForIntervalsQuerySuite extends QueryTest with SharedSQLContext

cloud-fan · 2017-10-19T15:43:50Z

...la/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

  override def prettyName: String = "approx_count_distinct_for_intervals"
+
+  override def serialize(obj: Array[Long]): Array[Byte] = {
+    val buffer = ByteBuffer.wrap(new Array(obj.length * Longs.BYTES))


IIRC ByteBuffer is pretty slow for writing; shall we use unsafe writing?

wzhfy (reply, Oct 21, 2017): Changed to unsafe writing. Could you take another look?

cloud-fan · 2017-10-19T15:46:17Z

...la/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

-      val offset = mutableAggBufferOffset + hllppIndex * numWordsPerHllpp
-      hllppArray(hllppIndex).update(buffer, offset, value, child.dataType)
+      val offset = hllppIndex * numWordsPerHllpp
+      hllppArray(hllppIndex).update(LongArrayInput(buffer), offset, value, child.dataType)


You can just pass InternalRow(buffer) here to save a lot of code changes. If performance matters here, you can create a LongArrayInternalRow to avoid boxing.

wzhfy (reply, Oct 20, 2017): InternalRow(buffer) will copy the buffer.
Creating a LongArrayInternalRow is a good idea, thanks!
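The idea discussed in this thread, a row-like wrapper that exposes an existing Array[Long] without copying it and without boxing each value, might look roughly like the sketch below. `LongAccess` and `LongArrayRow` are hypothetical names used only for illustration; this is not the LongArrayInternalRow class added by the PR, and it does not extend Spark's InternalRow.

```scala
// Hypothetical minimal accessor interface, standing in for the row
// methods that a helper reading and writing long words would call.
trait LongAccess {
  def getLong(ordinal: Int): Long
  def setLong(ordinal: Int, value: Long): Unit
}

// Wraps the array by reference: nothing is copied on construction,
// and primitive longs are read and written without boxing.
final class LongArrayRow(array: Array[Long]) extends LongAccess {
  override def getLong(ordinal: Int): Long = array(ordinal)
  override def setLong(ordinal: Int, value: Long): Unit = {
    array(ordinal) = value
  }
}
```

Because the wrapper holds a reference rather than a copy, a write through `setLong` is visible in the underlying array, which is what an in-place buffer update needs.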

SparkQA · 2017-10-20T11:41:35Z

Test build #82929 has finished for PR 19506 at commit 2d1b070.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-21T11:16:49Z

Test build #82947 has finished for PR 19506 at commit 1b75428.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-10-21T11:34:53Z

Test build #82948 has finished for PR 19506 at commit 49b4ac2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-10-22T07:38:27Z

...la/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

+
+  override def serialize(obj: Array[Long]): Array[Byte] = {
+    val byteArray = new Array[Byte](obj.length * 8)
+    obj.indices.foreach { i =>


Use a while loop here for better performance in Scala, as this is a performance-sensitive code path.

wzhfy (reply, Oct 23, 2017): Fixed. Thanks for the reminder!

cloud-fan · 2017-10-22T07:39:11Z

...la/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

+  override def deserialize(bytes: Array[Byte]): Array[Long] = {
+    val length = bytes.length / 8
+    val longArray = new Array[Long](length)
+    (0 until length).foreach { i =>


cloud-fan · 2017-10-22T07:39:41Z

...la/org/apache/spark/sql/catalyst/expressions/aggregate/ApproxCountDistinctForIntervals.scala

+  }
+
+  override def deserialize(bytes: Array[Byte]): Array[Long] = {
+    val length = bytes.length / 8


add assert(bytes.length % 8 == 0)
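Taken together, the review comments on serialize and deserialize (while loops instead of `foreach`, plus a length check) suggest something like the following self-contained sketch. It is an illustration, not the merged code: to stay runnable without Spark on the classpath, it swaps the unsafe writes for plain bit shifts with a fixed big-endian byte order, and the `LongArraySerde` object name is hypothetical.

```scala
object LongArraySerde {
  // Packs each long into 8 bytes (big-endian here, chosen for the sketch).
  def serialize(obj: Array[Long]): Array[Byte] = {
    val bytes = new Array[Byte](obj.length * 8)
    var i = 0
    while (i < obj.length) { // while loop: no closure or Range allocation
      val v = obj(i)
      var j = 0
      while (j < 8) {
        bytes(i * 8 + j) = (v >>> (56 - 8 * j)).toByte
        j += 1
      }
      i += 1
    }
    bytes
  }

  // Inverse of serialize; rejects inputs that are not whole 8-byte words.
  def deserialize(bytes: Array[Byte]): Array[Long] = {
    assert(bytes.length % 8 == 0)
    val length = bytes.length / 8
    val longs = new Array[Long](length)
    var i = 0
    while (i < length) {
      var v = 0L
      var j = 0
      while (j < 8) {
        // Mask to avoid sign extension when widening the byte.
        v = (v << 8) | (bytes(i * 8 + j) & 0xffL)
        j += 1
      }
      longs(i) = v
      i += 1
    }
    longs
  }
}
```

The two functions are intended to round-trip, so `deserialize(serialize(xs))` should return an array with the same elements as `xs`.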

cloud-fan · 2017-10-22T07:40:34Z

LGTM except a few minor comments

SparkQA · 2017-10-23T03:43:38Z

Test build #82966 has finished for PR 19506 at commit 1e95a2f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-10-23T22:02:49Z

thanks, merging to master!

Zhenhua Wang added 4 commits October 13, 2017 15:04

implement ApproxCountDistinctForIntervals as TypedImperativeAggregate

f662394

remove offset

1c3e18a

fix withOffset return type

792b58a

add test for large number of endpoints

652b301

cloud-fan reviewed Oct 19, 2017

Zhenhua Wang added 3 commits October 20, 2017 13:33

merge master

ba75112

simplify by creating LongArrayInternalRow

7ba0883

Revert "merge master"

2d1b070

This reverts commit ba75112.

Zhenhua Wang added 2 commits October 21, 2017 16:31

use unsafe writing

1b75428

add serde test case

49b4ac2

cloud-fan reviewed Oct 22, 2017

use while loop

1e95a2f

asfgit closed this in f6290ae Oct 23, 2017
