[SPARK-16282][SQL] Implement percentile SQL function. #14136
Conversation
Test build #62088 has finished for PR 14136 at commit
Nit: Style
Test build #62089 has finished for PR 14136 at commit
Please use a while loop here; for is not that efficient.
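For illustration, a hedged sketch of the kind of rewrite being asked for (hypothetical helper, not the PR's code); a while loop compiles to a plain counted loop, whereas Scala's `for` desugars into closure-based foreach calls that can be slower on hot paths:

```scala
// Sum an indexed sequence of counts with an explicit while loop.
def totalCount(counts: IndexedSeq[Long]): Long = {
  var i = 0
  var total = 0L
  while (i < counts.length) {
    total += counts(i)
    i += 1
  }
  total
}
```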
Yep, I'll update that.
@jiangxb1987 Thanks for working on this. I did a quick pass, and it is a good start. I have a few issues:
A more performant way to do this would be to plan it using a combination of count grouped by the percentile key and this percentile function. I am not sure if we should pursue that in this PR.
@jiangxb1987 I am just curious why we use OpenHashMap here instead of mutable.Map, which would correspond with the code here in Hive. Is there any specific reason?
OpenHashMap is typically faster and has less overhead.
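For context, a minimal sketch of the counting pattern with Spark's OpenHashMap (names and shapes are illustrative, not the PR's exact code); changeValue updates in place instead of the lookup-then-put round trip a mutable.Map needs:

```scala
import org.apache.spark.util.collection.OpenHashMap

// Count occurrences of each key: insert 1 on first sight, otherwise bump in place.
val counts = new OpenHashMap[java.lang.Long, Long]()
def add(key: java.lang.Long): Unit = {
  counts.changeValue(key, 1L, _ + 1L)
}
```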
@hvanhovell Thanks!
@hvanhovell Thank you for your kind review; the suggestions are quite useful to me. I'll try to find some time later today to push the fixes. Thanks!
Test build #62313 has finished for PR 14136 at commit
@hvanhovell I've fixed most of the problems mentioned above, and I also added basic tests and comments as you requested. Please find some time to do another pass, thanks!
Test build #62314 has finished for PR 14136 at commit
Test build #62315 has finished for PR 14136 at commit
Shouldn't we check here if a percentile is valid? Waiting until eval is really late in the game.
We should also check that the array is not empty.
We also need to remove the line here: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveSessionCatalog.scala#L240.
Test build #62407 has finished for PR 14136 at commit
Test build #62408 has finished for PR 14136 at commit
Test build #62455 has finished for PR 14136 at commit
Test build #62466 has finished for PR 14136 at commit
Test build #69110 has started for PR 14136 at commit

retest this please.
Test build #69121 has finished for PR 14136 at commit
Test build #69144 has finished for PR 14136 at commit
```scala
  Countings()
}

private def evalPercentages(expr: Expression): (Boolean, Seq[Number]) = {
```
Why not return doubles?
```scala
copy(inputAggBufferOffset = newInputAggBufferOffset)

// Mark as lazy so that percentageExpression is not evaluated during tree transformation.
private lazy val (returnPercentileArray: Boolean, percentages: Seq[Number]) =
```
This can be problematic with serialization. Just put the percentages in a @transient lazy val and inline the use of returnPercentileArray.
```scala
override def nullable: Boolean = true

override def dataType: DataType =
  if (returnPercentileArray) ArrayType(DoubleType) else DoubleType
```
I think we should return the type of the input. We can always interpolate the value and cast it to the input type. Is this different from what Hive does?
Hive can return a double value or an array of double values even when the column data type is integer, for example:

```sql
hive> insert into tbl values(1),(2),(5),(10);
hive> select percentile(a, array(0, 0.25, 0.5, 0.75, 1)) from tbl;
[1.0,1.75,3.5,6.25,10.0]
```
```scala
// Returns null for empty inputs
override def nullable: Boolean = true

override def dataType: DataType =
```
```scala
override lazy val dataType: DataType = percentageExpression.dataType match {
  case _: ArrayType => ArrayType(DoubleType, false)
  case _ => DoubleType
}
```

```scala
Seq(NumericType, TypeCollection(NumericType, ArrayType))

override def checkInputDataTypes(): TypeCheckResult =
  TypeUtils.checkForNumericExpr(child.dataType, "function percentile")
```
Call super.checkInputDataTypes(); that will validate the inputTypes(). Also check the percentageExpression: it must be foldable, and the percentage(s) must be in the range [0, 1].
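A sketch of what such a check could look like (assuming a percentages: Seq[Double] value evaluated from the foldable percentageExpression; names follow the snippets quoted later in this review, not necessarily the final code):

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{TypeCheckFailure, TypeCheckSuccess}

override def checkInputDataTypes(): TypeCheckResult = {
  // Let the declared inputTypes() do the basic type validation first.
  val defaultCheck = super.checkInputDataTypes()
  if (defaultCheck.isFailure) {
    defaultCheck
  } else if (!percentageExpression.foldable) {
    // The percentage(s) must be constant so we can validate them at analysis time.
    TypeCheckFailure("The percentage(s) must be a constant literal, but got " +
      percentageExpression)
  } else if (percentages.exists(p => p < 0.0 || p > 1.0)) {
    TypeCheckFailure("Percentage(s) must be between 0.0 and 1.0, but got " +
      percentageExpression)
  } else {
    TypeCheckSuccess
  }
}
```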
BTW - you can make the analyzer add casts for you:
```scala
override def inputTypes: Seq[AbstractDataType] = percentageExpression.dataType match {
  case _: ArrayType => Seq(NumericType, ArrayType(DoubleType, false))
  case _ => Seq(NumericType, DoubleType)
}
```

Then you are always sure you get a double or a double array for the percentageExpression.
hvanhovell left a comment
I did another pass. My main feedback is to consolidate this more into a single class.
```scala
/**
 * A class that stores the numbers and their counts, used to support [[Percentile]] function.
 */
class Countings(val counts: OpenHashMap[Number, Long]) extends Serializable {
```
Please remove this class and put its implementation in the Percentile Aggregate.
The class TypedImperativeAggregate[T] requires access to this class, so perhaps we should keep it outside of Percentile.
We could entirely remove the class Countings.
```scala
 */
class CountingsSerializer {

  final def serialize(obj: Countings, dataType: DataType): Array[Byte] = {
```
Just put this in the Percentile class.
```scala
  return Seq.empty
}

val sortedCounts = counts.toSeq.sortBy(_._1)(new Ordering[Number]() {
```
Use child.dataType.asInstanceOf[NumericType].ordering.
Maybe a dumb question: how can we order a sequence of Number using the Ordering[NumericType#InternalType]?
You could cast the ordering?
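Concretely, the cast could look like this (the same pattern appears in a later revision, quoted further down):

```scala
// NumericType#ordering orders the type's internal representation; cast it so
// it can order the Number keys kept in the counts map.
val ordering =
  child.dataType.asInstanceOf[NumericType].ordering.asInstanceOf[Ordering[Number]]
val sortedCounts = counts.toSeq.sortBy(_._1)(ordering)
```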
```scala
  override def compare(a: Number, b: Number): Int =
    scala.math.signum(a.doubleValue() - b.doubleValue()).toInt
})
val aggreCounts = sortedCounts.scanLeft(sortedCounts.head._1, 0L) {
```
Just use an imperative loop.
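A sketch of an imperative replacement for the scanLeft (assumes sortedCounts: Seq[(Number, Long)] as in the quoted code; builds a running total per distinct value):

```scala
// Accumulate cumulative counts with a plain loop instead of scanLeft.
val accumulatedCounts = new Array[(Number, Long)](sortedCounts.size)
var i = 0
var total = 0L
while (i < sortedCounts.size) {
  val (value, count) = sortedCounts(i)
  total += count
  accumulatedCounts(i) = (value, total)
  i += 1
}
```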
```scala
val lower = position.floor
val higher = position.ceil

// Linear search since this won't take much time from the total execution anyway
```
That doesn't make it right :)... Anyway, there are enough binarySearch implementations around, so maybe use one of those.
This was taken from Hive's UDAFPercentile. It is fine if you do that, but please acknowledge that you have done so by adding a line of documentation. See this for example: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L524
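If a binary search is adopted instead, a lower-bound variant over the accumulated counts could look like this (hypothetical helper, not from the PR or Hive):

```scala
// Index of the first (value, cumulativeCount) pair whose cumulative count
// reaches `position`; accumulatedCounts must be sorted by cumulative count.
def indexAtPosition(accumulatedCounts: IndexedSeq[(Number, Long)], position: Long): Int = {
  var lo = 0
  var hi = accumulatedCounts.length - 1
  while (lo < hi) {
    val mid = (lo + hi) >>> 1
    if (accumulatedCounts(mid)._2 < position) lo = mid + 1 else hi = mid
  }
  lo
}
```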
```scala
counts.foreach { pair =>
  val row = InternalRow.apply(pair._1, pair._2)
  val unsafeRow = projection.apply(row)
  buffer ++= unsafeRow.getBytes
```
This is extremely expensive, because you are resizing the buffer for every entry. Please use a ByteArrayOutputStream and a DataOutputStream. See this for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L226-L239
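A sketch following the linked SparkPlan pattern (counts and projection are assumed from the surrounding code; each row is length-prefixed so the reader can also handle variable-length values):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import org.apache.spark.sql.catalyst.InternalRow

val bos = new ByteArrayOutputStream()
val out = new DataOutputStream(bos)
// Reusable scratch buffer for UnsafeRow.writeToStream, as in the linked example.
val writeBuffer = new Array[Byte](4 << 10)
counts.foreach { case (key, count) =>
  val unsafeRow = projection.apply(InternalRow.apply(key, count))
  // Length-prefix each serialized row, then stream its bytes without
  // reallocating a growing buffer per entry.
  out.writeInt(unsafeRow.getSizeInBytes)
  unsafeRow.writeToStream(out, writeBuffer)
}
out.flush()
val serialized = bos.toByteArray
```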
```scala
// Read the pairs of counts map
val row = new UnsafeRow(2)
val pairRowSizeInBytes = UnsafeRow.calculateFixedPortionByteSize(2)
```
This might cause an issue for a DecimalType; a decimal does not have to be fixed-length. I think we need to write out the row sizes or disallow variable-length keys. BTW, if you only allow fixed-length keys, you could get rid of UnsafeRows and projections and directly use a DataOutputStream.
```scala
  Countings()
}

private def evalPercentages(expr: Expression): Seq[Double] = (expr.dataType, expr.eval()) match {
```
Move this to the definition of percentages. You can also make this much simpler. The analyzer guarantees that you either get a single double or an ArrayData of doubles:

```scala
@transient
private lazy val percentages = percentageExpression.eval() match {
  case p: Double => Seq(p)
  case a: ArrayData => a.toDoubleArray().toSeq
}
```

```scala
copy(inputAggBufferOffset = newInputAggBufferOffset)

// Mark as lazy so that percentageExpression is not evaluated during tree transformation.
private lazy val returnPercentileArray = percentageExpression.dataType.isInstanceOf[ArrayType]
```
Mark it @transient.
```scala
  defaultCheck
} else if (!percentageExpression.foldable) {
  // percentageExpression must be foldable
  TypeCheckFailure(s"The percentage(s) must be a constant literal, " +
```
Nit: no string interpolation.
```scala
} else if (!percentageExpression.foldable) {
  // percentageExpression must be foldable
  TypeCheckFailure(s"The percentage(s) must be a constant literal, " +
    s"but got ${percentageExpression}")
```
Nit: you don't need {...}?
Test build #69186 has finished for PR 14136 at commit
Test build #69188 has finished for PR 14136 at commit
```scala
val sortedCounts = buffer.toSeq.sortBy(_._1)(
  child.dataType.asInstanceOf[NumericType].ordering.asInstanceOf[Ordering[Number]])
val aggreCounts = sortedCounts.scanLeft(sortedCounts.head._1, 0L) {
```
nit: maybe accumulatedCounts is a slightly better name than aggreCounts here?
Currently
Test build #69233 has finished for PR 14136 at commit
## What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

## How was this patch tested?

Add a new testsuite `PercentileSuite` to test percentile directly.

Updated related testcases in `ExpressionToSQLSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>
Author: 蒋星博 <jiangxingbo@meituan.com>
Author: jiangxingbo <jiangxingbo@meituan.com>

Closes #14136 from jiangxb1987/percentile.

(cherry picked from commit 0f5f52a)

Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
LGTM. Merging to master/2.1. Thanks!
@hvanhovell why did this go into branch-2.1? It's way past branch-cut time.
Hi @rxin, I was reading related code around this and saw #14136 (comment). It looks like many of the suggestions for calculating a median are workarounds (e.g., https://stackoverflow.com/a/31437177). I want to use:

```python
from pyspark.sql.functions import *
from pyspark.sql.column import Column, _to_java_column

def approximate_percentile(child, percentage, accuracy=lit(10000)):
    percentile_expr = spark.sparkContext._jvm.org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
    child_expr = _to_java_column(child).expr()
    percentage_expr = _to_java_column(percentage).expr()
    accuracy_expr = _to_java_column(accuracy).expr()
    agg_func = percentile_expr(child_expr, percentage_expr, accuracy_expr)
    return Column(spark._jvm.org.apache.spark.sql.Column(agg_func.toAggregateExpression()))

spark.range(1).groupby().agg(approximate_percentile(col("id"), lit(0.5))).show()
spark.range(1).groupby().pivot("id").agg(approximate_percentile(col("id"), lit(0.5))).show()
```

This code might easily be broken by a Spark version change, as it accesses internal packages via the JVM. I use this workaround for now.

Another alternative would be to port the existing logic on the application side to SQL, but I was wondering if I really should do this for a single case. It might be expensive, but exposing it might also encourage users to at least test this. Could we expose this in Scala/Python/R? It should be pretty easy to expose. Or did I misunderstand the context and the other workarounds?

cc @srowen and @zero323, who I saw answered questions related to this elsewhere (e.g., Stack Overflow).
What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

How was this patch tested?

Add a new testsuite `PercentileSuite` to test percentile directly. Updated related testcases in `ExpressionToSQLSuite`.
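For reference, a minimal usage sketch of the function this PR adds (assumes a running SparkSession named spark; the view name and values are illustrative):

```scala
// Exact percentiles of a numeric column via the new SQL function.
spark.range(1, 11).createOrReplaceTempView("t")
spark.sql("SELECT percentile(id, 0.5) FROM t").show()
spark.sql("SELECT percentile(id, array(0.25, 0.75)) FROM t").show()
```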