Skip to content

Conversation

@jiangxb1987
Copy link
Contributor

@jiangxb1987 jiangxb1987 commented Jul 11, 2016

What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

How was this patch tested?

Add a new testsuite PercentileSuite to test percentile directly.
Updated related testcases in ExpressionToSQLSuite.

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62088 has finished for PR 14136 at commit 1ae3df7.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class Percentile(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Style

@SparkQA
Copy link

SparkQA commented Jul 11, 2016

Test build #62089 has finished for PR 14136 at commit d5d4fa9.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a while loop here; for is not that efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'll update that.

@hvanhovell
Copy link
Contributor

@jiangxb1987 Thanks for working on this. I did a quick pass, and it is a good start. I have a few issues:

  • I am a bit concerned about the memory characteristics. The worst case scenario would be that all data gets moved to a single executor. Could you document this behavior?
  • Could you add support for all numeric types?
  • Could you add SQL tests as well?

@hvanhovell
Copy link
Contributor

A more performant way of this would be to plan this using a combination of count grouped by the percentile key, this percentile function. I am not sure if we should pursue that for this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiangxb1987 I am just curious about why we use OpenHashMap here instead of using mutable.Map to correspond with code here in hive. Is there any specific reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpenHashMap is typically faster and has less overhead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hvanhovell Thanks!

@jiangxb1987
Copy link
Contributor Author

@hvanhovell Thank you for your kindly review, the suggestions are quite useful for me. I'll try to get some time later today to update some fixes. Thanks!

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62313 has finished for PR 14136 at commit 4914174.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987
Copy link
Contributor Author

@hvanhovell I've fixed most of the problems mentioned above, and I also added basic tests and comments as you required. Please find some time to do a pass, thanks!

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62314 has finished for PR 14136 at commit bf6f539.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 14, 2016

Test build #62315 has finished for PR 14136 at commit 2194c9e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we check here if a percentile is valid? Waiting until eval is really late in the game.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also check if the array is not empty.

@vectorijk
Copy link
Contributor

@SparkQA
Copy link

SparkQA commented Jul 16, 2016

Test build #62407 has finished for PR 14136 at commit 62324d6.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 16, 2016

Test build #62408 has finished for PR 14136 at commit 19011ab.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 18, 2016

Test build #62455 has finished for PR 14136 at commit d541b46.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jul 18, 2016

Test build #62466 has finished for PR 14136 at commit 6314611.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 24, 2016

Test build #69110 has started for PR 14136 at commit 4ace3bc.

@jiangxb1987
Copy link
Contributor Author

retest this please.

@SparkQA
Copy link

SparkQA commented Nov 24, 2016

Test build #69121 has finished for PR 14136 at commit 4ace3bc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 25, 2016

Test build #69144 has finished for PR 14136 at commit e01d0b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class CountingsSerializer

Countings()
}

private def evalPercentages(expr: Expression): (Boolean, Seq[Number]) = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return doubles?

copy(inputAggBufferOffset = newInputAggBufferOffset)

// Mark as lazy so that percentageExpression is not evaluated during tree transformation.
private lazy val (returnPercentileArray: Boolean, percentages: Seq[Number]) =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be problematic with serialization. Just put the percentages in a @transient lazy val and inline the use of returnPercentileArray.

override def nullable: Boolean = true

override def dataType: DataType =
if (returnPercentileArray) ArrayType(DoubleType) else DoubleType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should return the type of the input. We can always interpolate the value and cast that to the input type. Is this is different from what Hive does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

HIVE could return double value or array of double values even the column dataType is integer, for example:

hive> insert into tbl values(1,2,5,10);
hive> insert into tbl values(1),(2),(5),(10);
hive> select percentile(a, array(0, 0.25, 0.5, 0.75, 1)) from tbl;
[1.0,1.75,3.5,6.25,10.0]

// Returns null for empty inputs
override def nullable: Boolean = true

override def dataType: DataType =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override lazy val dataType: DataType = percentageExpression.dataType match {
  case _: ArrayType => ArrayType(DoubleType, false)
  case _ => DoubleType
}

Seq(NumericType, TypeCollection(NumericType, ArrayType))

override def checkInputDataTypes(): TypeCheckResult =
TypeUtils.checkForNumericExpr(child.dataType, "function percentile")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call super.checkInputDataTypes(), that will validate the inputTypes(). Also check the percentageExpression, that must foldable and the percentage(s) must be in the range [0, 1].

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW - you can make the analyzer add casts for you:

override def inputTypes: Seq[AbstractDataType] = percentageExpression.dataType match {
  case _: ArrayType => Seq(NumericType, ArrayType(DoubleType, false))
  case _ => Seq(NumericType, DoubleType)
}

Then you are alway sure you get a double or a double array for the percentageExpression.

Copy link
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another pass. My main feedback is to consolidate this more in a single class.

/**
* A class that stores the numbers and their counts, used to support [[Percentile]] function.
*/
class Countings(val counts: OpenHashMap[Number, Long]) extends Serializable {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove this class and put its implementation in the Percentile Aggregate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The class TypedImperativeAggregate[T] requires access of this class, so perhaps we should keep it outside of the Percentile.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could entirely remove the class Countings.

*/
class CountingsSerializer {

final def serialize(obj: Countings, dataType: DataType): Array[Byte] = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just put this in the Percentile class.

return Seq.empty
}

val sortedCounts = counts.toSeq.sortBy(_._1)(new Ordering[Number]() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use child.asInstanceOf[NumericType].ordering.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a dumb question: How can we order a sequence of Number using the Ordering[NumericType#InternalType] ?

Copy link
Contributor

@hvanhovell hvanhovell Nov 26, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could cast the ordering?

override def compare(a: Number, b: Number): Int =
scala.math.signum(a.doubleValue() - b.doubleValue()).toInt
})
val aggreCounts = sortedCounts.scanLeft(sortedCounts.head._1, 0L) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just use an imperative loop.

val lower = position.floor
val higher = position.ceil

// Linear search since this won't take much time from the total execution anyway
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That doesn't make it right :)... Anyway there are enough binarySearch implementations around. So maybe use one of those.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was taken from Hive UDAFPercentile. It is fine if you do that, but please acknowledge that you have done so by adding a line of documentation. See this for example: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L524

counts.foreach { pair =>
val row = InternalRow.apply(pair._1, pair._2)
val unsafeRow = projection.apply(row)
buffer ++= unsafeRow.getBytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely expensive, because you are resizing the buffer for every entry. Please use a ByteArrayOutputStream and a DataOutputStream. See this for an example: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L226-L239


// Read the pairs of counts map
val row = new UnsafeRow(2)
val pairRowSizeInBytes = UnsafeRow.calculateFixedPortionByteSize(2)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might cause an issue for a DecimalType, a decimal does not have to be fixed. I think we need to write out row sizes or not allow variable length keys. BTW if you only allow fixed length keys, you could get rid of UnsafeRows and projections and directly use a DataOutputStream.

Countings()
}

private def evalPercentages(expr: Expression): Seq[Double] = (expr.dataType, expr.eval()) match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this to the definition of percentages. You can also make this much simpler. The analyzer guarantees that you either get a single double, or an ArrayData of double:

@transient
private lazy val percentages = percentageExpression.eval() match {
  case p: Double => Seq(p)
  case a: ArrayData => a.toDoubleArray().toSeq
}

copy(inputAggBufferOffset = newInputAggBufferOffset)

// Mark as lazy so that percentageExpression is not evaluated during tree transformation.
private lazy val returnPercentileArray = percentageExpression.dataType.isInstanceOf[ArrayType]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mark it @transient.

defaultCheck
} else if (!percentageExpression.foldable) {
// percentageExpression must be foldable
TypeCheckFailure(s"The percentage(s) must be a constant literal, " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit no string interpolation.

} else if (!percentageExpression.foldable) {
// percentageExpression must be foldable
TypeCheckFailure(s"The percentage(s) must be a constant literal, " +
s"but got ${percentageExpression}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you don't need {...}?

@SparkQA
Copy link

SparkQA commented Nov 26, 2016

Test build #69186 has finished for PR 14136 at commit b0aabf9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2016

Test build #69188 has finished for PR 14136 at commit 5b8cd4d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val sortedCounts = buffer.toSeq.sortBy(_._1)(
child.dataType.asInstanceOf[NumericType].ordering.asInstanceOf[Ordering[Number]])
val aggreCounts = sortedCounts.scanLeft(sortedCounts.head._1, 0L) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe accumlatedCounts is a slightly better name than aggreCounts here?

@jiangxb1987
Copy link
Contributor Author

Currently ImplicitTypeCasts doesn't support cast between ArrayType(elementType)s, so we have to support ArrayType(NumericType) for now. When we have add that support, we could make the code for analyze percentageExpression more concise.

@SparkQA
Copy link

SparkQA commented Nov 28, 2016

Test build #69233 has finished for PR 14136 at commit 3c699ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in 0f5f52a Nov 28, 2016
asfgit pushed a commit that referenced this pull request Nov 28, 2016
## What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

## How was this patch tested?

Add a new testsuite `PercentileSuite` to test percentile directly.
Updated related testcases in `ExpressionToSQLSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>
Author: 蒋星博 <jiangxingbo@meituan.com>
Author: jiangxingbo <jiangxingbo@meituan.com>

Closes #14136 from jiangxb1987/percentile.

(cherry picked from commit 0f5f52a)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
@hvanhovell
Copy link
Contributor

LGTM. Merging to master/2.1. Thanks!

@rxin
Copy link
Contributor

rxin commented Nov 28, 2016

@hvanhovell why did this go into branch-2.1? It's way past branch cut time.

@hvanhovell
Copy link
Contributor

@rxin this is a very contained patch. It only adds the percentile function. The advantage here is that this further reduces our dependency on relatively slow Hive UDAFs (one more to go), and that we don't have to put in horrible patches like #16034.

robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 2, 2016
## What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

## How was this patch tested?

Add a new testsuite `PercentileSuite` to test percentile directly.
Updated related testcases in `ExpressionToSQLSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>
Author: 蒋星博 <jiangxingbo@meituan.com>
Author: jiangxingbo <jiangxingbo@meituan.com>

Closes apache#14136 from jiangxb1987/percentile.
@jiangxb1987 jiangxb1987 deleted the percentile branch December 20, 2016 02:19
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Implement percentile SQL function. It computes the exact percentile(s) of expr at pc with range in [0, 1].

## How was this patch tested?

Add a new testsuite `PercentileSuite` to test percentile directly.
Updated related testcases in `ExpressionToSQLSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>
Author: 蒋星博 <jiangxingbo@meituan.com>
Author: jiangxingbo <jiangxingbo@meituan.com>

Closes apache#14136 from jiangxb1987/percentile.
@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 6, 2017

Hi @rxin, I was reading related codes around this and saw - #14136 (comment).

It looks many suggestions for calculating median are workarounds (e.g., https://stackoverflow.com/a/31437177).

I want to use approximate_percentile or percentile in groupby/pivot and I tried to deal with this problem for few days. I ended up with a weird code as below, e.g.,:

from pyspark.sql.functions import *
from pyspark.sql.column import Column, _to_java_column


def approximate_percentile(child, percentage, accuracy=lit(10000)):
    percentile_expr = spark.sparkContext._jvm.org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
    child_expr = _to_java_column(child).expr()
    percentage_expr = _to_java_column(percentage).expr()
    accuracy_expr = _to_java_column(accuracy).expr()
    agg_func = percentile_expr(child_expr, percentage_expr, accuracy_expr)
    return Column(spark._jvm.org.apache.spark.sql.Column(agg_func.toAggregateExpression()))


spark.range(1).groupby().agg(approximate_percentile(col("id"), lit(0.5))).show()
spark.range(1).groupby().pivot("id").agg(approximate_percentile(col("id"), lit(0.5))).show()

This code might be easily broken by Spark version as it accesses to internal packages via JVM. I use
Scala/R but also PySpark specifically to avoid version compatibility problem (even between Spark 1.6.x and Spark 2.x) in many cases but this one now becomes flaky.

Another alternative should be to port existing logic in application side to SQL ones but I was wondering if I really should do this for single case.

It might be expensive but exposing it might also promote users to test this at least.

Could we expose this in Scala/Python/R? It should be pretty easy to expose this. Or, did I misunderstand the context and other workarounds?

cc @srowen and @zero323 who I saw answered to the questions related with this outside (e.g., stackoverflow).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants