[SPARK-17306] [SQL] QuantileSummaries doesn't compress #14976
Changes from all commits
```diff
@@ -17,7 +17,7 @@
 package org.apache.spark.sql.catalyst.util

-import scala.collection.mutable.ArrayBuffer
+import scala.collection.mutable.{ArrayBuffer, ListBuffer}

 import org.apache.spark.sql.catalyst.util.QuantileSummaries.Stats
@@ -61,7 +61,12 @@ class QuantileSummaries(
   def insert(x: Double): QuantileSummaries = {
     headSampled += x
     if (headSampled.size >= defaultHeadSize) {
-      this.withHeadBufferInserted
+      val result = this.withHeadBufferInserted
+      if (result.sampled.length >= compressThreshold) {
+        result.compress()
+      } else {
+        result
+      }
     } else {
       this
     }
@@ -236,7 +241,7 @@ object QuantileSummaries {
     if (currentSamples.isEmpty) {
       return Array.empty[Stats]
     }
-    val res: ArrayBuffer[Stats] = ArrayBuffer.empty
+    val res = ListBuffer.empty[Stats]

     // Start for the last element, which is always part of the set.
     // The head contains the current new head, that may be merged with the current element.
     var head = currentSamples.last
```
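The middle hunk is the substance of the fix: `insert` previously returned `withHeadBufferInserted` without ever compressing, so `sampled` could grow without bound. The same control flow can be sketched with a toy summary (deliberately simplified; the "compression" below just thins samples and stands in for the real Greenwald-Khanna merging):

```scala
// Toy model of the insert-then-maybe-compress pattern in the patch above.
// Names and the thinning strategy are illustrative, not Spark's actual code.
final case class ToySummary(sampled: Vector[Double], compressThreshold: Int) {
  def insert(x: Double): ToySummary = {
    val inserted = copy(sampled = sampled :+ x)
    // This is the guard the patch adds: compress once the buffer is big enough.
    if (inserted.sampled.length >= compressThreshold) inserted.compress()
    else inserted
  }
  // Stand-in "compression": keep every other sample.
  def compress(): ToySummary =
    copy(sampled = sampled.zipWithIndex.collect { case (v, i) if i % 2 == 0 => v })
}

val summary = (1 to 1000).foldLeft(ToySummary(Vector.empty, 100)) {
  (s, i) => s.insert(i.toDouble)
}
// With the compress call wired into insert, the buffer stays bounded by the
// threshold; without it, sampled.length would be 1000.
println(summary.sampled.length)
```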
CC @thunterdb -- is this the right fix?

I also 'adjusted' calls to `.append()`, which is actually a varargs method; `+=` appends a single element.
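A quick standalone illustration of that distinction (the varargs signature described in the comment is the one from the Scala 2.11-era collections Spark used at the time):

```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer.empty[Double]
buf += 1.0   // += appends exactly one element
buf += 2.0
// In the Scala 2.11/2.12 collections library, append was declared as
// append(elems: A*) -- a varargs method -- so each call wraps its arguments
// in a Seq. Using += avoids that per-call wrapping.
buf.append(3.0)
println(buf.mkString(", "))  // 1.0, 2.0, 3.0
```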
@srowen
I think the compression decision needs to be tied to the relative error setting: the smaller the relative error, the less frequently we should compress.

When implementing the aggregation function percentile_approx, I implemented compression like this:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala#L214
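The idea can be sketched as follows (the helper name and constant are illustrative, not Spark's actual choice in ApproximatePercentile):

```scala
// Hypothetical helper: derive a compression trigger from the relative error.
// A smaller relative error means more samples must be retained (roughly
// O(1/error)), so the threshold grows and compression runs less frequently.
def compressThresholdFor(relativeError: Double): Int = {
  require(relativeError > 0.0 && relativeError < 1.0)
  math.max(1, (10.0 / relativeError).toInt)  // the factor 10 is made up
}

println(compressThresholdFor(0.1))   // coarse summary: compress often
println(compressThresholdFor(0.001)) // precise summary: compress rarely
```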
By the way, would you mind changing `ApproximatePercentile` altogether? The compression in `ApproximatePercentile` will not be necessary if we have done compression in `QuantileSummaries`.
Yes, this is the fix, thanks for doing it. I had never realized that `.append` takes a vararg input, thanks for the hint.
@clockfly oh, hm, this code just landed recently? It seems to call `compress()` itself rather than leave it to `QuantileSummaries`, in which case I'm not clear why there's a `compressThreshold` in `QuantileSummaries`. It seems like the new class is trying to manage it. What's the right way to rationalize this -- are you saying `QuantileSummaries` shouldn't manage compression at all? That's fine too (in which case this can just turn into a very small optimization change).
@clockfly @srowen the compression threshold is just here to amortize the cost of performing compression. If you wanted to, you could run compression every iteration (it is an idempotent operation). Internally, the `compress` method uses a merging threshold that indeed depends on the number of elements seen, but it operates on a number of samples that is bounded by O(1/ε).

This patch will work. @clockfly I suspect some of the wrappers done in `ApproximatePercentile` are not required either, once I submit a PR that fixes an off-by-1 error.
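The idempotence point can be illustrated with a toy (this "compression" just drops samples closer together than a gap; it stands in for the real merge, which is likewise a no-op on already-compressed input):

```scala
// Keep only samples at least `gap` apart (input assumed sorted ascending).
def compress(samples: List[Double], gap: Double): List[Double] =
  samples.foldLeft(List.empty[Double]) {
    case (Nil, x)                                  => List(x)
    case (acc @ (last :: _), x) if x - last >= gap => x :: acc
    case (acc, _)                                  => acc
  }.reverse

val once  = compress((1 to 20).map(_.toDouble).toList, 3.0)
val twice = compress(once, 3.0)
// Running compression a second time changes nothing, so calling it on every
// insert would be correct, just wasteful; compressThreshold only amortizes
// the cost.
println(once == twice)  // true
```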
@srowen @thunterdb
I think the compression still needs to be done in `QuantileSummaries`. I added some compression implementation in the wrapper class `ApproximatePercentile` because `ApproximatePercentile` needs to know whether the `QuantileSummaries` is compressed or not; otherwise `ApproximatePercentile` doesn't know whether it is OK to call `def query(quantile: Double)` on `QuantileSummaries`.

Maybe `QuantileSummaries` should expose an API like `isCompressed`, so that the caller can skip calling `compress` if the `QuantileSummaries` is already compressed.
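That suggestion could look something like the following (a hypothetical API sketch, not code from Spark; the real merging and query logic are elided):

```scala
// Carry a `compressed` flag so a caller such as ApproximatePercentile can
// skip a redundant compress() before query().
final case class Summary(samples: Vector[Double], compressed: Boolean) {
  def insert(x: Double): Summary =
    copy(samples = samples :+ x, compressed = false)
  def compress(): Summary =
    if (compressed) this                  // isCompressed-style fast path
    else copy(compressed = true)          // real merging elided in this sketch
  def isCompressed: Boolean = compressed
  def query(quantile: Double): Double = {
    require(isCompressed, "call compress() before query()")
    // Naive quantile lookup over the (sorted, toy) sample vector.
    samples(((samples.length - 1) * quantile).toInt)
  }
}

val s = Summary(Vector.empty, compressed = true)
  .insert(1.0).insert(2.0).insert(3.0)
val c = s.compress()
println(c.query(0.5))  // 2.0
```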