[SPARK-2862] histogram method fails on some choices of bucketCount #1787

nrchandan · 2014-08-05T12:19:29Z

No description provided.

AmplabJenkins · 2014-08-05T12:22:46Z

Can one of the admins verify this patch?

davies · 2014-08-06T00:36:42Z

lgtm

rxin · 2014-08-06T01:22:19Z

Jenkins, test this please.

SparkQA · 2014-08-06T01:24:33Z

QA tests have started for PR 1787. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17968/consoleFull

SparkQA · 2014-08-06T01:25:21Z

QA results for PR 1787:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17968/consoleFull

pwendell · 2014-08-06T01:26:51Z

core/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala

Is this a documented Scala bug? If so - can we link to it?

I should have the Scala bug raised today. Will update it once done.

K - thanks! If we can link to an issue or something, that would be great.

nrchandan · 2014-08-06T07:07:07Z

Added Scala bug ID. Fixed the coding convention. Ready to retest. Cc @davies @rxin

pwendell · 2014-08-06T07:29:31Z

Jenkins, test this please.

pwendell · 2014-08-06T23:58:39Z

Jenkins, test this please.

SparkQA · 2014-08-07T00:03:47Z

QA tests have started for PR 1787. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18084/consoleFull

SparkQA · 2014-08-07T00:49:20Z

QA results for PR 1787:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18084/consoleFull

pwendell · 2014-08-07T00:52:55Z

Jenkins, retest this please.

SparkQA · 2014-08-07T00:59:24Z

QA tests have started for PR 1787. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18088/consoleFull

SparkQA · 2014-08-07T01:43:48Z

QA results for PR 1787:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18088/consoleFull

pwendell · 2014-08-08T01:55:52Z

This seems to be legitimately breaking a test.

nrchandan · 2014-08-08T05:25:04Z

Sorry couldn't look into this yesterday. I plan to fix this today.

nrchandan · 2014-08-08T12:01:36Z

Scala's NumericRange and the range generated using the 'to' method should produce similar results but there are rounding differences. I need to spend more time investigating the cause. Meanwhile, the Scala bug is documented at https://issues.scala-lang.org/browse/SI-8782.

srowen · 2014-08-08T13:20:26Z

Yeah, the problem is now this, essentially:

scala> (1.0 to (2.0, 1.0/10.0)).toArray
res1: Array[Double] = Array(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7000000000000002, 1.8, 1.9, 2.0)

The problem is that naively adding the increment makes for increasing rounding errors as the range goes on. At 100 slices:

scala> (1.0 to (2.0, 1.0/100.0)).takeRight(10)
res4: scala.collection.immutable.IndexedSeq[Double] = Vector(1.9100000000000008, 1.9200000000000008, 1.9300000000000008, 1.9400000000000008, 1.9500000000000008, 1.9600000000000009, 1.9700000000000009, 1.9800000000000009, 1.9900000000000009, 2.000000000000001)

Here's an attempt to write a version that is both more accurate and forces the end of the range to be correct:

def range(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val span = max - min
  Range.Int(0, steps, 1).map(s => min + (s * span) / steps) :+ max
}

A little bit inelegant, and still not perfect, but better and fixes the original problem still:

scala> range(1.0, 2.0, 10)
res5: IndexedSeq[Double] = Vector(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)

scala> range(1.0, 2.0, 100).takeRight(10)
res7: IndexedSeq[Double] = Vector(1.9100000000000001, 1.92, 1.9300000000000002, 1.94, 1.95, 1.96, 1.97, 1.98, 1.99, 2.0)

What do you guys think? @nrchandan
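A minimal sketch of the two strategies discussed in this comment, for side-by-side comparison. It is not the code merged in this PR: `RangeDrift`, `byAccumulation`, and `byIndex` are hypothetical names introduced only for illustration, and `byIndex` simply mirrors the `range` function above.

```scala
// Hypothetical illustration only; names are not from the Spark code base.
object RangeDrift {
  // Builds steps + 1 boundaries by repeatedly adding the step.
  // Each addition can round, so the error may grow along the range
  // and the last element is not guaranteed to equal max.
  def byAccumulation(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
    val step = (max - min) / steps
    Iterator.iterate(min)(_ + step).take(steps + 1).toIndexedSeq
  }

  // Computes each boundary directly from its integer index, so the
  // rounding of one element does not feed into the next, then appends
  // max itself so the end of the range is exact by construction.
  def byIndex(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
    val span = max - min
    (0 until steps).map(s => min + (s * span) / steps) :+ max
  }
}
```

Both functions return `steps + 1` boundaries. Only `byIndex` guarantees that the final boundary is exactly `max`, which is the property the histogram buckets rely on.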

nrchandan · 2014-08-11T09:38:45Z

@srowen Your approach should work. I will give it a try tomorrow.

nrchandan · 2014-08-13T09:50:11Z

@pwendell @srowen This version passes all test cases. Also added a new test case (the one specified in JIRA #SPARK-2862)

pwendell · 2014-08-13T20:20:25Z

Jenkins, test this please.

SparkQA · 2014-08-13T20:24:51Z

QA tests have started for PR 1787. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18462/consoleFull

SparkQA · 2014-08-13T21:16:02Z

QA results for PR 1787:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18462/consoleFull

mengxr · 2014-08-16T22:45:46Z

core/src/main/scala/org/apache/spark/rdd/DoubleRDDFunctions.scala

We can save some FLOPs by computing the stepSize first:

val stepSize = (max - min) / steps
Range.Int(0, steps, 1).map(s => min + s * stepSize) :+ max

I thought the order of operations would matter for the result, so I wrote it that way on purpose, but it's actually fine this way too. I think the win here is simply not repeatedly adding 1/steps.

Both approaches should give reasonable results because the last one is skipped. The edge case is that both min and max are very very small values (customRange1 computes stepSize first):

scala> customRange(1e-322, 2e-322, 10)
res15: IndexedSeq[Double] = Vector(1.0E-322, 1.1E-322, 1.2E-322, 1.3E-322, 1.4E-322, 1.5E-322, 1.58E-322, 1.7E-322, 1.8E-322, 1.9E-322, 2.0E-322)

scala> customRange1(1e-322, 2e-322, 10)
res14: IndexedSeq[Double] = Vector(1.0E-322, 1.1E-322, 1.2E-322, 1.3E-322, 1.4E-322, 1.5E-322, 1.58E-322, 1.7E-322, 1.8E-322, 1.9E-322, 2.0E-322)

scala> customRange(1e-323, 2e-323, 10)
res12: IndexedSeq[Double] = Vector(1.0E-323, 1.0E-323, 1.0E-323, 1.5E-323, 1.5E-323, 1.5E-323, 1.5E-323, 1.5E-323, 2.0E-323, 2.0E-323, 2.0E-323)

scala> customRange1(1e-323, 2e-323, 10)
res13: IndexedSeq[Double] = Vector(1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 2.0E-323)

I think we can ignore this edge case.
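For anyone curious why the 1e-323 case behaves this way, the sketch below shows the arithmetic. The smallest positive Double is `Double.MinPositiveValue` (about 4.9e-324), so 1e-323 and 2e-323 are only two and four of those units, and a tenth of their difference underflows to zero. `SubnormalStep` and its field names are hypothetical and exist only for this illustration.

```scala
// Hypothetical illustration only; names are not from the Spark code base.
object SubnormalStep {
  val unit: Double = Double.MinPositiveValue // smallest positive Double
  val min: Double = 1e-323                   // parses to 2 * unit
  val max: Double = 2e-323                   // parses to 4 * unit
  val span: Double = max - min               // exactly 2 * unit

  // Step-first variant: 0.2 * unit is not representable and rounds to 0.0,
  // so every boundary except the appended max collapses onto min.
  val stepSize: Double = span / 10
  val stepFirstBoundary: Double = min + 3 * stepSize

  // Multiply-first variant: (3 * span) / 10 is 0.6 * unit, which rounds
  // up to one unit, so the boundary moves off min.
  val multiplyFirstBoundary: Double = min + (3 * span) / 10
}
```

With only three representable values in the interval, neither variant can produce ten distinct buckets, which supports treating this as an ignorable edge case.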

customRange1 actually fails the original test case. Continuing with customRange code by @srowen

scala> customRange1(1.0, 2.0, 10)
res2: IndexedSeq[Double] = Vector(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7000000000000002, 1.8, 1.9, 2.0)

scala> customRange(1.0, 2.0, 10)
res0: IndexedSeq[Double] = Vector(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)

Actually both are correct in this case. This is why we should tolerate small numeric difference in tests. @srowen 's version looks good.
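A minimal sketch of what tolerating small numeric differences in a test could look like. This is an assumption about the approach, not the assertion actually used in the Spark test suite; `ApproxCompare`, `approxEqual`, and the default `eps` of 1e-9 are hypothetical choices made for illustration.

```scala
// Hypothetical illustration only; names are not from the Spark code base.
object ApproxCompare {
  // True when both sequences have the same length and every pair of
  // corresponding elements differs by at most eps (absolute difference).
  def approxEqual(a: Seq[Double], b: Seq[Double], eps: Double = 1e-9): Boolean =
    a.length == b.length &&
      a.zip(b).forall { case (x, y) => math.abs(x - y) <= eps }
}
```

With a check like this, bucket boundaries such as 1.7000000000000002 and 1.7 compare as equal, while a missing or extra bucket still fails the comparison. An absolute tolerance is only reasonable when the expected values are of moderate magnitude; a relative tolerance would suit very large or very small ranges better.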

[SPARK-2862] Fix a typo, add a test case, modify a test case

mengxr · 2014-08-18T08:17:32Z

Jenkins, test this please.

SparkQA · 2014-08-18T08:20:13Z

QA tests have started for PR 1787 at commit a76bbf6.

This patch merges cleanly.

SparkQA · 2014-08-18T09:12:25Z

QA tests have finished for PR 1787 at commit a76bbf6.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-08-18T16:53:38Z

LGTM. Merged into master and branch-1.1. Thanks!!

Author: Chandan Kumar <chandan.kumar@imaginea.com>

Closes #1787 from nrchandan/spark-2862 and squashes the following commits:

a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid Scala bug

(cherry picked from commit f45efbb)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

[SPARK-2862] Use shorthand range notation to avoid Scala bug

13854f1

pwendell reviewed Aug 6, 2014
[SPARK-2862] Add Scala bug id

4211eea

nrchandan changed the title ~~[SPARK-2862] Use shorthand range notation to avoid Scala bug~~ [SPARK-2862] histogram method fails on some choices of bucketCount Aug 14, 2014

mengxr reviewed Aug 16, 2014
[SPARK-2862] Fix for a broken test case and add new test cases

a76bbf6

asfgit closed this in f45efbb Aug 18, 2014

zsxwing mentioned this pull request Sep 23, 2015

[SPARK-10224][Streaming]Fix the issue that blockIntervalTimer won't call updateCurrentBuffer when stopping #8417

Closed
