[SPARK-2862] histogram method fails on some choices of bucketCount #1787
|
Can one of the admins verify this patch? |
|
lgtm |
|
Jenkins, test this please. |
|
QA tests have started for PR 1787. This patch merges cleanly. |
|
QA results for PR 1787: |
Is this a documented Scala bug? If so - can we link to it?
I should have the Scala bug raised today. Will update it once done.
K - thanks! If we can link to an issue or something, that would be great.
|
Jenkins, test this please. |
|
QA tests have started for PR 1787. This patch merges cleanly. |
|
QA results for PR 1787: |
|
Jenkins, retest this please. |
|
QA tests have started for PR 1787. This patch merges cleanly. |
|
QA results for PR 1787: |
|
This seems to be legitimately breaking a test. |
|
Sorry couldn't look into this yesterday. I plan to fix this today. |
|
Scala's NumericRange and the range generated using the 'to' method should produce similar results but there are rounding differences. I need to spend more time investigating the cause. Meanwhile, the Scala bug is documented at https://issues.scala-lang.org/browse/SI-8782. |
|
Yeah, the problem is now this, essentially: The problem is that naively adding the increment makes for increasing rounding errors as the range goes on. At 100 slices: Here's an attempt to write a version that is both more accurate and forces the end of the range to be correct: A little bit inelegant, and still not perfect, but better and fixes the original problem still: What do you guys think? @nrchandan |
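The drift described in this comment can be reproduced with a minimal sketch. The helper names `naiveRange` and `pinnedRange` are hypothetical and are not taken from the patch; `pinnedRange` is only one plausible reading of "more accurate and forces the end of the range to be correct".

```scala
// Hypothetical helper: builds the range by repeatedly adding the step,
// so each element inherits the rounding error of the previous one.
def naiveRange(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val step = (max - min) / steps
  Iterator.iterate(min)(_ + step).take(steps + 1).toIndexedSeq
}

// Hypothetical helper: computes each element from its index, so errors do
// not accumulate, and appends max explicitly so the endpoint is exact.
def pinnedRange(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val span = max - min
  (0 until steps).map(s => min + (s * span) / steps) :+ max
}

// The last element of the naive range need not equal max;
// the pinned range ends at max by construction.
println(naiveRange(0.0, 1.0, 10).last)
println(pinnedRange(0.0, 1.0, 10).last)
```

With more steps the naive version has more additions in which to accumulate error, which matches the observation about 100 slices.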
|
@srowen Your approach should work. I will give it a try tomorrow. |
|
@pwendell @srowen This version passes all test cases. Also added a new test case (the one specified in JIRA #SPARK-2862) |
|
Jenkins, test this please. |
|
QA tests have started for PR 1787. This patch merges cleanly. |
|
QA results for PR 1787: |
We can save some FLOPs by computing the stepSize first:
val stepSize = (max - min) / steps
Range.Int(0, steps, 1).map(s => min + s * stepSize) :+ max
I thought the order of operations would matter for the result, so I wrote it that way on purpose, but it's actually fine this way too. I think the win here is simply not repeatedly adding 1/steps.
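For reference, the two orderings under discussion can be written side by side. This is a sketch: the body of `customRange1` follows the stepSize-first suggestion above, while the body of `customRange` is an assumption, since this thread does not show it. `0 until steps` is used here in place of `Range.Int(0, steps, 1)`.

```scala
// Assumed shape of customRange: multiply by the full span first, then
// divide by steps, so each element involves one rounded division.
def customRange(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val span = max - min
  (0 until steps).map(s => min + (s * span) / steps) :+ max
}

// stepSize-first variant (called customRange1 later in the thread): the
// rounding error already present in stepSize is scaled by s.
def customRange1(min: Double, max: Double, steps: Int): IndexedSeq[Double] = {
  val stepSize = (max - min) / steps
  (0 until steps).map(s => min + s * stepSize) :+ max
}

println(customRange(1.0, 2.0, 10))
println(customRange1(1.0, 2.0, 10))
```

Both variants compute each element from its index rather than from the previous element, and both append `max` explicitly, so neither accumulates error across the range.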
Both approaches should give reasonable results because the last element is not computed but appended as max directly. The edge case is when both min and max are extremely small values (customRange1 computes stepSize first):
scala> customRange(1e-322, 2e-322, 10)
res15: IndexedSeq[Double] = Vector(1.0E-322, 1.1E-322, 1.2E-322, 1.3E-322, 1.4E-322, 1.5E-322, 1.58E-322, 1.7E-322, 1.8E-322, 1.9E-322, 2.0E-322)
scala> customRange1(1e-322, 2e-322, 10)
res14: IndexedSeq[Double] = Vector(1.0E-322, 1.1E-322, 1.2E-322, 1.3E-322, 1.4E-322, 1.5E-322, 1.58E-322, 1.7E-322, 1.8E-322, 1.9E-322, 2.0E-322)
scala> customRange(1e-323, 2e-323, 10)
res12: IndexedSeq[Double] = Vector(1.0E-323, 1.0E-323, 1.0E-323, 1.5E-323, 1.5E-323, 1.5E-323, 1.5E-323, 1.5E-323, 2.0E-323, 2.0E-323, 2.0E-323)
scala> customRange1(1e-323, 2e-323, 10)
res13: IndexedSeq[Double] = Vector(1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 1.0E-323, 2.0E-323)
I think we can ignore this edge case.
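A short sketch of why the 1e-323 case behaves this way: values that small are subnormal doubles, spaced one `Double.MinPositiveValue` apart, so the span between min and max is only a couple of representable steps and dividing it by 10 underflows to zero. The helper name `stepSizeFirst` is hypothetical.

```scala
// Hypothetical helper: the step size as the stepSize-first variant computes it.
def stepSizeFirst(min: Double, max: Double, steps: Int): Double =
  (max - min) / steps

// Near 1e-323 the gap between adjacent doubles is the smallest positive double.
println(java.lang.Math.ulp(1e-323) == Double.MinPositiveValue)

// The step underflows to 0.0, so min + s * stepSize stays at min for every s.
println(stepSizeFirst(1e-323, 2e-323, 10))
```

Multiplying by the index before dividing keeps a little more information, which is why the other variant still produces a few distinct intermediate values in the output above.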
customRange1 actually fails the original test case. Continuing with the customRange code by @srowen:
scala> customRange1(1.0, 2.0, 10)
res2: IndexedSeq[Double] = Vector(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7000000000000002, 1.8, 1.9, 2.0)
scala> customRange(1.0, 2.0, 10)
res0: IndexedSeq[Double] = Vector(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
Actually both are correct in this case. This is why we should tolerate small numeric differences in tests. @srowen's version looks good.
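One way to tolerate such differences in a test is to compare element-wise within a small absolute tolerance instead of using exact equality. The sketch below is illustrative only: the helper `approxEqual` and the tolerance `1e-12` are assumptions and are not taken from the Spark test suite.

```scala
// Hypothetical helper: true when both sequences have the same length and
// every pair of elements differs by at most eps.
def approxEqual(a: Seq[Double], b: Seq[Double], eps: Double = 1e-12): Boolean =
  a.length == b.length &&
    a.zip(b).forall { case (x, y) => math.abs(x - y) <= eps }

val expectedBuckets = Vector(1.6, 1.7, 1.8)
val computedBuckets = Vector(1.6, 1.7000000000000002, 1.8)

println(expectedBuckets == computedBuckets)            // exact comparison
println(approxEqual(expectedBuckets, computedBuckets)) // tolerant comparison
```

An absolute tolerance suits ranges of moderate magnitude like this one; for very large or very small bucket boundaries a relative tolerance would be a more appropriate choice.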
[SPARK-2862] Fix a typo, add a test case, modify a test case
|
Jenkins, test this please. |
|
QA tests have started for PR 1787 at commit
|
|
QA tests have finished for PR 1787 at commit
|
|
LGTM. Merged into master and branch-1.1. Thanks!! |
Author: Chandan Kumar <chandan.kumar@imaginea.com>
Closes #1787 from nrchandan/spark-2862 and squashes the following commits:
a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid Scala bug
(cherry picked from commit f45efbb)
Signed-off-by: Xiangrui Meng <meng@databricks.com>