
Conversation

@srowen
Member

@srowen srowen commented Nov 25, 2019

What changes were proposed in this pull request?

Make separate source trees for Scala 2.12 and 2.13 in order to accommodate mutually incompatible support for Ordering of Double and Float.

Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways:

  • (Split source tree)
  • Inline the Scala 2.12 implementation
  • Reflection

For this change alone, any of these would work, and splitting the source tree is a bit of overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this.
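
For reference, here is a rough sketch of how the version-specific shim could look; the 2.13 body matches the OrderingUtil this change adds, while the 2.12 counterpart is only an assumed illustration, not the exact diff:

// src/main/scala-2.13/.../OrderingUtil.scala
private[spark] object OrderingUtil {
  // NaN sorts above every other value, like java.lang.Double.compare
  def compareDouble(x: Double, y: Double): Int =
    Ordering.Double.TotalOrdering.compare(x, y)
}

// src/main/scala-2.12/.../OrderingUtil.scala (assumed counterpart)
// private[spark] object OrderingUtil {
//   def compareDouble(x: Double, y: Double): Int =
//     Ordering.Double.compare(x, y)
// }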

Why are the changes needed?

Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13.

TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12's Ordering.Double. They differ in how NaN is handled: does it compare greater than all other values, or do comparisons involving NaN always return false? In theory they have different uses: TotalOrdering matters when floating-point values are sorted; IeeeOrdering behaves like 2.12 and the JVM comparison operators.

I chose TotalOrdering because I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It would also be possible to support this with two methods.

Does this PR introduce any user-facing change?

Pending tests; we will see whether it obviously affects any sort order, in particular whether it changes the NaN sort order.

How was this patch tested?

Existing tests so far.

@srowen srowen requested a review from gengliangwang November 25, 2019 01:24
@SparkQA

SparkQA commented Nov 25, 2019

Test build #114364 has finished for PR 26654 at commit 2432d65.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114367 has finished for PR 26654 at commit 57fc3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

It seems reasonable to me.
BTW, how can we let Jenkins test with Scala 2.13 as well?

@srowen
Member Author

srowen commented Nov 25, 2019

We can make a Jenkins job later; it's nowhere near passing yet. I am testing locally on 2.13. There is at least one large change that I want to hold until after branch-3.0 is cut, and at least one we can't resolve yet. These changes are relatively easy ones with little or no impact on 2.12.

@dongjoon-hyun
Member

Shall we remove [WIP] from the PR title?

@srowen srowen changed the title [WIP][SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 [SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 Nov 26, 2019
@srowen
Member Author

srowen commented Nov 26, 2019

Yep, sounds like there's no objection to this approach.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Merged to master.
Thank you, @srowen and @gengliangwang !

@gatorsmile
Member

How about the codegen results? Are they consistent?

@cloud-fan @rednaxelafx

@srowen
Member Author

srowen commented Nov 27, 2019

@gatorsmile good question. This change won't affect Scala 2.12 at all. How it affects 2.13 is still unknown; 2.13 isn't compilable just yet, so we can't run the tests. The compile changes are large enough that I'm tackling them iteratively, but at the end we may revise the 2.13 changes to ensure correctness, etc. This is a good example.

@cloud-fan
Contributor

In codegen we use Utils.nanSafeCompareDoubles. Shall we simply implement the Ordering ourselves using the same method? Then it's guaranteed that the codegen and interpreted code paths have the same behavior.
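
A minimal sketch of that idea (the object name here is hypothetical, and it assumes Utils.nanSafeCompareDoubles keeps its current (Double, Double) => Int shape):

import org.apache.spark.util.Utils

// Hypothetical sketch: define only compare; lt/lteq/gt/gteq then derive from it,
// so interpreted comparisons agree with the codegen path that calls nanSafeCompareDoubles.
object NaNSafeDoubleOrdering extends Ordering[Double] {
  def compare(x: Double, y: Double): Int = Utils.nanSafeCompareDoubles(x, y)
}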

@srowen
Member Author

srowen commented Nov 27, 2019

Yeah, that method does what TotalOrdering does. We could adopt it instead of using Scala 2.13's TotalOrdering; it won't matter.

But the two sites that changed here used Scala 2.12's Ordering.Double, which does not work like that method. For SorterSuite it won't matter; for numerics.scala, it could.

We can either leave this in place to maintain the 2.12 behavior, or, if we think the current 2.12 behavior in numerics.scala is a bit wrong and inconsistent, we can instead standardize on the Utils method everywhere and remove the need for OrderingUtil. (There will be other reasons we need the parallel build tree, though.)

WDYT?

@cloud-fan
Contributor

hmm, if the codegen and interpreted behaviors are already inconsistent now with Scala 2.12, it's a serious bug. @viirya @maropu can you take a look?

@srowen
Member Author

srowen commented Nov 27, 2019

Some extra context. Here's the key difference between the two 2.13 orderings:

scala> Ordering.Double.TotalOrdering.compare(Double.NaN, Double.MaxValue)
res5: Int = 1
scala> Ordering.Double.TotalOrdering.gt(Double.NaN, Double.MaxValue)
res6: Boolean = true
scala> Ordering.Double.IeeeOrdering.compare(Double.NaN, Double.MaxValue)
res7: Int = 1
scala> Ordering.Double.IeeeOrdering.gt(Double.NaN, Double.MaxValue)
res8: Boolean = false

The weird thing is IeeeOrdering tries to work like the <, <=, etc operators, but still uses java.lang.Double.compare for compare, which seems internally inconsistent.

Scala 2.12's implementation of Ordering.Double looks the same as IeeeOrdering:

  trait DoubleOrdering extends Ordering[Double] {
    outer =>

    def compare(x: Double, y: Double) = java.lang.Double.compare(x, y)

    override def lteq(x: Double, y: Double): Boolean = x <= y
    override def gteq(x: Double, y: Double): Boolean = x >= y
    override def lt(x: Double, y: Double): Boolean = x < y
    override def gt(x: Double, y: Double): Boolean = x > y
   ...

TotalOrdering delegates all of that to the result of java.lang.Double.compare, which appears to already handle NaNs consistently by sorting them above all other non-NaN values.
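
For example (REPL-style, per the java.lang.Double.compare contract):

java.lang.Double.compare(Double.NaN, Double.MaxValue)          // positive: NaN sorts above MaxValue
java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)  // positive: NaN sorts above Infinity
java.lang.Double.compare(Double.NaN, Double.NaN)               // 0: NaNs compare as equal to each other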

Actually, do we need Utils.nanSafeCompareDoubles then, if it's the same?

So, back to the point: the question is whether this difference matters in the current build. The difference arises in a few places:

  • Double/FloatExactNumeric in numerics.scala used in Double/FloatType as their 'exactNumeric' implementation
  • Double/FloatType, which define an ordering based on Utils.nanSafeCompareXXX, but also use things like DoubleAsIfIntegral which inherits from DoubleOrdering.

I don't know whether the ordering in these two places actually matters; at a glance, it looks like these objects do not use them for ordering. I don't know the details well enough to figure out whether there is a subtle issue there.

I didn't immediately see a test which exercises something like sorting NaNs with non-NaN values, and I'm not sure how to write one that checks codegen vs. interpreted. But that would tell us one way or the other whether there is an issue.
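
Something along these lines might be a starting point, though it's an untested sketch and assumes that toggling spark.sql.codegen.wholeStage is enough to exercise both code paths for a sort:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nan-sort-check").getOrCreate()
import spark.implicits._

def sortedBits(codegen: Boolean): Seq[Long] = {
  spark.conf.set("spark.sql.codegen.wholeStage", codegen)
  Seq(Double.NaN, 1.0, Double.PositiveInfinity, Double.NegativeInfinity, Double.NaN)
    .toDF("v").orderBy("v").as[Double].collect().toSeq
    .map(d => java.lang.Double.doubleToLongBits(d))  // NaN-safe equality on the results
}

// NaN should sort above everything in both paths; a mismatch would point at an
// Ordering inconsistency between codegen and interpreted evaluation.
assert(sortedBits(codegen = true) == sortedBits(codegen = false))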

In any event, it seems like:

  • The choice of TotalOrdering in 2.13 is good, as it's consistent with Double.compare and Utils.nanSafeCompareDoubles
  • We will have to see whether the 2.13 tests pass later, and revisit if not
  • We might even be able to remove nanSafeCompareDoubles for simplicity, but that's not important

Comment on lines +25 to +26
* It functions like Ordering.Double.TotalOrdering in Scala 2.13, which matches java.lang.Double
* rather than Scala 2.12's Ordering.Double in handling of NaN.
Member


hmm, doesn't Scala 2.12's Ordering.Double.compare delegate to java.lang.Double.compare?

why does this match java.lang.Double, but not Scala 2.12's Ordering.Double?

Member Author

It does, for compare. The superclass Ordering then defines operations like lt, lteq, etc. in terms of compare, but 2.12's Ordering.Double overrides them to use the operators <, <=, and so on. As far as I can tell it already presents a consistent total ordering via compare (as java.lang.Double.compare appears to), but that ordering isn't consistent with how NaN behaves in lt, etc. Then again... perhaps neither is Java: its Comparator.naturalOrder() would work consistently, but those comparisons don't match the Java operators.

Member Author

So I guess the question is: where do these Orderings get used by Spark? It's not immediately clear that they're used at all. If they're used for sorting, all is well, I think, as sorting would use compare, and all of the implementations in question behave the same way there.

If they're used to evaluate how doubles compare somewhere else, should those answers be consistent with the sort ordering, or with the Java/Scala operators? I'd presume the former, but that's not how it works right now, and the choice of TotalOrdering changes that in 2.13.

  • If we think the current behavior is correct, and matters, then 2.12 is OK, and we use IeeeOrdering in 2.13 to be conservative
  • If the current behavior doesn't matter, it doesn't matter what we choose; TotalOrdering feels more logical
  • If the current behavior is wrong, we can patch 2.12 to work like 2.13's TotalOrdering. Then the 2.13 choice, TotalOrdering, is already correct

I actually suspect it doesn't matter and doesn't get used.

@viirya
Member

viirya commented Nov 27, 2019

hmm, if the codegen and interpreted behaviors are already inconsistent now with Scala 2.12, it's a serious bug. @viirya @maropu can you take a look?

You mean current master? I compared Utils.nanSafeCompareDoubles with DoubleExactNumeric.compare:

left: -1.7976931348623157E308, right: NaN, compare(codegen): -1, compare(interpreted): -1
left: NaN, right: -1.7976931348623157E308, compare(codegen): 1, compare(interpreted): 1
left: 1.7976931348623157E308, right: NaN, compare(codegen): -1, compare(interpreted): -1
left: NaN, right: 1.7976931348623157E308, compare(codegen): 1, compare(interpreted): 1
left: NaN, right: NaN, compare(codegen): 0, compare(interpreted): 0 

@srowen
Member Author

srowen commented Nov 27, 2019

Yeah, that's to be expected; the compare functionality hasn't changed. (I think it highlights that Double.compare already does what Utils.nanSafeCompareDoubles does?) The question, I think, is whether the definition of lt, lteq, etc. matters to Spark. I mean, we know it behaves differently in the two Scala 2.13 implementations: IeeeOrdering follows the JVM operators, TotalOrdering follows compare. I just don't know whether their usage in Spark matters at all, that is, whether this ever determines how two doubles compare outside of a sorting task.

@maropu
Member

maropu commented Nov 28, 2019

The ANSI SQL standard seems to define the order -Infinity < 1.0 < Infinity < NaN = NaN.
PgSQL follows that order now:

IEEE754 specifies that NaN should not compare equal to any other floating-point value
(including NaN). In order to allow floating-point values to be sorted and used
in tree-based indexes, PostgreSQL treats NaN values as equal, and greater
than all non-NaN values.
postgres=# insert into t values ('-NaN'), ('-Infinity'), ('+Infinity'), ('+NaN'), ('1.0');
INSERT 0 5
postgres=# select * from t;
     v     
-----------
       NaN
 -Infinity
  Infinity
       NaN
         1
(5 rows)

postgres=# select * from t order by v;
     v     
-----------
 -Infinity
         1
  Infinity
       NaN
       NaN
(5 rows)

Oracle and Spark currently follow this, too. So, in terms of SQL behaviour, the most important thing, I think,
is to keep this order when switching from 2.12 to 2.13.
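
A quick spark-shell check of the same ordering (just a sketch; it relies on the shell's import spark.implicits._):

Seq(Double.NaN, 1.0, Double.PositiveInfinity, Double.NegativeInfinity)
  .toDF("v").orderBy("v").show()
// expected order: -Infinity, 1.0, Infinity, NaN  (NaN last, as in the Pg output above)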

@srowen
Member Author

srowen commented Nov 28, 2019

@maropu yep, that's good; I don't expect that to change. What about the equivalent of SELECT Infinity < NaN? That kind of case is the worry, if anything. But this seems to work consistently:

val df = spark.createDataFrame(Seq(Double.NegativeInfinity, Double.PositiveInfinity, Double.MinValue, Double.MaxValue, Double.NaN).map(Tuple1(_)))
df.select($"_1" < Double.NaN).show()

+----------+
|(_1 < NaN)|
+----------+
|      true|
|      true|
|      true|
|      true|
|     false|
+----------+

@maropu
Member

maropu commented Nov 29, 2019

Yea, that should be consistent. NaN is the maximum value among double/float values, including infinity, in terms of SQL. So, the result above is correct and matches the Pg one:

postgres=# select * from t;
     v     
-----------
 -Infinity
  Infinity
    1e-307
    1e+308
       NaN
(5 rows)

postgres=# select v < 'NaN' from t;
 ?column? 
----------
 t
 t
 t
 t
 f
(5 rows)

Anyway, the fix in this PR looks pretty reasonable to me.

*/
private[spark] object OrderingUtil {

def compareDouble(x: Double, y: Double): Int = Ordering.Double.TotalOrdering.compare(x, y)
Contributor

@cloud-fan cloud-fan Dec 4, 2019


On second thought, shall we just use java.lang.Double.compare? This is how Scala 2.13 implements Ordering.Double.TotalOrdering:
https://github.com/scala/scala/blob/d0bd8241bb60bebc2bf0cbd2e9b01212fd1de93b/src/library/scala/math/Ordering.scala#L452

Contributor

This is also how Scala 2.12 implements Ordering.Double.compare:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/math/Ordering.scala#L295

Contributor

Then we don't need branches between Scala 2.12 and 2.13. We can also use java.lang.Double.compare to replace Utils.nanSafeCompareDoubles.
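
i.e., roughly this (a sketch; the helper names are only for illustration, callers could also invoke java.lang.{Double,Float}.compare directly):

// One cross-version implementation: NaN still sorts above all other values,
// matching both Utils.nanSafeCompareDoubles and 2.13's TotalOrdering.compare.
def compareDouble(x: Double, y: Double): Int = java.lang.Double.compare(x, y)
def compareFloat(x: Float, y: Float): Int = java.lang.Float.compare(x, y)

compareDouble(Double.NaN, Double.MaxValue)  // 1: NaN last
compareFloat(Float.NaN, Float.MaxValue)     // 1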

Member Author

Yes, that sounds fine. If compare is the only functionality being used, this collapses to something simpler. I can make a follow-up. (We'll need the separate source trees for other reasons, though.)

Member Author

See #26761

cloud-fan pushed a commit that referenced this pull request Dec 5, 2019
…afeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly

### What changes were proposed in this pull request?

Follow up on #26654 (comment)
Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`.

### Why are the changes needed?

Simplification of the previous change, which existed to support Scala 2.13 migration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #26761 from srowen/SPARK-30009.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@srowen srowen deleted the SPARK-30009 branch December 6, 2019 19:06
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…r Scala 2.12 / 2.13

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…afeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly