
Conversation

@srowen
Member

@srowen srowen commented Nov 25, 2019

What changes were proposed in this pull request?

Make separate source trees for Scala 2.12 and 2.13 in order to accommodate mutually incompatible support for Ordering of Double and Float.

Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways:

  • (Split source tree)
  • Inline the Scala 2.12 implementation
  • Reflection

For this change alone, any of these would work, and splitting the source tree is a bit of overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this.
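
For reference, here is a rough sketch of how the version-specific shim could look; the 2.13 body matches the OrderingUtil this change adds, while the 2.12 counterpart is only an assumed illustration, not the exact diff:

// src/main/scala-2.13/.../OrderingUtil.scala
private[spark] object OrderingUtil {
  // NaN sorts above every other value, like java.lang.Double.compare
  def compareDouble(x: Double, y: Double): Int =
    Ordering.Double.TotalOrdering.compare(x, y)
}

// src/main/scala-2.12/.../OrderingUtil.scala (assumed counterpart)
// private[spark] object OrderingUtil {
//   def compareDouble(x: Double, y: Double): Int =
//     Ordering.Double.compare(x, y)
// }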

Why are the changes needed?

Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13.

TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12's Ordering.Double. They differ in how NaN is handled: does it compare greater than all other values, or do comparisons involving NaN always return false? In theory they have different uses: TotalOrdering matters when floating-point values are sorted; IeeeOrdering behaves like 2.12 and the JVM comparison operators.

I chose TotalOrdering because I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It would also be possible to support this with two methods.

Does this PR introduce any user-facing change?

Pending tests; we will see whether it obviously affects any sort order, in particular whether it changes the NaN sort order.

How was this patch tested?

Existing tests so far.

@srowen srowen requested a review from gengliangwang November 25, 2019 01:24
@SparkQA

SparkQA commented Nov 25, 2019

Test build #114364 has finished for PR 26654 at commit 2432d65.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 25, 2019

Test build #114367 has finished for PR 26654 at commit 57fc3f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Member

It seems reasonable to me.
BTW, how can we let Jenkins test with Scala 2.13 as well?

@srowen
Member Author

srowen commented Nov 25, 2019

We can make a Jenkins job later; it's nowhere near passing yet. I am testing locally on 2.13. There is at least one large change that I want to hold until after branch-3.0 is cut, and at least one we can't resolve yet. These changes are relatively easy ones with little or no impact on 2.12.

@dongjoon-hyun
Member

Shall we remove [WIP] from the PR title?

@srowen srowen changed the title [WIP][SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 [SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 Nov 26, 2019
@srowen
Member Author

srowen commented Nov 26, 2019

Yep, sounds like there's no objection to this approach.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Merged to master.
Thank you, @srowen and @gengliangwang !

@gatorsmile
Member

How about the codegen results? Are they consistent?

@cloud-fan @rednaxelafx

@srowen
Member Author

srowen commented Nov 27, 2019

@gatorsmile good question. This change won't affect Scala 2.12 at all. How it affects 2.13 is still unknown; 2.13 isn't compilable just yet, so we can't run the tests. The compile changes are large enough that I'm tackling them iteratively, but at the end we may revise the 2.13 changes to ensure correctness, etc. This is a good example.

@cloud-fan
Contributor

In codegen we use Utils.nanSafeCompareDoubles. Shall we simply implement the Ordering ourselves using the same method? Then it's guaranteed that the codegen and interpreted code paths have the same behavior.
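
A minimal sketch of that idea (the object name here is hypothetical, and it assumes Utils.nanSafeCompareDoubles keeps its current (Double, Double) => Int shape):

import org.apache.spark.util.Utils

// Hypothetical sketch: define only compare; lt/lteq/gt/gteq then derive from it,
// so interpreted comparisons agree with the codegen path that calls nanSafeCompareDoubles.
object NaNSafeDoubleOrdering extends Ordering[Double] {
  def compare(x: Double, y: Double): Int = Utils.nanSafeCompareDoubles(x, y)
}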

@srowen
Member Author

srowen commented Nov 27, 2019

Yeah, that method does what TotalOrdering does. We could adopt it instead of using Scala 2.13's TotalOrdering; it won't matter.

But the two sites that changed here used Scala 2.12's Ordering.Double, which does not work like that method. For SorterSuite it won't matter; for numerics.scala, it could.

We can either leave this in place to maintain the 2.12 behavior, or, if we think the current 2.12 behavior in numerics.scala is a bit wrong and inconsistent, we can instead standardize on the Utils method everywhere and remove the need for OrderingUtil. (There will be other reasons we need the parallel build tree, though.)

WDYT?

@cloud-fan
Contributor

hmm, if the codegen and interpreted behaviors are already inconsistent now with Scala 2.12, it's a serious bug. @viirya @maropu can you take a look?

@srowen
Member Author

srowen commented Nov 27, 2019

Some extra context. Here's the key difference between the two 2.13 orderings:

scala> Ordering.Double.TotalOrdering.compare(Double.NaN, Double.MaxValue)
res5: Int = 1
scala> Ordering.Double.TotalOrdering.gt(Double.NaN, Double.MaxValue)
res6: Boolean = true
scala> Ordering.Double.IeeeOrdering.compare(Double.NaN, Double.MaxValue)
res7: Int = 1
scala> Ordering.Double.IeeeOrdering.gt(Double.NaN, Double.MaxValue)
res8: Boolean = false

The weird thing is IeeeOrdering tries to work like the <, <=, etc operators, but still uses java.lang.Double.compare for compare, which seems internally inconsistent.

Scala 2.12's implementation of Ordering.Double looks the same as IeeeOrdering:

  trait DoubleOrdering extends Ordering[Double] {
    outer =>

    def compare(x: Double, y: Double) = java.lang.Double.compare(x, y)

    override def lteq(x: Double, y: Double): Boolean = x <= y
    override def gteq(x: Double, y: Double): Boolean = x >= y
    override def lt(x: Double, y: Double): Boolean = x < y
    override def gt(x: Double, y: Double): Boolean = x > y
   ...

TotalOrdering delegates all of that to the result of java.lang.Double.compare, which appears to already handle NaNs consistently by sorting them above all other non-NaN values.
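
For example (REPL-style, per the java.lang.Double.compare contract):

java.lang.Double.compare(Double.NaN, Double.MaxValue)          // positive: NaN sorts above MaxValue
java.lang.Double.compare(Double.NaN, Double.PositiveInfinity)  // positive: NaN sorts above Infinity
java.lang.Double.compare(Double.NaN, Double.NaN)               // 0: NaNs compare as equal to each other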

Actually, do we need Utils.nanSafeCompareDoubles then, if it's the same?

So, back to the point: the question is whether this difference matters in the current build. The difference arises in a few places:

  • Double/FloatExactNumeric in numerics.scala used in Double/FloatType as their 'exactNumeric' implementation
  • Double/FloatType, which define an ordering based on Utils.nanSafeCompareXXX, but also use things like DoubleAsIfIntegral which inherits from DoubleOrdering.

I don't know whether the ordering in these two places actually matters; at a glance, it looks like these objects do not use them for ordering. I don't know the details well enough to figure out whether there is a subtle issue there.

I didn't immediately see a test which exercises something like sorting NaNs with non-NaN values, and I'm not sure how to write one that checks codegen vs. interpreted. But that would tell us one way or the other whether there is an issue.
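
Something along these lines might be a starting point, though it's an untested sketch and assumes that toggling spark.sql.codegen.wholeStage is enough to exercise both code paths for a sort:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("nan-sort-check").getOrCreate()
import spark.implicits._

def sortedBits(codegen: Boolean): Seq[Long] = {
  spark.conf.set("spark.sql.codegen.wholeStage", codegen)
  Seq(Double.NaN, 1.0, Double.PositiveInfinity, Double.NegativeInfinity, Double.NaN)
    .toDF("v").orderBy("v").as[Double].collect().toSeq
    .map(d => java.lang.Double.doubleToLongBits(d))  // NaN-safe equality on the results
}

// NaN should sort above everything in both paths; a mismatch would point at an
// Ordering inconsistency between codegen and interpreted evaluation.
assert(sortedBits(codegen = true) == sortedBits(codegen = false))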

In any event, it seems like:

  • The choice of TotalOrdering in 2.13 is good, as it's consistent with Double.compare and Utils.nanSafeCompareDoubles
  • We will have to see whether the 2.13 tests pass later, and revisit if not
  • We might even be able to remove nanSafeCompareDoubles for simplicity, but that's not important

Comment on lines +25 to +26
* It functions like Ordering.Double.TotalOrdering in Scala 2.13, which matches java.lang.Double
* rather than Scala 2.12's Ordering.Double in handling of NaN.
Member


hmm, doesn't Scala 2.12's Ordering.Double.compare delegate to java.lang.Double.compare?

why does this match java.lang.Double, but not Scala 2.12's Ordering.Double?

Member Author

It does, for compare. The superclass Ordering then defines operations like lt, lteq, etc. in terms of compare, but 2.12's Ordering.Double overrides them to use the operators <, <=, and so on. As far as I can tell it already presents a consistent total ordering via compare (as java.lang.Double.compare appears to), but that ordering isn't consistent with how NaN behaves in lt, etc. Then again... perhaps neither is Java: its Comparator.naturalOrder() would work consistently, but those comparisons don't match the Java operators.

Member Author

So I guess the question is: where do these Orderings get used by Spark? It's not immediately clear that they're used at all. If they're used for sorting, all is well, I think, as sorting would use compare, and all of the implementations in question behave the same way there.

If they're used to evaluate how doubles compare somewhere else, should those answers be consistent with the sort ordering, or with the Java/Scala operators? I'd presume the former, but that's not how it works right now, and the choice of TotalOrdering changes that in 2.13.

  • If we think the current behavior is correct, and matters, then 2.12 is OK, and we use IeeeOrdering in 2.13 to be conservative
  • If the current behavior doesn't matter, it doesn't matter what we choose; TotalOrdering feels more logical
  • If the current behavior is wrong, we can patch 2.12 to work like 2.13's TotalOrdering. Then the 2.13 choice, TotalOrdering, is already correct

I actually suspect it doesn't matter and doesn't get used.

@viirya
Member

viirya commented Nov 27, 2019

hmm, if the codegen and interpreted behaviors are already inconsistent now with Scala 2.12, it's a serious bug. @viirya @maropu can you take a look?

You mean current master? I compared Utils.nanSafeCompareDoubles with DoubleExactNumeric.compare:

left: -1.7976931348623157E308, right: NaN, compare(codegen): -1, compare(interpreted): -1
left: NaN, right: -1.7976931348623157E308, compare(codegen): 1, compare(interpreted): 1
left: 1.7976931348623157E308, right: NaN, compare(codegen): -1, compare(interpreted): -1
left: NaN, right: 1.7976931348623157E308, compare(codegen): 1, compare(interpreted): 1
left: NaN, right: NaN, compare(codegen): 0, compare(interpreted): 0 

@srowen
Member Author

srowen commented Nov 27, 2019

Yeah, that's to be expected; the compare functionality hasn't changed. (I think it highlights that Double.compare already does what Utils.nanSafeCompareDoubles does?) The question, I think, is whether the definition of lt, lteq, etc. matters to Spark. I mean, we know it behaves differently in the two Scala 2.13 implementations: IeeeOrdering follows the JVM operators, TotalOrdering follows compare. I just don't know whether their usage in Spark matters at all, that is, whether this ever determines how two doubles compare outside of a sorting task.

@maropu
Member

maropu commented Nov 28, 2019

The ANSI SQL standard seems to define the order -Infinity < 1.0 < Infinity < NaN = NaN.
PgSQL follows that order now:

IEEE754 specifies that NaN should not compare equal to any other floating-point value
(including NaN). In order to allow floating-point values to be sorted and used
in tree-based indexes, PostgreSQL treats NaN values as equal, and greater
than all non-NaN values.
postgres=# insert into t values ('-NaN'), ('-Infinity'), ('+Infinity'), ('+NaN'), ('1.0');
INSERT 0 5
postgres=# select * from t;
     v     
-----------
       NaN
 -Infinity
  Infinity
       NaN
         1
(5 rows)

postgres=# select * from t order by v;
     v     
-----------
 -Infinity
         1
  Infinity
       NaN
       NaN
(5 rows)

Oracle and Spark currently follow this, too. So, in terms of SQL behaviour, the most important thing, I think,
is to keep this order when switching from 2.12 to 2.13.
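
A quick spark-shell check of the same ordering (just a sketch; it relies on the shell's import spark.implicits._):

Seq(Double.NaN, 1.0, Double.PositiveInfinity, Double.NegativeInfinity)
  .toDF("v").orderBy("v").show()
// expected order: -Infinity, 1.0, Infinity, NaN  (NaN last, as in the Pg output above)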

@srowen
Member Author

srowen commented Nov 28, 2019

@maropu yep, that's good; I don't expect that to change. What about the equivalent of SELECT Infinity < NaN? That kind of case is the worry, if anything. But this seems to work consistently:

val df = spark.createDataFrame(Seq(Double.NegativeInfinity, Double.PositiveInfinity, Double.MinValue, Double.MaxValue, Double.NaN).map(Tuple1(_)))
df.select($"_1" < Double.NaN).show()

+----------+
|(_1 < NaN)|
+----------+
|      true|
|      true|
|      true|
|      true|
|     false|
+----------+

@maropu
Member

maropu commented Nov 29, 2019

Yea, that should be consistent. NaN is the maximum value among double/float values, including infinity, in terms of SQL. So, the result above is correct and matches the Pg one:

postgres=# select * from t;
     v     
-----------
 -Infinity
  Infinity
    1e-307
    1e+308
       NaN
(5 rows)

postgres=# select v < 'NaN' from t;
 ?column? 
----------
 t
 t
 t
 t
 f
(5 rows)

Anyway, the fix in this PR looks pretty reasonable to me.

*/
private[spark] object OrderingUtil {

def compareDouble(x: Double, y: Double): Int = Ordering.Double.TotalOrdering.compare(x, y)
Contributor

@cloud-fan cloud-fan Dec 4, 2019


On second thought, shall we just use java.lang.Double.compare? This is how Scala 2.13 implements Ordering.Double.TotalOrdering:
https://github.com/scala/scala/blob/d0bd8241bb60bebc2bf0cbd2e9b01212fd1de93b/src/library/scala/math/Ordering.scala#L452

Contributor

This is also how Scala 2.12 implements Ordering.Double.compare:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/math/Ordering.scala#L295

Contributor

Then we don't need branches between Scala 2.12 and 2.13. We can also use java.lang.Double.compare to replace Utils.nanSafeCompareDoubles.
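
i.e., roughly this (a sketch; the helper names are only for illustration, callers could also invoke java.lang.{Double,Float}.compare directly):

// One cross-version implementation: NaN still sorts above all other values,
// matching both Utils.nanSafeCompareDoubles and 2.13's TotalOrdering.compare.
def compareDouble(x: Double, y: Double): Int = java.lang.Double.compare(x, y)
def compareFloat(x: Float, y: Float): Int = java.lang.Float.compare(x, y)

compareDouble(Double.NaN, Double.MaxValue)  // 1: NaN last
compareFloat(Float.NaN, Float.MaxValue)     // 1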

Member Author

Yes, that sounds fine. If compare is the only functionality being used, this collapses to something simpler. I can make a follow-up. (We'll need the separate source trees for other reasons, though.)

Member Author

See #26761

cloud-fan pushed a commit that referenced this pull request Dec 5, 2019
…afeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly

### What changes were proposed in this pull request?

Follow up on #26654 (comment)
Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`.

### Why are the changes needed?

Simplification of the previous change, which existed to support Scala 2.13 migration.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #26761 from srowen/SPARK-30009.2.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@srowen srowen deleted the SPARK-30009 branch December 6, 2019 19:06
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…r Scala 2.12 / 2.13

attilapiros pushed a commit to attilapiros/spark that referenced this pull request Dec 6, 2019
…afeCompare{Doubles,Floats} and use java.lang.{Double,Float}.compare directly