Commit f92c827

ahmed-mahran authored and srowen committed
[SPARK-41008][MLLIB] Follow-up isotonic regression features deduplica…
### What changes were proposed in this pull request?

A follow-up on #38966 to update relevant documentation and remove redundant sort key.

### Why are the changes needed?

For isotonic regression, another method for breaking ties of repeated features was introduced in #38966. This will aggregate points having the same feature value by computing the weighted average of the labels.

- This only requires points to be sorted by features instead of features and labels. So, we should remove label as a secondary sorting key.
- Isotonic regression documentation needs to be updated to reflect the new behavior.

### Does this PR introduce _any_ user-facing change?

Isotonic regression documentation update. The documentation described the behavior of the algorithm when there are points in the input with repeated features. Since this behavior has changed, documentation needs to describe the new behavior.

### How was this patch tested?

Existing tests passed. No need to add new tests since existing tests are already comprehensive.

srowen

Closes #38996 from ahmed-mahran/ml-isotonic-reg-dups-follow-up.

Authored-by: Ahmed Mahran <ahmed.mahran@mashin.io>
Signed-off-by: Sean Owen <srowen@gmail.com>
1 parent af33722 commit f92c827
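
As a worked illustration of the aggregation the commit message describes, the following writes it out as a formula. The symbols $y_i$, $w_i$, $x$, $\bar{y}$ and $W$ are notation introduced here and do not appear in the patch. All points $(y_i, x, w_i)$ sharing one feature value $x$ collapse into a single point $(\bar{y}, x, W)$:

```latex
\[
W = \sum_{i} w_i, \qquad
\bar{y} = \frac{\sum_{i} w_i \, y_i}{\sum_{i} w_i}
\]
% Example from the updated test input: feature 0.2 carries labels 1, 0, 0, 0, each with weight 1.
\[
W = 1 + 1 + 1 + 1 = 4, \qquad
\bar{y} = \frac{1 \cdot 1 + 1 \cdot 0 + 1 \cdot 0 + 1 \cdot 0}{4} = 0.25
\]
```

Neither sum depends on the order in which the labels arrive (up to floating-point rounding), which is consistent with the commit dropping the label as a secondary sort key.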

3 files changed: +67 -62 lines changed

docs/mllib-isotonic-regression.md

Lines changed: 10 additions & 8 deletions
@@ -43,7 +43,14 @@ best fitting the original data points.
 which uses an approach to
 [parallelizing isotonic regression](https://doi.org/10.1007/978-3-642-99789-1_10).
 The training input is an RDD of tuples of three double values that represent
-label, feature and weight in this order. Additionally, IsotonicRegression algorithm has one
+label, feature and weight in this order. In case there are multiple tuples with
+the same feature then these tuples are aggregated into a single tuple as follows:
+
+* Aggregated label is the weighted average of all labels.
+* Aggregated feature is the unique feature value.
+* Aggregated weight is the sum of all weights.
+
+Additionally, IsotonicRegression algorithm has one
 optional parameter called $isotonic$ defaulting to true.
 This argument specifies if the isotonic regression is
 isotonic (monotonically increasing) or antitonic (monotonically decreasing).
@@ -53,17 +60,12 @@ labels for both known and unknown features. The result of isotonic regression
 is treated as piecewise linear function. The rules for prediction therefore are:
 
 * If the prediction input exactly matches a training feature
-  then associated prediction is returned. In case there are multiple predictions with the same
-  feature then one of them is returned. Which one is undefined
-  (same as java.util.Arrays.binarySearch).
+  then associated prediction is returned.
 * If the prediction input is lower or higher than all training features
   then prediction with lowest or highest feature is returned respectively.
-  In case there are multiple predictions with the same feature
-  then the lowest or highest is returned respectively.
 * If the prediction input falls between two training features then prediction is treated
   as piecewise linear function and interpolated value is calculated from the
-  predictions of the two closest features. In case there are multiple values
-  with the same feature then the same rules as in previous point are used.
+  predictions of the two closest features.
 
 ### Examples
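
The three prediction rules in the updated documentation can be sketched as a small standalone function. This is a minimal illustration, not the model's actual code (which this diff does not show): `PredictSketch` and its `predict` signature are hypothetical names, and the sketch assumes `boundaries` is non-empty and strictly increasing, which holds once duplicate features have been aggregated.

```scala
import java.util.Arrays.binarySearch

// Hypothetical helper for illustration only; not Spark's IsotonicRegressionModel.
object PredictSketch {

  /** Piecewise-linear prediction over strictly increasing, non-empty boundaries. */
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    val found = binarySearch(boundaries, x)
    if (found >= 0) {
      // Rule 1: the input exactly matches a training feature.
      predictions(found)
    } else {
      // binarySearch returns -(insertionPoint) - 1 when the key is absent.
      val insertAt = -found - 1
      if (insertAt == 0) {
        // Rule 2: below every training feature.
        predictions.head
      } else if (insertAt == boundaries.length) {
        // Rule 2: above every training feature.
        predictions.last
      } else {
        // Rule 3: interpolate linearly between the two closest features.
        val (x0, x1) = (boundaries(insertAt - 1), boundaries(insertAt))
        val (y0, y1) = (predictions(insertAt - 1), predictions(insertAt))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

Because every boundary is unique after aggregation, an exact match identifies a single prediction, which is why the documentation no longer needs the "which one is undefined" caveat.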

mllib/src/main/scala/org/apache/spark/mllib/regression/IsotonicRegression.scala

Lines changed: 38 additions & 44 deletions
@@ -23,7 +23,6 @@ import java.util.Arrays.binarySearch
 import scala.collection.JavaConverters._
 import scala.collection.mutable.ArrayBuffer
 
-import org.apache.commons.math3.util.Precision
 import org.json4s._
 import org.json4s.JsonDSL._
 import org.json4s.jackson.JsonMethods._
@@ -272,8 +271,8 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    * @param input RDD of tuples (label, feature, weight) where label is dependent variable
    *              for which we calculate isotonic regression, feature is independent variable
    *              and weight represents number of measures with default 1.
-   *              If multiple labels share the same feature value then they are ordered before
-   *              the algorithm is executed.
+   *              If multiple labels share the same feature value then they are aggregated using
+   *              the weighted average before the algorithm is executed.
    * @return Isotonic regression model.
    */
   @Since("1.3.0")
@@ -298,8 +297,8 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
    * @param input JavaRDD of tuples (label, feature, weight) where label is dependent variable
    *              for which we calculate isotonic regression, feature is independent variable
    *              and weight represents number of measures with default 1.
-   *              If multiple labels share the same feature value then they are ordered before
-   *              the algorithm is executed.
+   *              If multiple labels share the same feature value then they are aggregated using
+   *              the weighted average before the algorithm is executed.
    * @return Isotonic regression model.
    */
   @Since("1.3.0")
@@ -310,21 +309,14 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
   /**
    * Aggregates points of duplicate feature values into a single point using as label the weighted
    * average of the labels of the points with duplicate feature values. All points for a unique
-   * feature values are aggregated as:
+   * feature value are aggregated as:
    *
-   *   - Aggregated label is the weighted average of all labels
-   *   - Aggregated feature is the weighted average of all equal features[1]
-   *   - Aggregated weight is the sum of all weights
+   *   - Aggregated label is the weighted average of all labels.
+   *   - Aggregated feature is the unique feature value.
+   *   - Aggregated weight is the sum of all weights.
    *
-   * [1] Note: It is possible that feature values to be equal up to a resolution due to
-   * representation errors, since we cannot know which feature value to use in that case, we
-   * compute the weighted average of the features. Ideally, all feature values will be equal and
-   * the weighted average is just the value at any point.
-   *
-   * @param input
-   *   Input data of tuples (label, feature, weight). Weights must be non-negative.
-   * @return
-   *   Points with unique feature values.
+   * @param input Input data of tuples (label, feature, weight). Weights must be non-negative.
+   * @return Points with unique feature values.
    */
   private[regression] def makeUnique(
       input: Array[(Double, Double, Double)]): Array[(Double, Double, Double)] = {
@@ -339,28 +331,28 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
     if (cleanInput.length <= 1) {
       cleanInput
     } else {
-      // whether or not two double features are equal up to a precision
-      @inline def areEqual(a: Double, b: Double): Boolean = Precision.equals(a, b)
-
       val pointsAccumulator = new IsotonicRegression.PointsAccumulator
-      var (_, prevFeature, _) = cleanInput.head
-
-      // Go through input points, merging all points with approximately equal feature values into
-      // a single point. Equality of features is defined by areEqual method. The label of the
-      // accumulated points is the weighted average of the labels of all points of equal feature
-      // value. It is possible that feature values to be equal up to a resolution due to
-      // representation errors, since we cannot know which feature value to use in that case,
-      // we compute the weighted average of the features.
-      cleanInput.foreach { case point @ (_, feature, _) =>
-        if (areEqual(feature, prevFeature)) {
+
+      // Go through input points, merging all points with equal feature values into a single point.
+      // Equality of features is defined by shouldAccumulate method. The label of the accumulated
+      // points is the weighted average of the labels of all points of equal feature value.
+
+      // Initialize with first point
+      pointsAccumulator := cleanInput.head
+      // Accumulate the rest
+      cleanInput.tail.foreach { case point @ (_, feature, _) =>
+        if (pointsAccumulator.shouldAccumulate(feature)) {
+          // Still on a duplicate feature, accumulate
           pointsAccumulator += point
         } else {
+          // A new unique feature encountered:
+          // - append the last accumulated point to unique features output
           pointsAccumulator.appendToOutput()
+          // - and reset
           pointsAccumulator := point
         }
-        prevFeature = feature
       }
-      // Append the last accumulated point
+      // Append the last accumulated point to unique features output
       pointsAccumulator.appendToOutput()
       pointsAccumulator.getOutput
     }
@@ -488,14 +480,14 @@ class IsotonicRegression private (private var isotonic: Boolean) extends Seriali
       // Points with same or adjacent features must collocate within the same partition.
       .partitionBy(new RangePartitioner(keyedInput.getNumPartitions, keyedInput))
       .values
-      // Lexicographically sort points by features then labels.
-      .mapPartitions(p => Iterator(p.toArray.sortBy(x => (x._2, x._1))))
+      // Lexicographically sort points by features.
+      .mapPartitions(p => Iterator(p.toArray.sortBy(_._2)))
       // Aggregate points with equal features into a single point.
       .map(makeUnique)
       .flatMap(poolAdjacentViolators)
       .collect()
       // Sort again because collect() doesn't promise ordering.
-      .sortBy(x => (x._2, x._1))
+      .sortBy(_._2)
     poolAdjacentViolators(parallelStepResult)
   }
 }
@@ -511,30 +503,32 @@ object IsotonicRegression {
     private var (currentLabel: Double, currentFeature: Double, currentWeight: Double) =
       (0d, 0d, 0d)
 
+    /** Whether or not this feature exactly equals the current accumulated feature. */
+    @inline def shouldAccumulate(feature: Double): Boolean = currentFeature == feature
+
     /** Resets the current value of the point accumulator using the provided point. */
-    def :=(point: (Double, Double, Double)): Unit = {
+    @inline def :=(point: (Double, Double, Double)): Unit = {
       val (label, feature, weight) = point
       currentLabel = label * weight
-      currentFeature = feature * weight
+      currentFeature = feature
       currentWeight = weight
     }
 
     /** Accumulates the provided point into the current value of the point accumulator. */
-    def +=(point: (Double, Double, Double)): Unit = {
-      val (label, feature, weight) = point
+    @inline def +=(point: (Double, Double, Double)): Unit = {
+      val (label, _, weight) = point
       currentLabel += label * weight
-      currentFeature += feature * weight
       currentWeight += weight
     }
 
     /** Appends the current value of the point accumulator to the output. */
-    def appendToOutput(): Unit =
+    @inline def appendToOutput(): Unit =
       output += ((
         currentLabel / currentWeight,
-        currentFeature / currentWeight,
+        currentFeature,
         currentWeight))
 
     /** Returns all accumulated points so far. */
-    def getOutput: Array[(Double, Double, Double)] = output.toArray
+    @inline def getOutput: Array[(Double, Double, Double)] = output.toArray
   }
 }
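
To see the effect of the changes in this file in isolation, here is a minimal standalone sketch of the same idea: sort by feature only, then merge runs of exactly equal features into one point. It is not the patched `makeUnique` or `PointsAccumulator`; `MakeUniqueSketch` and the `Point` alias are hypothetical names, the fold replaces the mutable accumulator for brevity, and the sketch assumes all weights are strictly positive.

```scala
// Hypothetical standalone sketch; not Spark's makeUnique or PointsAccumulator.
object MakeUniqueSketch {
  // (label, feature, weight)
  type Point = (Double, Double, Double)

  /** Aggregates points with exactly equal features; assumes strictly positive weights. */
  def makeUnique(points: Seq[Point]): List[Point] = {
    // The feature is the only sort key; label order within a run does not matter.
    val sorted = points.sortBy(_._2)
    // Accumulate (sum of weight * label, feature, sum of weights) per run, newest run first.
    val sums = sorted.foldLeft(List.empty[Point]) {
      case ((sumWy, x, sumW) :: rest, (y, feature, w)) if feature == x =>
        // Still on a duplicate feature: extend the current run.
        (sumWy + y * w, x, sumW + w) :: rest
      case (acc, (y, feature, w)) =>
        // A new feature value: start a new run.
        (y * w, feature, w) :: acc
    }
    // Restore ascending feature order and turn the label sums into weighted averages.
    sums.reverse.map { case (sumWy, x, sumW) => (sumWy / sumW, x, sumW) }
  }
}
```

Applied to the input of the updated SPARK-41008 test, this yields one point per feature value (0.2, 1/3 and 0.6) with labels 0.25, 1/3 and 0.5, matching the boundaries and predictions that test asserts.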

mllib/src/test/scala/org/apache/spark/mllib/regression/IsotonicRegressionSuite.scala

Lines changed: 19 additions & 10 deletions
@@ -17,7 +17,6 @@
 
 package org.apache.spark.mllib.regression
 
-import org.apache.commons.math3.util.Precision
 import org.scalatest.matchers.must.Matchers
 
 import org.apache.spark.{SparkException, SparkFunSuite}
@@ -225,12 +224,18 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
 
   test("SPARK-41008 isotonic regression with duplicate features differs from sklearn") {
     val model = runIsotonicRegressionOnInput(
-      Seq((2, 1, 1), (1, 1, 1), (0, 2, 1), (1, 2, 1), (0.5, 3, 1), (0, 3, 1)),
+      Seq((1, 0.6, 1), (0, 0.6, 1),
+        (0, 1.0 / 3, 1), (1, 1.0 / 3, 1), (0, 1.0 / 3, 1),
+        (1, 0.2, 1), (0, 0.2, 1), (0, 0.2, 1), (0, 0.2, 1)),
       true,
       2)
 
-    assert(model.boundaries === Array(1.0, 3.0))
-    assert(model.predictions === Array(0.75, 0.75))
+    assert(model.boundaries === Array(0.2, 1.0 / 3, 0.6))
+    assert(model.predictions === Array(0.25, 1.0 / 3, 0.5))
+
+    assert(model.predict(0.6) === 0.5)
+    assert(model.predict(1.0 / 3) === 1.0 / 3)
+    assert(model.predict(0.2) === 0.25)
   }
 
   test("isotonic regression prediction") {
@@ -327,9 +332,8 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
   test("makeUnique: handle duplicate features") {
     val regressor = new IsotonicRegression()
     import regressor.makeUnique
-    import Precision.EPSILON
 
-    // Note: input must be lexicographically sorted by (feature, label)
+    // Note: input must be lexicographically sorted by feature
 
     // empty
     assert(makeUnique(Array.empty) === Array.empty)
@@ -373,9 +377,14 @@ class IsotonicRegressionSuite extends SparkFunSuite with MLlibTestSparkContext w
         (10.0 * 0.3 + 20.0 * 0.3 + 30.0 * 0.4, 2.0, 1.0),
         (10.0, 3.0, 1.0)))
 
-    // duplicate up to resolution error
-    assert(
-      makeUnique(Array((1.0, 1.0, 1.0), (1.0, 1.0 - EPSILON, 1.0), (1.0, 1.0 + EPSILON, 1.0))) ===
-        Array((1.0, 1.0, 3.0)))
+    // don't handle tiny representation errors
+    // e.g. infinitely adjacent doubles are already unique
+    val adjacentDoubles = {
+      // i-th next representable double to 1.0 is java.lang.Double.longBitsToDouble(base + i)
+      val base = java.lang.Double.doubleToRawLongBits(1.0)
+      (0 until 10).map(i => java.lang.Double.longBitsToDouble(base + i))
+        .map((1.0, _, 1.0)).toArray
+    }
+    assert(makeUnique(adjacentDoubles) === adjacentDoubles)
   }
 }
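
The bit manipulation in the new test case can be checked on its own. The sketch below is a hedged illustration: `AdjacentDoublesSketch`, `nthNext` and `neighbours` are hypothetical names, and it assumes `start` is positive and finite, so that adding to the raw bit pattern steps to the next representable double. It shows that the ten values built in the test are pairwise distinct under `==`, the comparison `shouldAccumulate` now uses.

```scala
// Hypothetical helper for illustration; mirrors the bit trick used in the test above.
object AdjacentDoublesSketch {

  /** The i-th representable double above `start`, assuming `start` is positive and finite. */
  def nthNext(start: Double, i: Int): Double =
    java.lang.Double.longBitsToDouble(java.lang.Double.doubleToRawLongBits(start) + i)

  /** 1.0 followed by its nine nearest larger neighbours. */
  val neighbours: Seq[Double] = (0 until 10).map(nthNext(1.0, _))
}
```

Since each neighbour differs from the previous one by a single unit in the last place, exact equality keeps all ten apart, so a feature-exact `makeUnique` has nothing to merge.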
