
Commit 69550f7

jkbradley authored and mengxr committed
[BRANCH-1.2][SPARK-4583][MLLIB] LogLoss for GradientBoostedTrees fix + doc updates
We reverted #3439 in branch-1.2 due to missing `import o.a.s.SparkContext._`, which is no longer needed in master (#3262). This PR adds #3439 back to branch-1.2 with correct imports. Github is out-of-sync now. The real changes are the last two commits.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3474 from mengxr/SPARK-4583-1.2 and squashes the following commits:

aca2abb [Xiangrui Meng] add import o.a.s.SparkContext._ for v1.2
6b5564a [Joseph K. Bradley] [SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
1 parent 8fc19e5 commit 69550f7
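
Note on the import mentioned in the commit message: on branch-1.2, methods such as mean() on an RDD[Double] come from DoubleRDDFunctions, which is only reachable through the implicit conversions in the SparkContext companion object; master no longer needs the import because those implicits were relocated (#3262). A minimal sketch of the 1.2-style usage follows; the helper name is illustrative and not part of this patch.

import org.apache.spark.SparkContext._  // brings the RDD[Double] implicits (e.g. doubleRDDToDoubleRDDFunctions) into scope on 1.2
import org.apache.spark.rdd.RDD

// Without the SparkContext._ import above, errors.mean() does not compile on branch-1.2.
def meanAbsoluteError(errors: RDD[Double]): Double = errors.map(e => math.abs(e)).mean()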

File tree

6 files changed, +147 -70 lines changed

mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala

Lines changed: 10 additions & 8 deletions
@@ -31,18 +31,20 @@ import org.apache.spark.storage.StorageLevel
 
 /**
  * :: Experimental ::
- * A class that implements Stochastic Gradient Boosting for regression and binary classification.
+ * A class that implements
+ * [[http://en.wikipedia.org/wiki/Gradient_boosting  Stochastic Gradient Boosting]]
+ * for regression and binary classification.
  *
  * The implementation is based upon:
  *   J.H. Friedman.  "Stochastic Gradient Boosting."  1999.
  *
- * Notes:
- *  - This currently can be run with several loss functions.  However, only SquaredError is
- *    fully supported.  Specifically, the loss function should be used to compute the gradient
- *    (to re-label training instances on each iteration) and to weight weak hypotheses.
- *    Currently, gradients are computed correctly for the available loss functions,
- *    but weak hypothesis weights are not computed correctly for LogLoss or AbsoluteError.
- *    Running with those losses will likely behave reasonably, but lacks the same guarantees.
+ * Notes on Gradient Boosting vs. TreeBoost:
+ *  - This implementation is for Stochastic Gradient Boosting, not for TreeBoost.
+ *  - Both algorithms learn tree ensembles by minimizing loss functions.
+ *  - TreeBoost (Friedman, 1999) additionally modifies the outputs at tree leaf nodes
+ *    based on the loss function, whereas the original gradient boosting method does not.
+ *  - When the loss is SquaredError, these methods give the same result, but they could differ
+ *    for other loss functions.
  *
  * @param boostingStrategy Parameters for the gradient boosting algorithm.
  */

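To make the TreeBoost distinction above concrete, here is a minimal, self-contained sketch (plain Scala, not Spark code; all names are illustrative) of one round of the plain gradient-boosting update: the weak learner is fit to the negative loss gradients and its raw output is added with a learning rate, without the per-leaf re-optimization that TreeBoost would perform.

// Sketch only (illustrative, not Spark code): one round of plain stochastic gradient boosting.
object BoostingRoundSketch {
  type Model = Array[Double] => Double

  def boostOnce(
      data: Seq[(Double, Array[Double])],                 // (label, features) pairs
      current: Model,                                     // current ensemble F_m
      fitTree: Seq[(Double, Array[Double])] => Model,     // weak learner fit on (target, features)
      negGradient: (Double, Double) => Double,            // -dL/dF evaluated at (label, F_m(x))
      learningRate: Double): Model = {
    // Re-label each instance with its pseudo-residual (the negative loss gradient).
    val pseudoResiduals = data.map { case (y, x) => (negGradient(y, current(x)), x) }
    val tree = fitTree(pseudoResiduals)
    // TreeBoost would additionally re-fit the leaf values of `tree` for the chosen loss;
    // plain gradient boosting just adds the scaled tree output to the ensemble.
    x => current(x) + learningRate * tree(x)
  }
}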
mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala

Lines changed: 43 additions & 1 deletion
@@ -37,7 +37,8 @@ import org.apache.spark.util.Utils
 
 /**
  * :: Experimental ::
- * A class which implements a random forest learning algorithm for classification and regression.
+ * A class that implements a [[http://en.wikipedia.org/wiki/Random_forest  Random Forest]]
+ * learning algorithm for classification and regression.
  * It supports both continuous and categorical features.
  *
  * The settings for featureSubsetStrategy are based on the following references:
@@ -70,6 +71,47 @@ private class RandomForest (
     private val seed: Int)
   extends Serializable with Logging {
 
+  /*
+    ALGORITHM
+    This is a sketch of the algorithm to help new developers.
+
+    The algorithm partitions data by instances (rows).
+    On each iteration, the algorithm splits a set of nodes.  In order to choose the best split
+    for a given node, sufficient statistics are collected from the distributed data.
+    For each node, the statistics are collected to some worker node, and that worker selects
+    the best split.
+
+    This setup requires discretization of continuous features.  This binning is done in the
+    findSplitsBins() method during initialization, after which each continuous feature becomes
+    an ordered discretized feature with at most maxBins possible values.
+
+    The main loop in the algorithm operates on a queue of nodes (nodeQueue).  These nodes
+    lie at the periphery of the tree being trained.  If multiple trees are being trained at once,
+    then this queue contains nodes from all of them.  Each iteration works roughly as follows:
+      On the master node:
+        - Some number of nodes are pulled off of the queue (based on the amount of memory
+          required for their sufficient statistics).
+        - For random forests, if featureSubsetStrategy is not "all," then a subset of candidate
+          features are chosen for each node.  See method selectNodesToSplit().
+      On worker nodes, via method findBestSplits():
+        - The worker makes one pass over its subset of instances.
+        - For each (tree, node, feature, split) tuple, the worker collects statistics about
+          splitting.  Note that the set of (tree, node) pairs is limited to the nodes selected
+          from the queue for this iteration.  The set of features considered can also be limited
+          based on featureSubsetStrategy.
+        - For each node, the statistics for that node are aggregated to a particular worker
+          via reduceByKey().  The designated worker chooses the best (feature, split) pair,
+          or chooses to stop splitting if the stopping criteria are met.
+      On the master node:
+        - The master collects all decisions about splitting nodes and updates the model.
+        - The updated model is passed to the workers on the next iteration.
+    This process continues until the node queue is empty.
+
+    Most of the methods in this implementation support the statistics aggregation, which is
+    the heaviest part of the computation.  In general, this implementation is bound by either
+    the cost of statistics computation on workers or by communicating the sufficient statistics.
+   */
+
   strategy.assertValid()
   require(numTrees > 0, s"RandomForest requires numTrees > 0, but was given numTrees = $numTrees.")
   require(RandomForest.supportedFeatureSubsetStrategies.contains(featureSubsetStrategy),

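The ALGORITHM comment added above describes the main loop in prose; the following is a rough, hypothetical sketch of that node-queue driver. The method names mirror the comment, but the signatures are illustrative and not the real RandomForest internals.

// Hypothetical sketch of the node-queue driver loop; not the actual RandomForest code.
import scala.collection.mutable

object NodeQueueSketch {
  final case class NodeRef(treeIndex: Int, nodeIndex: Int)

  def trainLoop(
      nodeQueue: mutable.Queue[NodeRef],
      selectNodesToSplit: mutable.Queue[NodeRef] => Array[NodeRef],  // master: dequeue as many nodes as fit in memory
      findBestSplits: Array[NodeRef] => Seq[NodeRef]): Unit = {      // workers: aggregate stats, return new child nodes
    while (nodeQueue.nonEmpty) {
      val nodesToSplit = selectNodesToSplit(nodeQueue)        // may also pick per-node feature subsets
      val newChildren = findBestSplits(nodesToSplit)          // empty for nodes that met the stopping criteria
      newChildren.foreach(child => nodeQueue.enqueue(child))  // the frontier keeps growing until no node can split
    }
  }
}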
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/AbsoluteError.scala

Lines changed: 12 additions & 13 deletions
@@ -25,19 +25,20 @@ import org.apache.spark.rdd.RDD
 
 /**
  * :: DeveloperApi ::
- * Class for least absolute error loss calculation.
- * The features x and the corresponding label y is predicted using the function F.
- * For each instance:
- * Loss: |y - F|
- * Negative gradient: sign(y - F)
+ * Class for absolute error loss calculation (for regression).
+ *
+ * The absolute (L1) error is defined as:
+ *  |y - F(x)|
+ * where y is the label and F(x) is the model prediction for features x.
  */
 @DeveloperApi
 object AbsoluteError extends Loss {
 
   /**
    * Method to calculate the gradients for the gradient boosting calculation for least
    * absolute error calculation.
-   * @param model Model of the weak learner
+   * The gradient with respect to F(x) is: sign(F(x) - y)
+   * @param model Ensemble model
    * @param point Instance of the training dataset
    * @return Loss gradient
    */
@@ -48,19 +49,17 @@ object AbsoluteError extends Loss {
   }
 
   /**
-   * Method to calculate error of the base learner for the gradient boosting calculation.
+   * Method to calculate loss of the base learner for the gradient boosting calculation.
    * Note: This method is not used by the gradient boosting algorithm but is useful for debugging
    * purposes.
-   * @param model Model of the weak learner.
+   * @param model Ensemble model
    * @param data Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return
+   * @return  Mean absolute error of model on data
    */
   override def computeError(model: TreeEnsembleModel, data: RDD[LabeledPoint]): Double = {
-    val sumOfAbsolutes = data.map { y =>
+    data.map { y =>
       val err = model.predict(y.features) - y.label
       math.abs(err)
-    }.sum()
-    sumOfAbsolutes / data.count()
+    }.mean()
   }
-
 }

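As a standalone check on the hunk above, here is the per-instance absolute-error loss and its gradient in plain Scala (illustrative code, not MLlib). Note also that replacing sum()/count() with mean() in computeError reduces two RDD actions to one, since mean() is computed from a single stats() aggregation pass.

// Standalone sketch of the per-instance absolute-error loss and its gradient (illustrative only).
object AbsoluteErrorSketch {
  def loss(label: Double, prediction: Double): Double = math.abs(label - prediction)

  // d|y - F(x)| / dF(x) = sign(F(x) - y), matching the doc comment in the hunk above.
  def gradient(label: Double, prediction: Double): Double = math.signum(prediction - label)
}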
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/LogLoss.scala

Lines changed: 23 additions & 12 deletions
@@ -17,47 +17,58 @@
 
 package org.apache.spark.mllib.tree.loss
 
+import org.apache.spark.SparkContext._
 import org.apache.spark.annotation.DeveloperApi
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.model.TreeEnsembleModel
 import org.apache.spark.rdd.RDD
 
 /**
  * :: DeveloperApi ::
- * Class for least squares error loss calculation.
+ * Class for log loss calculation (for classification).
+ * This uses twice the binomial negative log likelihood, called "deviance" in Friedman (1999).
  *
- * The features x and the corresponding label y is predicted using the function F.
- * For each instance:
- * Loss: log(1 + exp(-2yF)), y in {-1, 1}
- * Negative gradient: 2y / ( 1 + exp(2yF))
+ * The log loss is defined as:
+ *  2 log(1 + exp(-2 y F(x)))
+ * where y is a label in {-1, 1} and F(x) is the model prediction for features x.
  */
 @DeveloperApi
 object LogLoss extends Loss {
 
   /**
    * Method to calculate the loss gradients for the gradient boosting calculation for binary
    * classification
-   * @param model Model of the weak learner
+   * The gradient with respect to F(x) is: - 4 y / (1 + exp(2 y F(x)))
+   * @param model Ensemble model
    * @param point Instance of the training dataset
    * @return Loss gradient
    */
   override def gradient(
       model: TreeEnsembleModel,
       point: LabeledPoint): Double = {
     val prediction = model.predict(point.features)
-    1.0 / (1.0 + math.exp(-prediction)) - point.label
+    - 4.0 * point.label / (1.0 + math.exp(2.0 * point.label * prediction))
   }
 
   /**
-   * Method to calculate error of the base learner for the gradient boosting calculation.
+   * Method to calculate loss of the base learner for the gradient boosting calculation.
    * Note: This method is not used by the gradient boosting algorithm but is useful for debugging
    * purposes.
-   * @param model Model of the weak learner.
+   * @param model Ensemble model
    * @param data Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return
+   * @return Mean log loss of model on data
   */
   override def computeError(model: TreeEnsembleModel, data: RDD[LabeledPoint]): Double = {
-    val wrongPredictions = data.filter(lp => model.predict(lp.features) != lp.label).count()
-    wrongPredictions / data.count
+    data.map { case point =>
+      val prediction = model.predict(point.features)
+      val margin = 2.0 * point.label * prediction
+      // The following are equivalent to 2.0 * log(1 + exp(-margin)) but are more numerically
+      // stable.
+      if (margin >= 0) {
+        2.0 * math.log1p(math.exp(-margin))
+      } else {
+        2.0 * (-margin + math.log1p(math.exp(margin)))
+      }
+    }.mean()
   }
 }

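The numerically stable branch added to computeError above can be exercised in isolation. Here is a standalone sketch (plain Scala, illustrative names) of the loss and the gradient from the updated doc comment: for very negative margins exp(-margin) overflows, so the loss is rewritten as 2 * (-margin + log(1 + exp(margin))), which is algebraically identical to 2 * log(1 + exp(-margin)).

// Standalone sketch of the log loss and its gradient (illustrative only).
object LogLossSketch {
  // loss = 2 * log(1 + exp(-margin)) with margin = 2 * y * F(x) and y in {-1, 1}.
  def loss(label: Double, prediction: Double): Double = {
    val margin = 2.0 * label * prediction
    if (margin >= 0) 2.0 * math.log1p(math.exp(-margin))
    else 2.0 * (-margin + math.log1p(math.exp(margin)))
  }

  // Gradient of the loss with respect to F(x): -4y / (1 + exp(2 y F(x))).
  def gradient(label: Double, prediction: Double): Double =
    -4.0 * label / (1.0 + math.exp(2.0 * label * prediction))
}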
mllib/src/main/scala/org/apache/spark/mllib/tree/loss/SquaredError.scala

Lines changed: 10 additions & 11 deletions
@@ -25,42 +25,41 @@ import org.apache.spark.rdd.RDD
 
 /**
  * :: DeveloperApi ::
- * Class for least squares error loss calculation.
+ * Class for squared error loss calculation.
  *
- * The features x and the corresponding label y is predicted using the function F.
- * For each instance:
- * Loss: (y - F)**2/2
- * Negative gradient: y - F
+ * The squared (L2) error is defined as:
+ *  (y - F(x))**2
+ * where y is the label and F(x) is the model prediction for features x.
  */
 @DeveloperApi
 object SquaredError extends Loss {
 
   /**
    * Method to calculate the gradients for the gradient boosting calculation for least
    * squares error calculation.
-   * @param model Model of the weak learner
+   * The gradient with respect to F(x) is: - 2 (y - F(x))
+   * @param model Ensemble model
    * @param point Instance of the training dataset
    * @return Loss gradient
    */
   override def gradient(
       model: TreeEnsembleModel,
       point: LabeledPoint): Double = {
-    model.predict(point.features) - point.label
+    2.0 * (model.predict(point.features) - point.label)
   }
 
   /**
-   * Method to calculate error of the base learner for the gradient boosting calculation.
+   * Method to calculate loss of the base learner for the gradient boosting calculation.
    * Note: This method is not used by the gradient boosting algorithm but is useful for debugging
    * purposes.
-   * @param model Model of the weak learner.
+   * @param model Ensemble model
    * @param data Training dataset: RDD of [[org.apache.spark.mllib.regression.LabeledPoint]].
-   * @return
+   * @return Mean squared error of model on data
   */
   override def computeError(model: TreeEnsembleModel, data: RDD[LabeledPoint]): Double = {
     data.map { y =>
       val err = model.predict(y.features) - y.label
       err * err
     }.mean()
   }
-
 }

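For completeness, the squared-error pair after this change, in the same standalone style as the other loss sketches (plain Scala, not MLlib code): the gradient is now 2(F(x) - y), consistent with the loss being (y - F(x))**2 rather than the halved form documented previously.

// Standalone sketch of the squared-error loss and its gradient (illustrative only).
object SquaredErrorSketch {
  def loss(label: Double, prediction: Double): Double = {
    val err = label - prediction
    err * err
  }

  // d(y - F(x))^2 / dF(x) = -2 (y - F(x)) = 2 (F(x) - y).
  def gradient(label: Double, prediction: Double): Double = 2.0 * (prediction - label)
}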
mllib/src/test/scala/org/apache/spark/mllib/tree/GradientBoostedTreesSuite.scala

Lines changed: 49 additions & 25 deletions
@@ -35,32 +35,39 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
   test("Regression with continuous features: SquaredError") {
     GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
-        val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr, 2)
-
-        val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
-          categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
-        val boostingStrategy =
-          new BoostingStrategy(treeStrategy, SquaredError, numIterations, learningRate)
-
-        val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
-
-        assert(gbt.trees.size === numIterations)
-        EnsembleTestHelper.validateRegressor(gbt, arr, 0.03)
-
-        val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
-        val dt = DecisionTree.train(remappedInput, treeStrategy)
-
-        // Make sure trees are the same.
-        assert(gbt.trees.head.toString == dt.toString)
+        GradientBoostedTreesSuite.randomSeeds.foreach { randomSeed =>
+          val rdd = sc.parallelize(GradientBoostedTreesSuite.data, 2)
+
+          val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
+            categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
+          val boostingStrategy =
+            new BoostingStrategy(treeStrategy, SquaredError, numIterations, learningRate)
+
+          val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
+
+          assert(gbt.trees.size === numIterations)
+          try {
+            EnsembleTestHelper.validateRegressor(gbt, GradientBoostedTreesSuite.data, 0.06)
+          } catch {
+            case e: java.lang.AssertionError =>
+              println(s"FAILED for numIterations=$numIterations, learningRate=$learningRate," +
+                s" subsamplingRate=$subsamplingRate")
+              throw e
+          }
+
+          val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
+          val dt = DecisionTree.train(remappedInput, treeStrategy)
+
+          // Make sure trees are the same.
+          assert(gbt.trees.head.toString == dt.toString)
+        }
     }
   }
 
   test("Regression with continuous features: Absolute Error") {
     GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
-        val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr, 2)
+        val rdd = sc.parallelize(GradientBoostedTreesSuite.data, 2)
 
         val treeStrategy = new Strategy(algo = Regression, impurity = Variance, maxDepth = 2,
           categoricalFeaturesInfo = Map.empty, subsamplingRate = subsamplingRate)
@@ -70,7 +77,14 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
         val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
 
         assert(gbt.trees.size === numIterations)
-        EnsembleTestHelper.validateRegressor(gbt, arr, 0.85, "mae")
+        try {
+          EnsembleTestHelper.validateRegressor(gbt, GradientBoostedTreesSuite.data, 0.85, "mae")
+        } catch {
+          case e: java.lang.AssertionError =>
+            println(s"FAILED for numIterations=$numIterations, learningRate=$learningRate," +
+              s" subsamplingRate=$subsamplingRate")
+            throw e
+        }
 
         val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val dt = DecisionTree.train(remappedInput, treeStrategy)
@@ -83,8 +97,7 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
   test("Binary classification with continuous features: Log Loss") {
     GradientBoostedTreesSuite.testCombinations.foreach {
       case (numIterations, learningRate, subsamplingRate) =>
-        val arr = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
-        val rdd = sc.parallelize(arr, 2)
+        val rdd = sc.parallelize(GradientBoostedTreesSuite.data, 2)
 
         val treeStrategy = new Strategy(algo = Classification, impurity = Variance, maxDepth = 2,
           numClassesForClassification = 2, categoricalFeaturesInfo = Map.empty,
@@ -95,7 +108,14 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
         val gbt = GradientBoostedTrees.train(rdd, boostingStrategy)
 
         assert(gbt.trees.size === numIterations)
-        EnsembleTestHelper.validateClassifier(gbt, arr, 0.9)
+        try {
+          EnsembleTestHelper.validateClassifier(gbt, GradientBoostedTreesSuite.data, 0.9)
+        } catch {
+          case e: java.lang.AssertionError =>
+            println(s"FAILED for numIterations=$numIterations, learningRate=$learningRate," +
+              s" subsamplingRate=$subsamplingRate")
+            throw e
+        }
 
         val remappedInput = rdd.map(x => new LabeledPoint((x.label * 2) - 1, x.features))
         val ensembleStrategy = treeStrategy.copy
@@ -113,5 +133,9 @@ class GradientBoostedTreesSuite extends FunSuite with MLlibTestSparkContext {
 object GradientBoostedTreesSuite {
 
   // Combinations for estimators, learning rates and subsamplingRate
-  val testCombinations = Array((10, 1.0, 1.0), (10, 0.1, 1.0), (10, 1.0, 0.75), (10, 0.1, 0.75))
+  val testCombinations = Array((10, 1.0, 1.0), (10, 0.1, 1.0), (10, 0.5, 0.75), (10, 0.1, 0.75))
+
+  val randomSeeds = Array(681283, 4398)
+
+  val data = EnsembleTestHelper.generateOrderedLabeledPoints(numFeatures = 10, 100)
 }

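The test setup above doubles as a usage example. Below is a condensed sketch of training a GBT with LogLoss on this branch, mirroring the test code; the import paths and the trainingData RDD are assumptions for illustration, not taken verbatim from the diff.

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.{BoostingStrategy, Strategy}
import org.apache.spark.mllib.tree.configuration.Algo.Classification
import org.apache.spark.mllib.tree.impurity.Variance
import org.apache.spark.mllib.tree.loss.LogLoss
import org.apache.spark.rdd.RDD

def trainLogLossGbt(trainingData: RDD[LabeledPoint]) = {
  val numIterations = 10
  val learningRate = 0.1
  // Same Strategy/BoostingStrategy construction as in the test above.
  val treeStrategy = new Strategy(algo = Classification, impurity = Variance, maxDepth = 2,
    numClassesForClassification = 2, categoricalFeaturesInfo = Map.empty, subsamplingRate = 1.0)
  val boostingStrategy = new BoostingStrategy(treeStrategy, LogLoss, numIterations, learningRate)
  GradientBoostedTrees.train(trainingData, boostingStrategy)
}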