[SPARK-18239][SPARKR] Gradient Boosted Tree for R #15746
Conversation
  class RandomForestRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol, HasSeed,
                              RandomForestParams, TreeRegressorParams, HasCheckpointInterval,
-                             JavaMLWritable, JavaMLReadable, HasVarianceCol):
+                             JavaMLWritable, JavaMLReadable):
this was an erroneous change - RandomForest does not have a variance column, unlike DecisionTree, so removing it
  featureSubsetStrategy = "auto", seed = NULL, subsamplingRate = 1.0,
- probabilityCol = "probability", maxMemoryInMB = 256, cacheNodeIds = FALSE) {
+ minInstancesPerNode = 1, minInfoGain = 0.0, checkpointInterval = 10,
+ maxMemoryInMB = 256, cacheNodeIds = FALSE, probabilityCol = "probability") {
reordering parameters to match common/expert param types
Test build #68049 has finished for PR 15746 at commit
@mengxr @yanboliang Could you review this?

I'll try to take a look by end of this week.
Sure, I can make a pass tomorrow.

Test build #68072 has finished for PR 15746 at commit
R/pkg/R/mllib.R
#' Gradient Boosted Tree model, \code{predict} to make predictions on new data, and
#' \code{write.ml}/\code{read.ml} to save/load fitted models.
#' For more details, see
#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html}{GBT}
Directly linking to http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier and http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression would be clearer?
I thought it was verbose to have 2 links, but I guess they are just links. Added.
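For reference, a plausible shape of the updated roxygen text with both anchors (the exact merged wording may differ):

```r
#' For more details, see
#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-regression}{GBT Regression} and
#' \href{http://spark.apache.org/docs/latest/ml-classification-regression.html#gradient-boosted-tree-classifier}{GBT Classifier}
```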
#' @param minInstancesPerNode Minimum number of instances each child must have after split. If a
#'   split causes the left or right child to have fewer than
#'   minInstancesPerNode, the split will be discarded as invalid. Should be
#'   >= 1.
(default = 1)
I was debating this. The other default text comes from Scala, and I thought it would be nice to have a single text, but generally R doc text does not list the default value since it is clearly stated in the function signature right above on the Rd page.
So I'm removing all other "default = something" text unless it brings additional value (like explaining why).
This is the same in Python.
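To make the convention concrete, a minimal hypothetical roxygen sketch (`fitExample` is not part of this PR): the default already appears in the generated \usage section, so the @param text omits it.

```r
#' @param maxDepth Maximum depth of the tree (>= 0).
#'   Note: no "(default = 5)" here; the Rd page's \usage section
#'   already shows "maxDepth = 5" from the signature below.
fitExample <- function(data, maxDepth = 5) {
  invisible(maxDepth)
}
```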
#' split causes the left or right child to have fewer than
#' minInstancesPerNode, the split will be discarded as invalid. Should be
#' >= 1.
#' @param minInfoGain Minimum information gain for a split to be considered at a tree node.
(default = 0.0)
#' minInstancesPerNode, the split will be discarded as invalid. Should be
#' >= 1.
#' @param minInfoGain Minimum information gain for a split to be considered at a tree node.
#' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1).
(default = 10)
#' >= 1.
#' @param minInfoGain Minimum information gain for a split to be considered at a tree node.
#' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1).
#' @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation.
(default = 256)
#' @param minInfoGain Minimum information gain for a split to be considered at a tree node.
#' @param checkpointInterval Param for set checkpoint interval (>= 1) or disable checkpoint (-1).
#' @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation.
#' @param cacheNodeIds If FALSE, the algorithm will pass trees to executors to match instances with
If TRUE, the algorithm will cache node IDs for each instance. (default = FALSE)
Caching can speed up training of deeper trees. Users can set how often the cache should be checkpointed, or disable it, by setting checkpointInterval.
updated.
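A hedged usage sketch of these two parameters through the spark.gbt API added in this PR (the longley dataset choice is illustrative):

```r
library(SparkR)

df <- suppressWarnings(createDataFrame(longley))
# Cache node IDs per instance to speed up training of deeper trees; the cache
# is checkpointed every 10 iterations (this takes effect only when a
# checkpoint directory has been set).
model <- spark.gbt(df, Employed ~ ., type = "regression",
                   maxDepth = 8, cacheNodeIds = TRUE, checkpointInterval = 10)
```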
if (seed != null && seed.length > 0) rfc.setSeed(seed.toLong)

val pipeline = new Pipeline()
  .setStages(Array(rFormulaModel, rfc))
I think spark.gbt also needs to support binary classification on a dataset with string labels such as Yes and No. This implementation will output a double value when making predictions, which may confuse users; we should convert the double value back to the original string label. You can refer to NaiveBayesWrapper for how to construct the pipeline. BTW, add an R test for a dataset with string labels.
the existing test uses a string label; I'll add a test for a numeric label
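A minimal sketch of the string-label case under discussion (hypothetical session; GBT classification is binary, hence the two-class subset):

```r
library(SparkR)

iris2 <- iris[iris$Species != "virginica", ]    # two classes: setosa, versicolor
df <- suppressWarnings(createDataFrame(iris2))  # dots in column names become underscores
model <- spark.gbt(df, Species ~ Petal_Length + Petal_Width, type = "classification")

# Predictions should come back as the original string labels
# ("setosa"/"versicolor"), not as doubles like 0.0/1.0.
head(select(predict(model, df), "prediction"))
```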
val formula: String,
val features: Array[String]) extends MLWritable {

private val DTModel: GBTClassificationModel =
DTModel -> gbcModel would be better? The variable name should not start with an uppercase letter.
fixed, thx, copy/paste mistake
val formula: String,
val features: Array[String]) extends MLWritable {

private val DTModel: GBTRegressionModel =
Ditto.
  # Prints the summary of Random Forest Regression Model
- print.summary.randomForest <- function(x) {
+ print.summary.treeEnsemble <- function(x) {
I think we should not call toDebugString and output the detailed structure of the trees. This information is used for debugging and is not easy for R users to understand.
Possibly. What would you suggest we show?
In R, generally the evaluated error should be shown in summary, and we don't really have that handy. Also, I seem to recall an ongoing issue on the lack of consistency (or lack of information) displayed to the R user, and it has been suggested we should have helper functions on the model so we could be consistent across the board in all languages (as opposed to on the R side only, like print.summary.GeneralizedLinearRegressionModel)?
I feel like there is a lot of work we could be doing here.
For example, summary on an rpart model shows both the error and node-by-node information. I think it is still useful this way:
Call:
rpart(formula = Kyphosis ~ Age + Number + Start, data = kyphosis,
method = "class")
n= 81
CP nsplit rel error xerror xstd
1 0.17647059 0 1.0000000 1.0000000 0.2155872
2 0.01960784 1 0.8235294 0.9411765 0.2107780
3 0.01000000 4 0.7647059 1.0588235 0.2200975
Variable importance
Start Age Number
64 24 12
Node number 1: 81 observations, complexity param=0.1764706
predicted class=absent expected loss=0.2098765 P(node) =1
class counts: 64 17
probabilities: 0.790 0.210
left son=2 (62 obs) right son=3 (19 obs)
Primary splits:
Start < 8.5 to the right, improve=6.762330, (0 missing)
Number < 5.5 to the left, improve=2.866795, (0 missing)
Age < 39.5 to the left, improve=2.250212, (0 missing)
Surrogate splits:
Number < 6.5 to the left, agree=0.802, adj=0.158, (0 split)
Node number 2: 62 observations, complexity param=0.01960784
predicted class=absent expected loss=0.09677419 P(node) =0.7654321
class counts: 56 6
probabilities: 0.903 0.097
left son=4 (29 obs) right son=5 (33 obs)
Primary splits:
Start < 14.5 to the right, improve=1.0205280, (0 missing)
Age < 55 to the left, improve=0.6848635, (0 missing)
Number < 4.5 to the left, improve=0.2975332, (0 missing)
Surrogate splits:
Number < 3.5 to the left, agree=0.645, adj=0.241, (0 split)
Age < 16 to the left, agree=0.597, adj=0.138, (0 split)
Node number 3: 19 observations
predicted class=present expected loss=0.4210526 P(node) =0.2345679
class counts: 8 11
probabilities: 0.421 0.579
Node number 4: 29 observations
predicted class=absent expected loss=0 P(node) =0.3580247
class counts: 29 0
probabilities: 1.000 0.000
Node number 5: 33 observations, complexity param=0.01960784
predicted class=absent expected loss=0.1818182 P(node) =0.4074074
class counts: 27 6
probabilities: 0.818 0.182
left son=10 (12 obs) right son=11 (21 obs)
Primary splits:
Age < 55 to the left, improve=1.2467530, (0 missing)
Start < 12.5 to the right, improve=0.2887701, (0 missing)
Number < 3.5 to the right, improve=0.1753247, (0 missing)
Surrogate splits:
Start < 9.5 to the left, agree=0.758, adj=0.333, (0 split)
Number < 5.5 to the right, agree=0.697, adj=0.167, (0 split)
Node number 10: 12 observations
predicted class=absent expected loss=0 P(node) =0.1481481
class counts: 12 0
probabilities: 1.000 0.000
Node number 11: 21 observations, complexity param=0.01960784
predicted class=absent expected loss=0.2857143 P(node) =0.2592593
class counts: 15 6
probabilities: 0.714 0.286
left son=22 (14 obs) right son=23 (7 obs)
Primary splits:
Age < 111 to the right, improve=1.71428600, (0 missing)
Start < 12.5 to the right, improve=0.79365080, (0 missing)
Number < 3.5 to the right, improve=0.07142857, (0 missing)
Node number 22: 14 observations
predicted class=absent expected loss=0.1428571 P(node) =0.1728395
class counts: 12 2
probabilities: 0.857 0.143
Node number 23: 7 observations
predicted class=present expected loss=0.4285714 P(node) =0.08641975
class counts: 3 4
probabilities: 0.429 0.571
That's OK. I thought the output may be very large and flood the screen. The output of toDebugString is also not very legible compared with rpart. I like the idea of making the summary string consistent between languages. Let's get this in first and improve toDebugString on the Scala side in a separate task, which can also benefit SparkR.
I'll open a JIRA on that
opened SPARK-18348
…ic label (force index label & predicted label to string), tests
Test build #68166 has finished for PR 15746 at commit
any more thoughts on this?
iris2$NumericSpecies <- ifelse(iris2$Species == "setosa", 0, 1)
df <- suppressWarnings(createDataFrame(iris2))
m <- spark.gbt(df, NumericSpecies ~ ., type = "classification")
s <- summary(m)
It looks like we never use this line?
added a test, but this is mostly to make sure the call is not failing
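A hedged sketch of the kind of assertion this becomes (expectations illustrative; `m` is the model from the snippet above):

```r
expect_error(s <- summary(m), NA)  # the summary call itself should not fail
expect_equal(s$numTrees, 20)       # 20 matches the GBT default number of iterations
```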
               68.655, 69.564, 69.331, 70.551),
             tolerance = 1e-4)
stats <- summary(model)
expect_equal(stats$numTrees, 20)
Why only check numTrees? I think we should also check numFeatures, featureImportances and treeWeights at least. Any thoughts?
added. featureImportances is a bit tricky - in the JVM it's a Vector and doesn't translate to something accessible in R (see SPARK-18226),
so for now featureImportances is converted to a string; let's skip testing that for now.
I see. Since there is no object representing an ML Vector in SparkR currently, I'd like to convert the type of featureImportances from Vector to Array in GBTClassifierWrapper.scala:
lazy val featureImportances: Array[Double] = gbtcModel.featureImportances.toArray
Then it can be translated to an R list. Users may sort or select the feature importances, so returning an R list should make more sense. Any thoughts?
I think I tried that, and it was really a SparseVector, so converting to an Array made it fairly unreadable and unusable.
I think SparseVector should really map to a Map or a Properties.
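To make the trade-off concrete, a hypothetical sketch (values invented) contrasting the two shapes discussed - a dense array from the SparseVector versus a map-like structure:

```r
# Dense array from SparseVector.toArray: mostly zeros, feature positions opaque.
importances_array <- c(0, 0, 0.61, 0.37, 0, 0)

# Map-like alternative: a named list keyed by feature, easy to sort or select.
importances <- list(Petal_Length = 0.61, Petal_Width = 0.37)
importances[order(unlist(importances), decreasing = TRUE)]
```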
function(object, path, overwrite = FALSE) {
  write_internal(object, path, overwrite)
})
Perhaps add a line of annotation: "Get the summary of a GBTRegressionModel model". I know it will not appear in the R doc; it is there to help developers understand the code.
R/pkg/R/mllib.R
})

#' @return \code{summary} returns the model's features as lists, depth and number of nodes
#'         or number of classes.
Should we clarify the return values further, such as feature importances, tree weights, number of trees, etc.?
updated. I took a shot at updating other models, but we have a lot of issues with details and consistency across all the other ML models - I'll open a JIRA to track.
opened SPARK-18349
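One plausible shape of the clarified \code{@return} text, using only components mentioned in this thread (the merged wording may differ):

```r
#' @return \code{summary} returns a list of components including \code{formula},
#'         \code{numFeatures}, \code{features}, \code{featureImportances},
#'         \code{numTrees}, and \code{treeWeights}.
```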
@felixcheung I made another pass and left some minor comments, otherwise, looks good to me. Thanks.

Test build #68295 has finished for PR 15746 at commit

Jenkins, retest this please

Test build #68302 has finished for PR 15746 at commit

The best sparse vector support in R comes from the

That's a great suggestion. I've added it to SPARK-18131
merged to master and branch-2.1 |
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #15746 from felixcheung/rgbt.
(cherry picked from commit 55964c1)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
What changes were proposed in this pull request?
Gradient Boosted Tree in R.
With a few minor improvements to RandomForest in R.
Since this is relatively isolated I'd like to target this for branch-2.1
How was this patch tested?
manual tests, unit tests