[SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes #12360

yongtang · 2016-04-13T15:08:24Z

What changes were proposed in this pull request?

This fix changes how RandomForest parses numeric feature subset strategies: instead of matching them with regexes, it uses parseInt and parseDouble, which is more robust and easier to maintain.

How was this patch tested?

Existing tests passed.


srowen · 2016-04-13T15:16:59Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

+  object integerFeatureSubsetStrategy {
+    def unapply(strategy: String): Option[Int] = try {
+      val number = strategy.toInt
+      if (0 < number) {


Nit: `number > 0`? Also, import java.lang.NumberFormatException.

mengxr · 2016-04-13T16:35:15Z

ok to test

mengxr · 2016-04-13T16:39:13Z

Besides the inline comments, we should also consider the behavior at 1.0. Since "all" is already an option, maybe we should treat 1.0 as numFeaturesPerNode = 1 instead of "all". @jkbradley

SparkQA · 2016-04-13T17:12:48Z

Test build #55726 has finished for PR 12360 at commit 57456d3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-13T20:54:39Z

@mengxr I'd prefer to treat "1" as numFeaturesPerNode = 1 and "1.0" as equivalent to "all" in order to match sklearn's semantics.
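To make the proposed semantics concrete, here is a minimal sketch of how "1" and "1.0" could resolve differently. It is not the merged Spark code: `FeatureSubsetSketch` and `resolve` are hypothetical names, and capping an integer count at `numFeatures` and rounding fractions up are assumptions made for illustration.

```scala
import scala.util.Try

object FeatureSubsetSketch {
  // Hypothetical helper (not Spark's API): resolve a numeric strategy string
  // to a per-node feature count, or None if the string is not a valid number.
  def resolve(strategy: String, numFeatures: Int): Option[Int] = {
    // Integer form, e.g. "1" or "5": an absolute count.
    // Assumption: the count is capped at numFeatures.
    val asCount: Option[Int] =
      Try(strategy.toInt).filter(_ > 0).toOption.map(n => math.min(n, numFeatures))

    // Fractional form, e.g. "0.5" or "1.0": a share of all features.
    // Only evaluated when the integer parse fails or is rejected.
    def asFraction: Option[Int] =
      Try(strategy.toDouble)
        .filter(f => f > 0.0 && f <= 1.0)
        .toOption
        .map(f => math.ceil(f * numFeatures).toInt)

    asCount.orElse(asFraction)
  }
}
```

Because `"1.0".toInt` throws a `NumberFormatException`, the string falls through to the fractional branch and selects every feature, while `"1"` parses as an integer and selects a single feature.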

jkbradley · 2016-04-13T22:36:06Z

One more comment: I'd like to make sure there is always at least 1 feature being used. Could you please update the parsing and add that to the unit test in RandomForestSuite.scala? Thanks!
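One way to guarantee at least one feature is to round the fractional count up and clamp the result. The sketch below is illustrative only; `FeatureCountSketch` and `fromFraction` are hypothetical names, not identifiers from the patch, and the clamping bounds are an assumption.

```scala
object FeatureCountSketch {
  // Hypothetical helper: turn a fraction in (0.0, 1.0] into a feature count
  // that is never below 1 and never above numFeatures.
  def fromFraction(fraction: Double, numFeatures: Int): Int = {
    require(fraction > 0.0 && fraction <= 1.0,
      s"fraction must be in (0.0, 1.0], got $fraction")
    require(numFeatures >= 1, s"numFeatures must be positive, got $numFeatures")
    // Rounding up (rather than down) keeps small fractions from collapsing to 0.
    val rounded = math.ceil(fraction * numFeatures).toInt
    // Defensive clamp to the range [1, numFeatures].
    math.max(1, math.min(rounded, numFeatures))
  }
}
```

With 10 features, a fraction of 0.01 would truncate to 0 features if rounded down; rounding up yields 1, so every split still has a feature to consider.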

…r feature subset size instead of regexes Update to use Try and filter to simplify the code.

yongtang · 2016-04-14T04:48:18Z

Thanks @srowen @mengxr @jkbradley. The pull request has been updated. Please let me know if there are any further issues.

mengxr · 2016-04-14T05:02:12Z

LGTM pending Jenkins

SparkQA · 2016-04-14T05:27:36Z

Test build #55794 has finished for PR 12360 at commit ed346cd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-04-15T00:23:31Z

Merged into master. Thanks!

## What changes were proposed in this pull request?

This PR supports more options for the feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", and "onethird". This PR accepts any given value to allow model search.

With this PR, `featureSubsetStrategy` can be passed as:

a) a real number in the range `(0.0, 1.0]` that represents the fraction of the number of features in each subset,
b) an integer (`>0`) that represents the number of features in each subset.

## How was this patch tested?

Two tests, `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite`, have been updated to check the additional options for params in this PR. An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #11989 from yongtang/SPARK-3724.

[SPARK-14565][ML] RandomForest should use parseInt and parseDouble fo…

57456d3

…r feature subset size instead of regexes This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and parseDouble, for the purpose of robustness and maintainability.

srowen reviewed Apr 13, 2016

[SPARK-14565][ML] RandomForest should use parseInt and parseDouble fo…

ed346cd

…r feature subset size instead of regexes Update to use Try and filter to simplify the code.

asfgit closed this in 01dd1f5 Apr 15, 2016

yongtang deleted the SPARK-14565 branch April 15, 2016 00:26
