Skip to content

Conversation

@yongtang
Copy link
Contributor

What changes were proposed in this pull request?

This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and
parseDouble, for the purpose of robustness and maintainability.

How was this patch tested?

Existing tests passed.

…r feature subset size instead of regexes

This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and
parseDouble, for the purpose of robustness and maintainability.
object integerFeatureSubsetStrategy {
def unapply(strategy: String): Option[Int] = try {
val number = strategy.toInt
if (0 < number) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: number > 0? and import java.lang.NumberFormatException

@mengxr
Copy link
Contributor

mengxr commented Apr 13, 2016

ok to test

@mengxr
Copy link
Contributor

mengxr commented Apr 13, 2016

Besides inline comments, we should also consider the behavior at 1.0. Since all is already an option, maybe we should treat 1.0 as numFeaturesPerNode = 1 instead of all. @jkbradley

@SparkQA
Copy link

SparkQA commented Apr 13, 2016

Test build #55726 has finished for PR 12360 at commit 57456d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

@mengxr I'd prefer to treat "1" as numFeaturesPerNode = 1 and "1.0" as equivalent to "all" in order to match sklearn's semantics.

@jkbradley
Copy link
Member

One more comment: I'd like to make sure there is always at least 1 feature being used. Could you please update the parsing and add that to the unit test in RandomForestSuite.scala? Thanks!

…r feature subset size instead of regexes

Update to use Try and filter to simplify the code.
@yongtang
Copy link
Contributor Author

Thanks @srowen @mengxr @jkbradley. The pull request has been updated. Please let me know if there are any further issues.

@mengxr
Copy link
Contributor

mengxr commented Apr 14, 2016

LGTM pending Jenkins

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55794 has finished for PR 12360 at commit ed346cd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Apr 15, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 01dd1f5 Apr 15, 2016
@yongtang yongtang deleted the SPARK-14565 branch April 15, 2016 00:26
yongtang referenced this pull request Apr 15, 2016
## What changes were proposed in this pull request?

This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sort", "log2", "onethird". This PR tries to support any given value to allow model search.

In this PR, `featureSubsetStrategy` could be passed with:
a) a real number in the range of `(0.0-1.0]` that represents the fraction of the number of features in each subset,
b)  an integer number (`>0`) that represents the number of features in each subset.

## How was this patch tested?

Two tests `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite` have been updated to check the additional options for params in this PR.
An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #11989 from yongtang/SPARK-3724.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants