[SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes #12360
…r feature subset size instead of regexes This fix tries to change RandomForest's supported strategies from using regexes to using parseInt and parseDouble, for the purpose of robustness and maintainability.
object integerFeatureSubsetStrategy {
  def unapply(strategy: String): Option[Int] = try {
    val number = strategy.toInt
    if (0 < number) {
Nit: number > 0? and import java.lang.NumberFormatException
ok to test
Besides inline comments, we should also consider the behavior at
Test build #55726 has finished for PR 12360 at commit
@mengxr I'd prefer to treat "1" as numFeaturesPerNode = 1 and "1.0" as equivalent to "all" in order to match sklearn's semantics.
One more comment: I'd like to make sure there is always at least 1 feature being used. Could you please update the parsing and add that to the unit test in RandomForestSuite.scala? Thanks!
…r feature subset size instead of regexes Update to use Try and filter to simplify the code.
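The `Try` and `filter` approach mentioned in this commit can be sketched as follows. This is a minimal illustration, not the merged code: the object names `FeatureSubsetSketch`, `IntegerFeatureSubset`, and `FractionFeatureSubset` and the `describe` helper are hypothetical, modeled on the extractor in the diff above. It also shows how trying the integer extractor first keeps "1" (one feature) distinct from "1.0" (all features), as requested in the review.

```scala
import scala.util.Try

// Hypothetical names for illustration; not taken from the Spark source.
object FeatureSubsetSketch {

  object IntegerFeatureSubset {
    // Succeeds only for strings that parse as a positive Int, e.g. "1" or "25".
    // "1.0" fails here because toInt throws NumberFormatException.
    def unapply(strategy: String): Option[Int] =
      Try(strategy.toInt).filter(_ > 0).toOption
  }

  object FractionFeatureSubset {
    // Succeeds only for strings that parse as a Double in (0.0, 1.0].
    def unapply(strategy: String): Option[Double] =
      Try(strategy.toDouble).filter(f => f > 0.0 && f <= 1.0).toOption
  }

  // The integer case is tried first, so "1" is a count and "1.0" is a fraction.
  def describe(strategy: String): String = strategy match {
    case IntegerFeatureSubset(n)  => s"count=$n"
    case FractionFeatureSubset(f) => s"fraction=$f"
    case other                    => s"unsupported=$other"
  }
}
```

Because a failed parse and a failed range check both collapse to `None`, no explicit `NumberFormatException` handling is needed in this form.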
Thanks @srowen @mengxr @jkbradley. The pull request has been updated. Please let me know if there are any further issues.
LGTM pending Jenkins
Test build #55794 has finished for PR 12360 at commit
Merged into master. Thanks!
## What changes were proposed in this pull request?

This PR supports more options for the feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", and "onethird". This PR accepts any given value to allow model search.

In this PR, `featureSubsetStrategy` can be passed as:
a) a real number in the range `(0.0, 1.0]` that represents the fraction of the number of features in each subset,
b) an integer number (`>0`) that represents the number of features in each subset.

## How was this patch tested?

Two tests, `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite`, have been updated to check the additional options for params in this PR. An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #11989 from yongtang/SPARK-3724.
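To make the two numeric options concrete, here is a minimal sketch of how an already-parsed value might be turned into a number of features per subset. The object and function names are hypothetical, and two behaviors are assumptions rather than facts from this description: the fraction is rounded up, and an integer larger than the feature count is capped at the feature count.

```scala
// Hypothetical helper for illustration; not the Spark implementation.
object FeatureCountSketch {

  // Option (b): a positive integer is a feature count.
  // Assumption: values above numFeatures are capped at numFeatures.
  def fromCount(count: Int, numFeatures: Int): Int = {
    require(numFeatures > 0, s"numFeatures must be positive, got $numFeatures")
    require(count > 0, s"count must be positive, got $count")
    math.min(count, numFeatures)
  }

  // Option (a): a real number in (0.0, 1.0] is a fraction of all features.
  // Assumption: round up, so any valid fraction yields at least one feature.
  def fromFraction(fraction: Double, numFeatures: Int): Int = {
    require(numFeatures > 0, s"numFeatures must be positive, got $numFeatures")
    require(fraction > 0.0 && fraction <= 1.0,
      s"fraction must be in (0.0, 1.0], got $fraction")
    math.ceil(fraction * numFeatures).toInt
  }
}
```

Under these assumptions, a fraction of `1.0` selects every feature, while an integer `1` selects exactly one.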
What changes were proposed in this pull request?
This fix changes how RandomForest parses its supported feature subset strategies, replacing regexes with parseInt and
parseDouble for robustness and maintainability.
How was this patch tested?
Existing tests passed.