-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-9478][ML][PYSPARK] Add sample weights to Random Forest #27097
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| val instances: RDD[Instance] = extractLabeledPoints(dataset, numClasses).map(_.toInstance) | ||
| val instances = extractInstances(dataset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this better?
validateNumClasses(numClasses)
val instances = extractInstances(dataset, numClasses)
| (20, 5, 1.0, 0.96), | ||
| (20, 10, 1.0, 0.96), | ||
| (20, 10, 0.95, 0.96) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess maybe also add different impurity in testParams?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe also test a special case numTrees = 1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with numTrees==1, RF is exactly the DecisionTree, which is already tested in DecisionTreeClassifierSuite/DecisionTreeRegressorSuite.
I guess maybe also add different impurity in testParams?
I guess current tests maybe enough, Testsuites for DT/GBT do not test impurity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason I suggested testing different impurities is because when calculating best split, the impurity path (both entropy and gini) is affected by sample weight. However, after taking a look at the DecisionTree test, I saw both entropy and gini are tested with sample weight there, so this is already covered in DecisionTree test, no need to test here.
|
Test build #116125 has finished for PR 27097 at commit
|
|
friendly ping @srowen @imatiach-msft |
|
Test build #116160 has finished for PR 27097 at commit
|
mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala
Show resolved
Hide resolved
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huaxingao do you have thoughts? looks reasonably straightforward, and a long standing feature request
|
@srowen The change looks fine to me. Let me take another look later today or tomorrow. |
|
LGTM :) |
|
Jenkins, retest this please |
|
Test build #116644 has finished for PR 27097 at commit
|
imatiach-msft
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
Merged to master |
|
Thanks @srowen @imatiach-msft @huaxingao for reviewing! |
What changes were proposed in this pull request?
1, change
convertToBaggedRDDSamplingWithReplacementto attach instance weights2, make RF supports weights
Why are the changes needed?
weightColis already exposed, while RF has not support weights.Does this PR introduce any user-facing change?
Yes, new setters
How was this patch tested?
added testsuites