[SPARK-9612][ML][FOLLOWUP] fix GBT support weights if subsamplingRate<1 #27070
ping @imatiach-msft
Test build #116016 has finished for PR 27070 at commit
Test build #116025 has finished for PR 27070 at commit
srowen
left a comment
Looks reasonable to me
  }

  /**
   * Train a random forest.
Update the doc, e.g. "Train a random forest with metadata."
Also add a description for the metadata param.
    expectedStddev, epsilon = 0.01)
  // should ignore weight function for now
- assert(baggedRDD.collect().forall(_.sampleWeight === 1.0))
+ assert(baggedRDD.collect().forall(_.sampleWeight === 2.0))
just trying to understand, why did the sample weight change in this test?
Because this test suite meets the conditions withReplacement=false and numSubsamples!=1,
it calls the modified convertToBaggedRDDSamplingWithoutReplacement.
The extractSampleWeight here is (_: LabeledPoint) => 2.0, so the output baggedPoints have sampleWeight==2.0.
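The reply above can be illustrated with a minimal sketch in plain Scala collections rather than RDDs. `SimpleBaggedPoint` and `bagWithoutReplacement` are hypothetical names made up for this illustration and are not Spark's actual classes or methods. The sketch only assumes what the comment describes: sampling without replacement keeps or drops each point per subsample, and the weight comes from the supplied `extractSampleWeight` function.

```scala
import scala.util.Random

// Hypothetical, simplified stand-in for a bagged point (not Spark's class).
final case class SimpleBaggedPoint[T](
    datum: T,
    subsampleCounts: Array[Int],
    sampleWeight: Double)

object BaggingSketch {
  // Sampling without replacement: for every subsample a point is either
  // kept (count 1) or dropped (count 0). The weight is taken from the
  // caller-supplied function instead of being fixed at 1.0.
  def bagWithoutReplacement[T](
      input: Seq[T],
      subsamplingRate: Double,
      numSubsamples: Int,
      extractSampleWeight: T => Double,
      seed: Long): Seq[SimpleBaggedPoint[T]] = {
    val rng = new Random(seed)
    input.map { datum =>
      val counts = Array.fill(numSubsamples) {
        if (rng.nextDouble() < subsamplingRate) 1 else 0
      }
      SimpleBaggedPoint(datum, counts, extractSampleWeight(datum))
    }
  }
}
```

With a constant weight function such as `_ => 2.0`, every output point carries `sampleWeight == 2.0`, which is what the updated assertion in the test checks.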
imatiach-msft
left a comment
Great work!
  // transform nodeStatsAggregators array to (nodeIndex, nodeAggregateStats) pairs,
  // which can be combined with other partition using `reduceByKey`
- nodeStatsAggregators.view.zipWithIndex.map(_.swap).iterator
+ nodeStatsAggregators.iterator.zipWithIndex.map(_.swap)
The IDEA editor always shows warnings on these two lines; I changed them to avoid the warnings.
Test build #116063 has finished for PR 27070 at commit
Test build #116072 has finished for PR 27070 at commit
@imatiach-msft Thanks very much for your review!
Merged to master! Thanks all for reviewing!
What changes were proposed in this pull request?
1, fix BaggedPoint.convertToBaggedRDD when subsamplingRate < 1.0
2, also reorganize RandomForest.runWithMetadata
Why are the changes needed?
In GBT, instance weights will be discarded if subsamplingRate<1:
1, baggedPoint: BaggedPoint[TreePoint] is used in the tree growth to find the best split;
2, BaggedPoint[TreePoint] contains two weights: sampleWeight in BaggedPoint and weight in TreePoint;
3, only the var sampleWeight in BaggedPoint is used, the var weight in TreePoint is never used in finding splits;
4, The method BaggedPoint.convertToBaggedRDD was changed in #21632, it was only for decisiontree, so only the code path used by decision tree was changed;
5, In #25926, I made GBT support weights, but only tested it with the default subsamplingRate==1.
GBT with subsamplingRate<1 will convert treePoints to baggedPoints via a different code path, in which the original weights from weightCol will be discarded and all sampleWeight are assigned the default 1.0;
Does this PR introduce any user-facing change?
No
How was this patch tested?
updated testsuites
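
The effect described under "Why are the changes needed?" can be sketched with a toy example. The `Point` and `Bagged` types and the `weightedMeanLabel` helper below are hypothetical and heavily simplified, not Spark's `TreePoint`/`BaggedPoint` or its split-finding code. The sketch assumes only what the description states: the statistic consults the weight on the bagged wrapper and never the weight stored on the wrapped point.

```scala
// Hypothetical, simplified stand-ins; not Spark's actual classes.
final case class Point(label: Double, weight: Double)
final case class Bagged(datum: Point, subsampleCount: Int, sampleWeight: Double = 1.0)

object WeightedMeanSketch {
  // Only `sampleWeight` on the wrapper is read; `datum.weight` is ignored,
  // mirroring item 3 of the description.
  def weightedMeanLabel(points: Seq[Bagged]): Double = {
    val used = points.filter(_.subsampleCount > 0)
    val totalWeight = used.map(p => p.subsampleCount * p.sampleWeight).sum
    used.map(p => p.subsampleCount * p.sampleWeight * p.datum.label).sum / totalWeight
  }

  val data: Seq[Point] = Seq(Point(0.0, 1.0), Point(1.0, 3.0))

  // Weights discarded: sampleWeight falls back to the default 1.0.
  val discarded: Seq[Bagged] = data.map(p => Bagged(p, 1))

  // Weights preserved: sampleWeight is copied from the instance weight.
  val preserved: Seq[Bagged] = data.map(p => Bagged(p, 1, p.weight))
}
```

Here `weightedMeanLabel(discarded)` is 0.5 while `weightedMeanLabel(preserved)` is 0.75, so dropping the instance weights changes the statistic the tree would see.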