[SPARK-9478][ML][PYSPARK] Add sample weights to Random Forest #27097

zhengruifeng · 2020-01-06T03:16:40Z

What changes were proposed in this pull request?

1. Change convertToBaggedRDDSamplingWithReplacement to attach instance weights.
2. Make RF support weights.
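
As a rough illustration of item 1 (a sketch, not the code in this patch): bagging with replacement can draw a per-tree count for each row and carry the row's weight along unchanged. The names `BaggedPoint`, `poisson_draw`, `bag_with_replacement` and `effective_weight` below are hypothetical stand-ins, and the sketch assumes that a row's contribution to a given tree is its draw count times its weight.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float, float]  # (features, label, weight)


@dataclass
class BaggedPoint:
    """Hypothetical stand-in for a bagged instance; not Spark's class."""
    features: Tuple[float, ...]
    label: float
    subsample_counts: List[int]  # times each tree draws this row
    sample_weight: float         # instance weight, carried along unchanged


def poisson_draw(rng: random.Random, mean: float) -> int:
    """Draw from Poisson(mean) with Knuth's multiplication method."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def bag_with_replacement(rows: Sequence[Row], subsampling_rate: float,
                         num_trees: int, seed: int) -> List[BaggedPoint]:
    """Attach per-tree draw counts to every row and keep its weight."""
    rng = random.Random(seed)
    bagged = []
    for features, label, weight in rows:
        counts = [poisson_draw(rng, subsampling_rate) for _ in range(num_trees)]
        bagged.append(BaggedPoint(tuple(features), label, counts, weight))
    return bagged


def effective_weight(point: BaggedPoint, tree_index: int) -> float:
    """Assumed contribution of a row to one tree: draw count times weight."""
    return point.subsample_counts[tree_index] * point.sample_weight


rows = [((0.0, 1.0), 0.0, 1.0), ((1.0, 0.0), 1.0, 2.5), ((1.0, 1.0), 1.0, 0.5)]
bagged = bag_with_replacement(rows, subsampling_rate=1.0, num_trees=3, seed=42)
```

Keeping the weight separate from the draw counts means the sampling itself stays weight-agnostic; the weight only matters later, when statistics are aggregated per tree.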

Why are the changes needed?

weightCol is already exposed, but RF does not support weights yet.

Does this PR introduce any user-facing change?

Yes, new setters

How was this patch tested?

Added test suites.

huaxingao · 2020-01-06T03:49:51Z

mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala

    }

-    val instances: RDD[Instance] = extractLabeledPoints(dataset, numClasses).map(_.toInstance)
+    val instances = extractInstances(dataset)


Is this better?

    validateNumClasses(numClasses)
    val instances = extractInstances(dataset, numClasses)

huaxingao · 2020-01-06T03:54:36Z

mllib/src/test/scala/org/apache/spark/ml/classification/RandomForestClassifierSuite.scala

+      (20, 5, 1.0, 0.96),
+      (20, 10, 1.0, 0.96),
+      (20, 10, 0.95, 0.96)
+    )


Maybe also add different impurities to testParams?

Maybe also test the special case numTrees = 1?

With numTrees == 1, RF is exactly a DecisionTree, which is already tested in DecisionTreeClassifierSuite/DecisionTreeRegressorSuite.

Maybe also add different impurities to testParams?

I think the current tests may be enough; the test suites for DT/GBT do not test impurity.

The reason I suggested testing different impurities is that, when calculating the best split, the impurity calculation (both entropy and gini) is affected by sample weights. However, after looking at the DecisionTree tests, I saw that both entropy and gini are already tested with sample weights there, so this is covered and there is no need to test it here.
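
To make the point about impurity concrete, here is a minimal sketch (not Spark's implementation) of how sample weights can change gini and entropy. It assumes the impurity is computed from per-class weight totals instead of per-class row counts; the helper names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_class_totals(samples: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Sum the weights of (label, weight) pairs per class label."""
    totals: Dict[int, float] = defaultdict(float)
    for label, weight in samples:
        totals[label] += weight
    return dict(totals)


def gini(totals: Dict[int, float]) -> float:
    """Gini impurity over weighted class totals."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((w / total) ** 2 for w in totals.values())


def entropy(totals: Dict[int, float]) -> float:
    """Base-2 entropy over weighted class totals."""
    total = sum(totals.values())
    if total == 0:
        return 0.0
    return -sum((w / total) * math.log2(w / total)
                for w in totals.values() if w > 0)


# Three rows of class 0 and one row of class 1, all with weight 1.0.
unweighted = weighted_class_totals([(0, 1.0), (0, 1.0), (0, 1.0), (1, 1.0)])
# Same rows, but the single class-1 row now carries weight 3.0.
weighted = weighted_class_totals([(0, 1.0), (0, 1.0), (0, 1.0), (1, 3.0)])

gini_unweighted = gini(unweighted)   # 0.375
gini_weighted = gini(weighted)       # 0.5
entropy_weighted = entropy(weighted) # 1.0
```

With unit weights the node is dominated by class 0; up-weighting the single class-1 row makes the two classes balance, so both impurity measures reach their two-class maximum.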

SparkQA · 2020-01-06T04:27:59Z

Test build #116125 has finished for PR 27097 at commit 32ec9a6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-01-06T10:57:26Z

friendly ping @srowen @imatiach-msft

SparkQA · 2020-01-06T12:14:27Z

Test build #116160 has finished for PR 27097 at commit 14a57c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala

srowen

@huaxingao do you have thoughts? This looks reasonably straightforward, and it is a long-standing feature request.

huaxingao · 2020-01-10T18:14:52Z

@srowen The change looks fine to me. Let me take another look later today or tomorrow.

huaxingao · 2020-01-11T03:53:56Z

LGTM :)

srowen · 2020-01-13T14:27:51Z

Jenkins, retest this please

SparkQA · 2020-01-13T15:43:10Z

Test build #116644 has finished for PR 27097 at commit 14a57c8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

imatiach-msft

LGTM!

srowen · 2020-01-14T14:27:56Z

Merged to master

zhengruifeng · 2020-01-15T03:36:10Z

Thanks @srowen @imatiach-msft @huaxingao for reviewing!

init

32ec9a6

fix bagged

zhengruifeng added ML PYSPARK labels Jan 6, 2020

huaxingao reviewed Jan 6, 2020

nit

14a57c8

srowen reviewed Jan 6, 2020

srowen reviewed Jan 10, 2020

imatiach-msft approved these changes Jan 13, 2020

srowen closed this in 9320011 Jan 14, 2020

zero323 mentioned this pull request Jan 15, 2020

Sync with changes merged after 6502c66025718bf45e0e2ee12398b7b92da41a0c zero323/pyspark-stubs#315

zhengruifeng deleted the rf_support_weight branch January 15, 2020 03:34
