[SPARK-29095][ML] add extractInstances #25802

zhengruifeng · 2019-09-16T09:12:51Z

What changes were proposed in this pull request?

Add a common method, extractInstances, that also extracts weights.

Why are the changes needed?

Today more and more ML algs support weighting; adding this method will make their impls simpler.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Existing test suites.

zhengruifeng · 2019-09-16T09:14:42Z

mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

  }
+
+  /**
+   * Extract [[labelCol]], weightCol(if any) and [[featuresCol]] from the given dataset,


I placed it in PredictorParams so that methods like GBTModel.evaluateEachIteration can reuse it in the future.

SparkQA · 2019-09-16T20:20:26Z

Test build #110632 has finished for PR 25802 at commit fd33d3d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-17T02:27:29Z

retest this please

SparkQA · 2019-09-17T03:33:47Z

Test build #110717 has finished for PR 25802 at commit fd33d3d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-17T04:17:02Z

Test build #110721 has finished for PR 25802 at commit 35a2bed.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-09-17T07:05:02Z

Test build #110743 has finished for PR 25802 at commit b36e7ea.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-17T07:15:51Z

retest this please

SparkQA · 2019-09-17T08:26:43Z

Test build #110748 has finished for PR 25802 at commit b36e7ea.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2019-09-18T08:03:57Z

Friendly ping @srowen
Now that more and more algs support sample weighting, extractLabeledPoints is rarely used. We may need to add this method as an alternative to extractLabeledPoints.
Once RF & GBT support weighting, it can be reused in them.

srowen

As mostly refactoring, seems OK. One question below.

srowen · 2019-09-18T14:17:55Z

mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

+        } else {
+          lit(1.0)
+        }
+      case _ => lit(1.0)


If it doesn't have a weight column, does it mean there's no point in selecting lit(1.0) as a weight column as it will be unused? or do some algorithms not have a weight column but nevertheless have ways of using a weight?

You are right: if an alg does not have weightCol, it should not deal with weighting.
So, what about raising an exception instead of assigning 1.0?

I think an error will occur elsewhere? is it necessary to handle it here vs just not making an empty col?

Since this method will only be called internally, I think it is up to the developers to decide whether to use it or not. If an algorithm (like GBT) does not support weighting now, it can use the existing extractLabeledPoints instead.

OK, I suppose I'm just concerned about the small overhead of adding an unused column.
You're saying that it's up to implementations to call the method they need, one with weights or not? yeah I agree, and they will call the right method in this change? If true, then do you even need this check? it will already fail (correctly) if the code is calling the wrong method.

OK, I will remove this line. Impls that do not support weighting should fail if they call this method.
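The outcome of this thread, a type match with no fallback case that fails when the caller has no weight column, can be sketched in plain Scala. The traits below are simplified stand-ins for illustration only (they are not Spark's real Params or HasWeightCol), and weightExpr is a hypothetical helper that returns a string describing the weight expression instead of a Column.

```scala
// Simplified stand-ins for illustration only; not Spark's real traits.
trait Params
trait HasWeightCol extends Params {
  def weightCol: Option[String]
}

object WeightDispatch {
  // Hypothetical helper: describes the weight expression as a string.
  // There is deliberately no `case _` branch, so a caller that does not
  // mix in HasWeightCol fails with scala.MatchError.
  def weightExpr(p: Params): String = p match {
    case w: HasWeightCol =>
      w.weightCol.filter(_.nonEmpty).getOrElse("lit(1.0)")
  }
}

// Three hypothetical callers: weighted, weighted but unset, and unweighted.
object Weighted extends HasWeightCol { val weightCol: Option[String] = Some("w") }
object WeightedButUnset extends HasWeightCol { val weightCol: Option[String] = None }
object Unweighted extends Params
```

In this sketch, WeightDispatch.weightExpr(Unweighted) throws scala.MatchError, which mirrors the intent that impls without weighting keep calling extractLabeledPoints.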

SparkQA · 2019-09-23T11:19:58Z

Test build #111212 has finished for PR 25802 at commit e3991e7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-09-23T13:24:40Z

mllib/src/main/scala/org/apache/spark/ml/Predictor.scala

+        if (isDefined(p.weightCol) && $(p.weightCol).nonEmpty) {
+          col($(p.weightCol)).cast(DoubleType)
+        } else {
+          lit(1.0)


Here too do you need a weight col, if the implementation doesn't support it (and shouldn't be calling this method)? or is it different?

It is different from the place above. Even if an ML impl supports weighting, its weightCol does not have to be set; in that case, lit(1.0) is used implicitly. Currently all algs that support weighting deal with weightCol in this way.
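The default described here, where an unset or empty weightCol means every instance gets weight 1.0, can be modeled without Spark. This is a minimal sketch under stated assumptions: Record and the extractInstances signature below are hypothetical and only imitate the select-then-map shape of the real method, and Instance is a local case class rather than Spark's own.

```scala
// Hypothetical in-memory stand-ins; not Spark's Row, Dataset, or Instance.
final case class Record(columns: Map[String, Double], features: Vector[Double])
final case class Instance(label: Double, weight: Double, features: Vector[Double])

object ExtractSketch {
  def extractInstances(
      records: Seq[Record],
      labelCol: String,
      weightCol: Option[String]): Seq[Instance] = {
    // An unset (None) or empty weightCol falls back to a constant weight of 1.0.
    val weightOf: Record => Double = weightCol.filter(_.nonEmpty) match {
      case Some(name) => (r: Record) => r.columns(name)
      case None       => (_: Record) => 1.0
    }
    records.map(r => Instance(r.columns(labelCol), weightOf(r), r.features))
  }
}

val records = Seq(Record(Map("label" -> 1.0, "w" -> 2.5), Vector(0.1, 0.2)))
val weighted = ExtractSketch.extractInstances(records, "label", Some("w"))
val unset = ExtractSketch.extractInstances(records, "label", None)
val empty = ExtractSketch.extractInstances(records, "label", Some(""))
```

Here weighted carries the value of the "w" column, while unset and empty both carry the implicit weight 1.0.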

srowen · 2019-09-24T14:24:17Z

Merged to master

create pr

fd33d3d

zhengruifeng commented Sep 16, 2019

zhengruifeng added the ML label Sep 16, 2019

cast labelCol

35a2bed

update nb

b36e7ea

srowen reviewed Sep 18, 2019

del unused lit(1.0) for algs do not support weighting

e3991e7

srowen reviewed Sep 23, 2019

srowen closed this in fff2e84 Sep 24, 2019

zhengruifeng deleted the add_extractInstances branch September 25, 2019 02:31
