@zhengruifeng
Contributor

@zhengruifeng zhengruifeng commented Sep 16, 2019

What changes were proposed in this pull request?

Common methods now support extracting instance weights.

Why are the changes needed?

More and more ML algorithms support sample weighting; adding this method will simplify their implementations.

Does this PR introduce any user-facing change?

no

How was this patch tested?

Existing test suites.

}

/**
* Extract [[labelCol]], weightCol(if any) and [[featuresCol]] from the given dataset,
Contributor Author

I placed it in PredictorParams so that methods like GBTModel.evaluateEachIteration can reuse it in the future.

@SparkQA

SparkQA commented Sep 16, 2019

Test build #110632 has finished for PR 25802 at commit fd33d3d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110717 has finished for PR 25802 at commit fd33d3d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110721 has finished for PR 25802 at commit 35a2bed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110743 has finished for PR 25802 at commit b36e7ea.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

retest this please

@SparkQA

SparkQA commented Sep 17, 2019

Test build #110748 has finished for PR 25802 at commit b36e7ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

friendly ping @srowen
Now that more and more algorithms support sample weighting, extractLabeledPoints is rarely used. We may need to add this method as an alternative to extractLabeledPoints.
Once RF and GBT support weighting, they can reuse it.

Member

@srowen srowen left a comment

As this is mostly refactoring, it seems OK. One question below.

  } else {
    lit(1.0)
  }
case _ => lit(1.0)
Member

If it doesn't have a weight column, is there any point in selecting lit(1.0) as a weight column, given that it will be unused? Or do some algorithms lack a weight column but nevertheless have ways of using a weight?

Contributor Author

You are right: if an algorithm does not have weightCol, it should not deal with weighting.
So, what about raising an exception instead of assigning 1.0?

Member

I think an error will occur elsewhere? Is it necessary to handle it here, versus just not making an empty column?

Contributor Author

Since this method will only be called internally, I think it is up to the developers to decide whether to use it. If an algorithm (like GBT) does not support weighting yet, it can use the existing extractLabeledPoints instead.

Member

OK, I suppose I'm just concerned about the small overhead of adding an unused column.
You're saying that it's up to implementations to call the method they need, with or without weights? Yeah, I agree. And they will call the right method in this change? If so, do you even need this check? It will already fail (correctly) if the code is calling the wrong method.

Contributor Author

OK, I will remove this line. Implementations that do not support weighting should fail if they call this method.
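
For readers following the thread, here is a minimal sketch of the agreed behaviour in plain Scala, without Spark. It assumes the weight lookup is a pattern match on the estimator's type, as the `case _ => lit(1.0)` fragment suggests. `Params`, `HasWeightCol`, `WeightedAlg`, `UnweightedAlg` and `weightColumnName` are simplified, hypothetical stand-ins here, not the Spark classes.

```scala
// Hypothetical, simplified stand-ins; these are NOT the Spark traits.
trait Params
trait HasWeightCol extends Params { def weightCol: Option[String] }

final class WeightedAlg(val weightCol: Option[String]) extends HasWeightCol
final class UnweightedAlg extends Params

object WeightSupport {
  // There is deliberately no `case _` fallback: calling this with a Params
  // that does not mix in HasWeightCol throws scala.MatchError instead of
  // silently producing an unused constant weight.
  def weightColumnName(p: Params): Option[String] = p match {
    case w: HasWeightCol => w.weightCol.filter(_.nonEmpty)
  }
}
```

With the fallback branch removed, a caller that picks the wrong method fails at the match itself, which is the "it will already fail" behaviour discussed above.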

@SparkQA

SparkQA commented Sep 23, 2019

Test build #111212 has finished for PR 25802 at commit e3991e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (isDefined(p.weightCol) && $(p.weightCol).nonEmpty) {
  col($(p.weightCol)).cast(DoubleType)
} else {
  lit(1.0)
Member

Here too, do you need a weight column if the implementation doesn't support it (and shouldn't be calling this method)? Or is this case different?

Contributor Author

This case is different from the one above. Even if an ML implementation supports weighting, its weightCol does not have to be set; in that case, lit(1.0) is used implicitly. Currently, all algorithms that support weighting handle weightCol this way.
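
The defaulting rule described here can be sketched in plain Scala, independent of Spark. This is only an illustration under simplifying assumptions: a row is modelled as a `Map[String, Double]`, and `WeightDefault` and `resolveWeight` are hypothetical names that do not appear in the patch.

```scala
object WeightDefault {
  // Mirrors the isDefined/nonEmpty check from the diff: when weightCol is
  // unset or empty, every instance gets the neutral weight 1.0; otherwise
  // the weight is read from the named column.
  // Assumption: a row is modelled as a plain Map, not a Spark Row.
  def resolveWeight(weightCol: Option[String], row: Map[String, Double]): Double =
    weightCol.filter(_.nonEmpty) match {
      case Some(name) => row(name) // throws NoSuchElementException if the column is absent
      case None       => 1.0
    }
}
```

For example, `WeightDefault.resolveWeight(None, Map("w" -> 3.0))` yields the default of 1.0, while passing `Some("w")` reads the value from the row.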

@srowen
Member

srowen commented Sep 24, 2019

Merged to master

@srowen srowen closed this in fff2e84 Sep 24, 2019
@zhengruifeng zhengruifeng deleted the add_extractInstances branch September 25, 2019 02:31