-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29095][ML] add extractInstances #25802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
|
|
||
| /** | ||
| * Extract [[labelCol]], weightCol(if any) and [[featuresCol]] from the given dataset, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I place it in PredictorParam so that methods like GBTModel.evaluateEachIteration can reuse it in the future.
|
Test build #110632 has finished for PR 25802 at commit
|
|
retest this please |
|
Test build #110717 has finished for PR 25802 at commit
|
|
Test build #110721 has finished for PR 25802 at commit
|
|
Test build #110743 has finished for PR 25802 at commit
|
|
retest this please |
|
Test build #110748 has finished for PR 25802 at commit
|
|
friendly ping @srowen |
srowen
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mostly refactoring, seems OK. One question below.
| } else { | ||
| lit(1.0) | ||
| } | ||
| case _ => lit(1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it doesn't have a weight column, does it mean there's no point in selecting lit(1.0) as a weight column as it will be unused? or do some algorithms not have a weight column but nevertheless have ways of using a weight?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, if an alg do not have weightCol, it should not deal with weighting.
So, what about raising an exception instead of assign it to 1.0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an error will occur elsewhere? is it necessary to handle it here vs just not making an empty col?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this method will only be called internally, so I think it is update to the developers to decide whether to use it or not. If an algorithm (like GBT) do not support weighting now, it can use existing extractLabeledPoints instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I suppose I'm just concerned about the small overhead of adding an unused column.
You're saying that it's up to implementations to call the method they need, one with weights or not? yeah I agree, and they will call the right method in this change? If true, then do you even need this check? it will already fail (correctly) if the code is calling the wrong method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I will remove this line. Impls that do not support weighting call this method should fail.
|
Test build #111212 has finished for PR 25802 at commit
|
| if (isDefined(p.weightCol) && $(p.weightCol).nonEmpty) { | ||
| col($(p.weightCol)).cast(DoubleType) | ||
| } else { | ||
| lit(1.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too do you need a weight col, if the implementation doesn't support it (and shouldn't be calling this method)? or is it different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is different from the above place. Even if a ML impl supports weighting, its weightCol is not necessary to be set, in this case, lit(1) is used implictly. Current all algs supporting weighting deal with weightCol in this way.
|
Merged to master |
What changes were proposed in this pull request?
common methods support extract weights
Why are the changes needed?
today more and more ML algs support weighting, add this method will make impls simple
Does this PR introduce any user-facing change?
no
How was this patch tested?
existing testsuites