[SPARK-17055] [MLLIB] add groupKFold to CrossValidator #14640

VinceShieh · 2016-08-15T02:53:17Z

What changes were proposed in this pull request?

This patch improves the CrossValidator by adding a new training/validation split method -groupKFold, which splits data based on data group labels and makes sure that the same group is not in both testing and training sets.
This is necessary, for example when data is gathered from different subjects, i.e., learning person specific features. This method can create subject independent folds, so that we can train and test the model on different subjects. It will improve the generic ability of the model and avoid over-fitting for these use
cases.

How was this patch tested?

Unit test added to MLUtilsSuite.

AmplabJenkins · 2016-08-15T02:57:14Z

Can one of the admins verify this patch?

holdenk · 2016-08-15T17:15:38Z

Thanks for making this issue and PR :) The first thing before people are likely to have the bandwith to review this is we are switching all new ML development to Spark ML from MLlib so it might be good to retarget this on top of Spark ML.

amueller · 2016-08-22T22:23:42Z

Do you guys thing "label" is a good name for this? Or did you just take it from us? See the issue linked to above.

VinceShieh · 2016-08-23T01:40:50Z

if one understands the underlying ideas behind this method (labelKFold), it's easy to take it as a class/category of data, though I do think it's not that straightforward, even a bit confusing, when I saw it the first time. @amueller

VinceShieh · 2016-08-23T01:50:18Z

@holdenk thanks for your comments. :) You are right. But as you can see, this is a variant of kFold, so I think it's better to stay close to it, otherwise, it would seems confusing, dont you think?

srowen · 2016-08-23T08:58:02Z

mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala

(This can't be since 2.0.0)
Let me comment on the change itself in the JIRA again. I understand the purpose now, but this is not about labels actually. We'd have to change the name and this wouldn't go in .mllib.

holdenk · 2016-08-23T09:09:31Z

@VinceShieh there are kfold exists in Spark ML as well, and this PR could maybe go instead of trying to add it on to mllib (for which we don't plan to add new features anymore rather bug fixes only).

MechCoder · 2016-08-23T20:40:03Z

Just FYI, we plan to rename "LabelKFold" to "GroupKFold" in the next version of sklearn as a label can mean several things. (including the target label)

hqzizania · 2016-08-24T02:07:26Z

This work may be similar with SPARK-8971 which is another variation of KFold, and very significant in some cases. I suppose it is okay to add to .mllib like the latter PR, but we could add its use to CrossValidator in .ml. @sethah @MLnick @yanboliang
BTW, fortunately, it seems to be easier to implement than the kFoldStratified, as it does not need to change underlying codes, such as in rdd/PairRDDFunctions.

VinceShieh · 2016-09-01T08:08:42Z

Updates:

code refactoring. Rename the API to align with Sklearn changes
add implementation in CrossValidator

finleyb · 2016-09-12T11:00:50Z

@VinceShieh I was wondering if require in the groupKFold method of MLUtils should be a greater than or equal rather than less than or equal? I was testing this branch because I need this functionality for a ML task I am performing and I ran into the require. Thanks for implementing this!

Currently, only KFold is supported in cross validation. But in cases when data is gathered from different subjects and we want to avoid over-fitting groupKFold is more useful than kFold. groupKFold is a variation of k-fold which ensures that same group data is not in both testing and training sets. Unit test -'test groupKFold', is also added in MLUtilsSuite Signed-off-by: Vincent Xie <vincent.xie@intel.com> Signed-off-by: VinceShieh <vincent.xie@intel.com>

VinceShieh · 2016-09-13T07:06:25Z

@finleyb indeed, thank you for pointing it out. I have put it right and added a test to guard this issue. Many thanks. And feel free to let us know if you have any problem with this class or any requirement. :)

rdelassus · 2016-10-28T07:53:26Z

There is an infinite number of ways to make folds. Until now we had the mlutils kfold. You want to add the groupedKfold. But I don't think we should add one by one every folding method that can be useful, thus adding (like you did) "if my method else if this othermethod [...] else kfold". It would be far better to make the folding method independant from the crossvalidator class, and pass it as an argument for example.

VinceShieh · 2016-10-31T03:05:45Z

@rdelassus Agree. There are a number of folding methods, so some code refractoring should be done if more folding methods are to be supported in the future. But for now, I guess we will just align with what we currently have in mllib. Thanks for your comments.

srowen · 2016-10-31T09:28:44Z

OK let's close this one for now.

Closes apache#15689 Closes apache#14640 Closes apache#15917 Closes apache#16188 Closes apache#16206

VinceShieh changed the title ~~[SPARK-17055] add labelKFold to CrossValidator~~ [SPARK-17055] [MLLIB] add labelKFold to CrossValidator Aug 15, 2016

MechCoder mentioned this pull request Aug 22, 2016

LabelKFold -> GroupKFold? scikit-learn/scikit-learn#7210

Closed

srowen reviewed Aug 23, 2016
View reviewed changes

VinceShieh force-pushed the labelKFold2 branch from 461d696 to f249fd0 Compare September 1, 2016 07:46

VinceShieh changed the title ~~[SPARK-17055] [MLLIB] add labelKFold to CrossValidator~~ [SPARK-17055] [MLLIB] add groupKFold to CrossValidator Sep 12, 2016

VinceShieh force-pushed the labelKFold2 branch from f249fd0 to 21f6174 Compare September 13, 2016 07:02

HyukjinKwon mentioned this pull request Dec 8, 2016

[BUILD] Closing some stale/inappropriate PRs #16207

Closed

asfgit closed this in 330fda8 Dec 8, 2016

zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025

Close stale pull requests.

0de30bb

Closes apache#15689 Closes apache#14640 Closes apache#15917 Closes apache#16188 Closes apache#16206

[SPARK-17055] [MLLIB] add groupKFold to CrossValidator #14640

[SPARK-17055] [MLLIB] add groupKFold to CrossValidator #14640

Uh oh!

Conversation

VinceShieh commented Aug 15, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

How was this patch tested?

Uh oh!

AmplabJenkins commented Aug 15, 2016

Uh oh!

holdenk commented Aug 15, 2016

Uh oh!

amueller commented Aug 22, 2016

Uh oh!

VinceShieh commented Aug 23, 2016

Uh oh!

VinceShieh commented Aug 23, 2016

Uh oh!

srowen Aug 23, 2016

Choose a reason for hiding this comment

Uh oh!

holdenk commented Aug 23, 2016

Uh oh!

MechCoder commented Aug 23, 2016

Uh oh!

hqzizania commented Aug 24, 2016

Uh oh!

VinceShieh commented Sep 1, 2016

Uh oh!

finleyb commented Sep 12, 2016

Uh oh!

VinceShieh commented Sep 13, 2016

Uh oh!

rdelassus commented Oct 28, 2016

Uh oh!

VinceShieh commented Oct 31, 2016

Uh oh!

srowen commented Oct 31, 2016

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

9 participants

VinceShieh commented Aug 15, 2016 •

edited

Loading