Skip to content

Conversation

@VinceShieh
Copy link

@VinceShieh VinceShieh commented Aug 15, 2016

What changes were proposed in this pull request?

This patch improves the CrossValidator by adding a new training/validation split method -groupKFold, which splits data based on data group labels and makes sure that the same group is not in both testing and training sets.
This is necessary, for example when data is gathered from different subjects, i.e., learning person specific features. This method can create subject independent folds, so that we can train and test the model on different subjects. It will improve the generic ability of the model and avoid over-fitting for these use
cases.

How was this patch tested?

Unit test added to MLUtilsSuite.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@VinceShieh VinceShieh changed the title [SPARK-17055] add labelKFold to CrossValidator [SPARK-17055] [MLLIB] add labelKFold to CrossValidator Aug 15, 2016
@holdenk
Copy link
Contributor

holdenk commented Aug 15, 2016

Thanks for making this issue and PR :) The first thing before people are likely to have the bandwith to review this is we are switching all new ML development to Spark ML from MLlib so it might be good to retarget this on top of Spark ML.

@amueller
Copy link

Do you guys thing "label" is a good name for this? Or did you just take it from us? See the issue linked to above.

@VinceShieh
Copy link
Author

if one understands the underlying ideas behind this method (labelKFold), it's easy to take it as a class/category of data, though I do think it's not that straightforward, even a bit confusing, when I saw it the first time. @amueller

@VinceShieh
Copy link
Author

@holdenk thanks for your comments. :) You are right. But as you can see, this is a variant of kFold, so I think it's better to stay close to it, otherwise, it would seems confusing, dont you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This can't be since 2.0.0)
Let me comment on the change itself in the JIRA again. I understand the purpose now, but this is not about labels actually. We'd have to change the name and this wouldn't go in .mllib.

@holdenk
Copy link
Contributor

holdenk commented Aug 23, 2016

@VinceShieh there are kfold exists in Spark ML as well, and this PR could maybe go instead of trying to add it on to mllib (for which we don't plan to add new features anymore rather bug fixes only).

@MechCoder
Copy link
Contributor

Just FYI, we plan to rename "LabelKFold" to "GroupKFold" in the next version of sklearn as a label can mean several things. (including the target label)

@hqzizania
Copy link
Contributor

This work may be similar with SPARK-8971 which is another variation of KFold, and very significant in some cases. I suppose it is okay to add to .mllib like the latter PR, but we could add its use to CrossValidator in .ml. @sethah @MLnick @yanboliang
BTW, fortunately, it seems to be easier to implement than the kFoldStratified, as it does not need to change underlying codes, such as in rdd/PairRDDFunctions.

@VinceShieh
Copy link
Author

Updates:

  1. code refactoring. Rename the API to align with Sklearn changes
  2. add implementation in CrossValidator

@VinceShieh VinceShieh changed the title [SPARK-17055] [MLLIB] add labelKFold to CrossValidator [SPARK-17055] [MLLIB] add groupKFold to CrossValidator Sep 12, 2016
@finleyb
Copy link

finleyb commented Sep 12, 2016

@VinceShieh I was wondering if require in the groupKFold method of MLUtils should be a greater than or equal rather than less than or equal? I was testing this branch because I need this functionality for a ML task I am performing and I ran into the require. Thanks for implementing this!

Currently, only KFold is supported in cross validation. But in cases
when data is gathered from different subjects and we want to avoid
over-fitting groupKFold is more useful than kFold.

groupKFold is a variation of k-fold which ensures that same group data
is not in both testing and training sets.

Unit test -'test groupKFold', is also added in MLUtilsSuite

Signed-off-by: Vincent Xie <vincent.xie@intel.com>
Signed-off-by: VinceShieh <vincent.xie@intel.com>
@VinceShieh
Copy link
Author

@finleyb indeed, thank you for pointing it out. I have put it right and added a test to guard this issue. Many thanks. And feel free to let us know if you have any problem with this class or any requirement. :)

@rdelassus
Copy link

There is an infinite number of ways to make folds. Until now we had the mlutils kfold. You want to add the groupedKfold. But I don't think we should add one by one every folding method that can be useful, thus adding (like you did) "if my method else if this othermethod [...] else kfold". It would be far better to make the folding method independant from the crossvalidator class, and pass it as an argument for example.

@VinceShieh
Copy link
Author

@rdelassus Agree. There are a number of folding methods, so some code refractoring should be done if more folding methods are to be supported in the future. But for now, I guess we will just align with what we currently have in mllib. Thanks for your comments.

@srowen
Copy link
Member

srowen commented Oct 31, 2016

OK let's close this one for now.

@asfgit asfgit closed this in 330fda8 Dec 8, 2016
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants