[SPARK-17055] [MLLIB] add groupKFold to CrossValidator #14640
Conversation
|
Can one of the admins verify this patch? |
|
Thanks for making this issue and PR :) Before people are likely to have the bandwidth to review this, note that we are switching all new ML development from MLlib to Spark ML, so it might be good to retarget this on top of Spark ML. |
|
Do you guys think "label" is a good name for this? Or did you just take it from us? See the issue linked above. |
|
If one understands the underlying idea behind this method (labelKFold), it's easy to read "label" as a class/category of data, though I do think it's not that straightforward. I even found it a bit confusing when I saw it the first time. @amueller |
|
@holdenk thanks for your comments. :) You are right. But as you can see, this is a variant of kFold, so I think it's better to keep it close to kFold; otherwise it would seem confusing, don't you think? |
(This can't be since 2.0.0)
Let me comment on the change itself in the JIRA again. I understand the purpose now, but this is not about labels actually. We'd have to change the name and this wouldn't go in .mllib.
|
@VinceShieh kFold exists in Spark ML as well, so this PR could maybe go there instead of being added to mllib (where we no longer plan to add new features, only bug fixes). |
|
Just FYI, we plan to rename "LabelKFold" to "GroupKFold" in the next version of sklearn as a label can mean several things. (including the target label) |
|
This work may be similar to SPARK-8971, which is another variation of KFold and very significant in some cases. I suppose it is okay to add it to .mllib like the latter PR, but we could add its use to CrossValidator in .ml. @sethah @MLnick @yanboliang |
461d696 to f249fd0
|
Updates:
|
|
@VinceShieh I was wondering whether the require in the groupKFold method of MLUtils should use greater-than-or-equal rather than less-than-or-equal? I was testing this branch because I need this functionality for an ML task, and I ran into that require. Thanks for implementing this! |
Currently, only kFold is supported in cross validation. But when data is gathered from different subjects and we want to avoid over-fitting, groupKFold is more useful than kFold. groupKFold is a variation of k-fold that ensures data from the same group is not in both the testing and training sets. A unit test, 'test groupKFold', is also added in MLUtilsSuite. Signed-off-by: Vincent Xie <vincent.xie@intel.com> Signed-off-by: VinceShieh <vincent.xie@intel.com>
f249fd0 to 21f6174
|
@finleyb indeed, thank you for pointing it out. I have put it right and added a test to guard this issue. Many thanks. And feel free to let us know if you have any problem with this class or any requirement. :) |
|
There is an infinite number of ways to make folds. Until now we had the MLUtils kFold. You want to add groupKFold. But I don't think we should add every useful folding method one by one, ending up (like you did) with "if my method else if this other method [...] else kfold". It would be far better to make the folding method independent of the CrossValidator class and pass it in as an argument, for example. |
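
The suggestion above can be illustrated with a minimal sketch in plain Python rather than Spark. All names here (`plain_k_fold`, `cross_validate`, `fit`, `score`) are hypothetical and not part of Spark's API; the point is only that the validator accepts any callable that turns rows into (train, test) pairs, so a group-aware splitter could be passed in without touching the validator.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Fold = Tuple[list, list]                    # (train rows, test rows)
Splitter = Callable[[Sequence], List[Fold]]


def plain_k_fold(k: int) -> Splitter:
    """Hypothetical splitter: row i goes to test fold i % k."""
    def split(rows: Sequence) -> List[Fold]:
        return [
            ([r for i, r in enumerate(rows) if i % k != f],
             [r for i, r in enumerate(rows) if i % k == f])
            for f in range(k)
        ]
    return split


def cross_validate(rows: Sequence, splitter: Splitter,
                   fit: Callable, score: Callable) -> float:
    """Average the score over whatever folds the splitter produces.

    The validator has no branch per folding method; the method is an argument.
    """
    return mean(score(fit(train), test) for train, test in splitter(rows))


# Toy "model": predict the training mean, score by mean absolute error.
fit = lambda train: mean(train)
score = lambda model, test: mean(abs(x - model) for x in test)

result = cross_validate([1.0, 2.0, 3.0, 4.0], plain_k_fold(2), fit, score)
```

Any other folding scheme only needs to match the `Splitter` signature, which is the design trade-off under discussion: one extra parameter in exchange for no method-specific branches in the validator.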
|
@rdelassus Agreed. There are a number of folding methods, so some code refactoring should be done if more folding methods are to be supported in the future. But for now, I guess we will just align with what we currently have in mllib. Thanks for your comments. |
|
OK let's close this one for now. |
Closes apache#15689 Closes apache#14640 Closes apache#15917 Closes apache#16188 Closes apache#16206
What changes were proposed in this pull request?
This patch improves the CrossValidator by adding a new training/validation split method, groupKFold, which splits data based on data group labels and makes sure that the same group is not in both the testing and training sets.
This is necessary, for example, when data is gathered from different subjects and the model could otherwise learn person-specific features. This method creates subject-independent folds, so that we can train and test the model on different subjects. This improves the model's ability to generalize and avoids over-fitting for these use cases.
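
As a rough illustration of the idea described above (not the patch's Scala implementation), here is a minimal group k-fold sketch in plain Python. The function name `group_k_fold`, its parameters, the round-robin assignment of groups to folds, and the sample rows are all assumptions made for this example. It also checks that there are at least as many distinct groups as folds, which is the direction of the condition raised in the review thread.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def group_k_fold(rows: Sequence, k: int,
                 group_of: Callable[[object], Hashable]) -> List[Tuple[list, list]]:
    """Split rows into k (train, test) pairs so no group spans both sides."""
    if k < 2:
        raise ValueError("k must be at least 2")
    # Distinct groups in first-seen order (dict keys preserve insertion order).
    groups = list(dict.fromkeys(group_of(r) for r in rows))
    if len(groups) < k:
        raise ValueError("number of groups must be >= number of folds")
    # Assumption for this sketch: assign whole groups to folds round-robin.
    fold_of = {g: i % k for i, g in enumerate(groups)}
    folds = []
    for fold in range(k):
        train = [r for r in rows if fold_of[group_of(r)] != fold]
        test = [r for r in rows if fold_of[group_of(r)] == fold]
        folds.append((train, test))
    return folds


# Hypothetical data: (subject, measurement) pairs from four subjects.
rows = [("alice", 1), ("alice", 2), ("bob", 3),
        ("carol", 4), ("carol", 5), ("dave", 6)]
folds = group_k_fold(rows, k=2, group_of=lambda r: r[0])
```

With this data, every row for a given subject lands entirely in either the training or the testing side of each fold. Fold sizes can be uneven, because whole groups are moved together.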
How was this patch tested?
Unit test added to MLUtilsSuite.