[SPARK-14489][SPARK-14153][ML][PYSPARK] Support dropping NaN predicted values in RegressionEvaluator #12577
Conversation
Test build #56545 has finished for PR 12577 at commit
It sounds like people are running into this often when using cross-validation - would it make sense to also mention this in the k-fold docstring or examples? (Just a minor suggestion, not to block or anything.)
```scala
.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
.na.drop("any", if ($(dropNaN)) Seq($(predictionCol)) else Seq())
.rdd
.map { case Row(prediction: Double, label: Double) =>
```
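To make the `na.drop("any", cols)` semantics concrete, here is a pure-Python sketch (not Spark code; `drop_nan_rows` is a hypothetical helper): a row is dropped if *any* of the named columns is null or `NaN`, and passing an empty column list (the `dropNaN = false` branch) drops nothing.

```python
import math

def drop_nan_rows(rows, cols):
    """Mimic DataFrame.na.drop("any", cols): drop a row if ANY of the
    named columns is None (null) or NaN. Pure-Python sketch, not Spark."""
    def bad(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if not any(bad(r[c]) for c in cols)]

rows = [
    {"prediction": 3.0,          "label": 2.5},
    {"prediction": float("nan"), "label": 4.0},  # e.g. unseen user/item in ALS
    {"prediction": None,         "label": 1.0},  # null is dropped too (see below)
]

print(len(drop_nan_rows(rows, ["prediction"])))  # 1
print(len(drop_nan_rows(rows, [])))              # 3 (empty cols: nothing dropped)
```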
This also drops null values. I'm not sure how likely this is to happen, but the documentation should probably note that it drops both NaN and null values. Also, should we add a test case to verify that null values are ignored?
Good point, will add null to the test cases. I don't think it's likely in practice. But actually, if nulls do exist in the dataset, it's worse than NaN from a correctness point of view: either an NPE will be thrown, or the null will be treated as 0, giving 0 squared error for that datapoint while the denominator is still incremented for the mean calculation. So MSE will be biased low.
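The bias described above is easy to demonstrate with plain arithmetic (a pure-Python sketch, not Spark; the numbers are illustrative): counting a null row as zero squared error while keeping it in the denominator pulls MSE below the estimate computed over only the rows that were actually scored.

```python
def mse(errors):
    # Mean of squared errors over however many rows we were given.
    return sum(e * e for e in errors) / len(errors)

labels      = [2.0, 3.0, 5.0]
predictions = [2.1, 2.9, None]   # None: e.g. no factors for this user/item

# Null treated as zero squared error, but the row still counts in the mean:
errs_with_null = [(p - l) if p is not None else 0.0
                  for p, l in zip(predictions, labels)]
biased = mse(errs_with_null)     # sums ~0.02 over 3 rows

# Dropping the null row instead averages only over scored rows:
scored = [(p, l) for p, l in zip(predictions, labels) if p is not None]
dropped = mse([p - l for p, l in scored])  # sums ~0.02 over 2 rows

print(biased < dropped)  # True: the null row biases MSE low
```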
From a high level, one concern is that this seems to be a bit of a band-aid fix. If the only scenario where this is a problem is using ALS in cross-validation, then it would seem better to address the problem at its root. I am trying to think of other scenarios where a predictor would output NaN.
@holdenk yes, I think it makes sense to add something to the cross-validation docs to illustrate use cases.
@sethah you're correct, this is a bit of a band-aid fix. However, the real fix is getting CrossValidator to handle cases like this in a principled and generic way (and/or to change the behavior of ALS in predicting missing users). But even if we fix ALS for missing users, I think the issue will still arise for missing items. As for the "ideal" fix in CrossValidator, it seems from your JIRA comment that this will be fairly complex. So until we can fix that, users cannot use ALS in cross-validation in many cases. I've kept it an expertParam and tried to highlight that one should only use it when you know what you're doing. I think we could also deprecate this option once the "real" fix comes in CrossValidator...
@MLnick Good points. In my mind there are two scenarios here: (1) ALS is the only realistic case where a model produces NaN predictions, or (2) other algorithms can produce NaN predictions as well.
If (1) is true then, as you said, we can think about deprecating this in the future, since it may happen that we can think of no specific use case for it once (if?) ALS stops predicting NaNs on new data. If (2) is true, perhaps we should consider adding this to all evaluators? Again, I'd be interested to hear other use cases. One I thought of is a Naive Bayes classifier with no smoothing predicting on unseen words in text classification, but I wasn't able to produce a similar failure in the bit of time I spent on it. Either way, I think this is an improvement; I just wanted to be a bit more explicit on the why, and how it might affect scope.
@sethah @holdenk @jkbradley I thought about this some more. I can't realistically think of a scenario, apart from the ALS one, where handling NaNs in the evaluator is desirable. So I now think this should rather go into ALS itself - I'll call the param something like …
Opened #12896 |
As discussed in SPARK-14489, when using `ALSModel` to predict on a test set, the model returns `NaN` when the user/item is in the test set but not the training set, since the model has not computed factor(s) for that user and/or item.

This PR adds support to `RegressionEvaluator` to drop rows where the value of `predictionCol` is `NaN`. This should not be used in the general case (since a bad regression model may produce a lot of `NaN`s, and one would not want to ignore those but rather fix the underlying issue), but it allows ALS to be used in cross-validation settings even when this situation occurs (which may be quite common on larger, sparser datasets). Thus it is an `expertParam` and the default is `false`.

How was this patch tested?

New unit tests in `RegressionEvaluatorSuite` and doc string tests in `evaluation.py`.

cc @srowen @sethah @jkbradley
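For intuition, here is a minimal pure-Python sketch of the proposed behavior (`rmse` is a hypothetical stand-in, not Spark's `RegressionEvaluator` API): with the drop flag off, a single `NaN` prediction poisons the metric; with it on, that row is excluded before the metric is computed.

```python
import math

def rmse(predictions, labels, drop_nan=False):
    """Hypothetical sketch of the dropNaN behavior: when drop_nan is True,
    rows whose prediction is NaN are excluded before computing RMSE."""
    pairs = list(zip(predictions, labels))
    if drop_nan:
        pairs = [(p, l) for p, l in pairs if not math.isnan(p)]
    return math.sqrt(sum((p - l) ** 2 for p, l in pairs) / len(pairs))

preds  = [2.0, 3.0, float("nan")]   # NaN: e.g. ALS user absent from training
labels = [2.0, 3.0, 4.0]

print(rmse(preds, labels))                 # nan (default: metric is poisoned)
print(rmse(preds, labels, drop_nan=True))  # 0.0 (NaN row dropped first)
```

This mirrors why the param defaults to `false`: silently dropping rows hides genuinely bad models, so the caller must opt in.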