
Conversation

@MLnick
Contributor

@MLnick MLnick commented Apr 21, 2016

As discussed in SPARK-14489, when using ALSModel to predict on a test set, the model returns NaN when the user/item is in the test set but not the training set, since the model has not computed factor(s) for that user and/or item.

This PR adds support to RegressionEvaluator to drop rows where the value of predictionCol is NaN. This should not be used in the general case (since a bad regression model may produce a lot of NaNs and one would not want to ignore those but rather fix the underlying issue), but allows ALS to be used in cross-validation settings even when this situation occurs (which may be quite common on larger, sparser datasets). Thus it is an expertParam and the default is false.
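The intended effect of such a drop-NaN flag can be sketched in plain Python (hypothetical helper name, not the actual Spark API): a single NaN prediction otherwise poisons the whole metric.

```python
import math

def rmse(pairs, drop_nan=False):
    """Root mean squared error over (prediction, label) pairs.

    With drop_nan=True, rows whose prediction is NaN (e.g. ALS
    cold-start users/items) are excluded before scoring; otherwise
    one NaN makes the whole metric NaN.
    """
    if drop_nan:
        pairs = [(p, l) for p, l in pairs if not math.isnan(p)]
    if not pairs:
        return float("nan")
    return math.sqrt(sum((p - l) ** 2 for p, l in pairs) / len(pairs))

scored = [(2.5, 3.0), (float("nan"), 4.0), (1.0, 1.0)]
print(rmse(scored))                 # nan: default behaviour
print(rmse(scored, drop_nan=True))  # metric over the two valid rows only
```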

How was this patch tested?

New unit tests in RegressionEvaluatorSuite and doc string tests in evaluation.py.

cc @srowen @sethah @jkbradley

@SparkQA

SparkQA commented Apr 21, 2016

Test build #56545 has finished for PR 12577 at commit bfe510c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Apr 21, 2016

It sounds like people are running into this often when using cross-validation - would it make sense to also mention this in the k-fold docstring or examples? (Just a minor suggestion not to block or anything).

.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
.na.drop("any", if ($(dropNaN)) Seq($(predictionCol)) else Seq())
.rdd
.map { case Row(prediction: Double, label: Double) =>
Contributor

@sethah sethah Apr 21, 2016


This also drops null values. I'm not sure how likely this is to happen, but the documentation should probably note it drops NaN and null values. Also, should we add a test case to verify that null values are ignored?

Contributor Author


Good point, will add null to the test cases. I don't think it's likely in practice. But actually, if nulls do exist in the dataset, it's worse than NaN from a correctness point of view: either an NPE will be thrown, or the null will be treated as 0 squared error for that data point while the denominator still counts it in the mean. So MSE will be biased low.
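The bias described here can be illustrated with toy numbers (plain Python, not actual Spark behaviour):

```python
# Squared errors from two valid (prediction, label) rows, e.g. 0.5**2 and 1.0**2
errors = [0.25, 1.0]
mse_valid = sum(errors) / len(errors)                  # 0.625

# A null prediction silently contributes squared error 0, while the row
# still counts toward the denominator of the mean:
mse_biased = sum(errors + [0.0]) / (len(errors) + 1)   # ~0.4167

assert mse_biased < mse_valid   # the metric is biased low
```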

@sethah
Contributor

sethah commented Apr 21, 2016

From a high level, one concern is that this seems to be a bit of a band-aid fix. If the only scenario where this is a problem is using ALS in cross-validation, then it would seem better to address the problem at its root. I am trying to think of other scenarios where a predictor would output NaN and whether this behavior is desirable in that case. In the JIRA, @jkbradley mentioned linear models with too little regularization. Are there any other scenarios that come to mind? I'll give it some thought.

@MLnick
Contributor Author

MLnick commented Apr 22, 2016

@holdenk yes, I think it makes sense to add something to the cross-validation docs to illustrate use cases.

@MLnick
Contributor Author

MLnick commented Apr 22, 2016

@sethah you're correct, this is a bit of a band-aid fix. However, the real fix is getting CrossValidator to handle cases like this in a principled and generic way (and/or changing the behavior of ALS when predicting for missing users). But even with ALS fixed for missing users, I think the issue will still arise for missing items.

As for the "ideal" fix in CrossValidator, it seems from your JIRA comment that this will be fairly complex. So until we can fix that, users cannot use ALS in cross-validation in many cases.

I've kept it an expertParam and tried to highlight that it should only be used when you know what you're doing. I think we could also deprecate this option once the "real" fix lands in CrossValidator...

@srowen @jkbradley ?

@sethah
Contributor

sethah commented Apr 22, 2016

@MLnick Good points. In my mind there are two scenarios here:

  1. This is a quick fix for users wanting to do cross validation with ALS/recommenders in Spark, until we can fully address the issue in the ALS algorithm.
  2. This is an improvement to Evaluators in general, giving users flexibility to ignore NaNs in whatever scenario produces them.

If (1) is true then, as you said, we can think about deprecating this in the future, since there may be no remaining use case for it once (if?) ALS stops predicting NaNs on new data. If (2) is true, perhaps we should consider adding this to all evaluators? Again, I'd be interested to hear other use cases. One I thought of is a Naive Bayes classifier with no smoothing predicting on unseen words in text classification, but I wasn't able to produce a similar failure in the time I spent on it. Either way, I think this is an improvement; I just wanted to be more explicit about the why and how it might affect scope.

@MLnick
Contributor Author

MLnick commented May 3, 2016

@sethah @holdenk @jkbradley I thought about this some more. I can't realistically think of a scenario apart from the ALS one where handling NaNs in the evaluator is desirable.

So actually I think this should rather go into ALS itself - I'll call the param something like unknownUserItemStrategy. The default can be to return NaN as it does currently. We can add an option (perhaps called "skip" or "filter") that filters out NaNs in the prediction DF, which would allow ALS to be used in cross-validation. This also makes it extensible for future additions such as using the "average user" factor, or whatever other strategy.
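A sketch of what that ALS-side strategy could look like, in plain Python (the function and parameter names here are illustrative, not Spark's actual API):

```python
import math

def als_transform(pairs, user_factors, item_factors, strategy="nan"):
    """Score (user, item) pairs against learned latent factors.

    strategy="nan"  - unseen users/items yield a NaN prediction
                      (the current ALS behaviour)
    strategy="drop" - rows involving unseen users/items are filtered out,
                      so a downstream evaluator in cross-validation only
                      sees rows the model can actually score
    """
    scored = []
    for user, item in pairs:
        u, i = user_factors.get(user), item_factors.get(item)
        if u is None or i is None:
            pred = float("nan")                        # cold start: no factors learned
        else:
            pred = sum(a * b for a, b in zip(u, i))    # dot product of factor vectors
        scored.append((user, item, pred))
    if strategy == "drop":
        scored = [row for row in scored if not math.isnan(row[2])]
    return scored

users = {1: [1.0, 2.0]}
items = {10: [0.5, 0.5]}
print(als_transform([(1, 10), (2, 10)], users, items))                   # second row is NaN
print(als_transform([(1, 10), (2, 10)], users, items, strategy="drop"))  # [(1, 10, 1.5)]
```

Keeping the choice on the estimator side, rather than the evaluator, means every downstream consumer of the prediction DF benefits, not just RegressionEvaluator.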

@MLnick MLnick closed this May 4, 2016
@MLnick MLnick deleted the SPARK-14489-dropnan-evaluator branch May 4, 2016 06:53
@MLnick
Contributor Author

MLnick commented May 4, 2016

Opened #12896
