[SPARK-14489][SPARK-14153][ML][PYSPARK] Support dropping NaN predicted values in RegressionEvaluator #12577
Conversation
Test build #56545 has finished for PR 12577 at commit
It sounds like people are running into this often when using cross-validation - would it make sense to also mention this in the k-fold docstring or examples? (Just a minor suggestion, not to block or anything.)
```scala
.select(col($(predictionCol)).cast(DoubleType), col($(labelCol)).cast(DoubleType))
.na.drop("any", if ($(dropNaN)) Seq($(predictionCol)) else Seq())
.rdd
.map { case Row(prediction: Double, label: Double) =>
```
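To make the `na.drop("any", cols)` semantics concrete, here is a pure-Python sketch (not Spark code; `drop_nan_rows` is a hypothetical helper): a row is dropped if *any* of the named columns is null or `NaN`, and passing an empty column list (the `dropNaN = false` branch) drops nothing.

```python
import math

def drop_nan_rows(rows, cols):
    """Mimic DataFrame.na.drop("any", cols): drop a row if ANY of the
    named columns is None (null) or NaN. Pure-Python sketch, not Spark."""
    def bad(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [r for r in rows if not any(bad(r[c]) for c in cols)]

rows = [
    {"prediction": 3.0,          "label": 2.5},
    {"prediction": float("nan"), "label": 4.0},  # e.g. unseen user/item in ALS
    {"prediction": None,         "label": 1.0},  # null is dropped too (see below)
]

print(len(drop_nan_rows(rows, ["prediction"])))  # 1
print(len(drop_nan_rows(rows, [])))              # 3 (empty cols: nothing dropped)
```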
This also drops null values. I'm not sure how likely this is to happen, but the documentation should probably note that it drops both NaN and null values. Also, should we add a test case to verify that null values are ignored?
Good point, will add null to the test cases. I don't think it's likely in practice. But actually, if nulls do exist in the dataset, it's worse than NaN from a correctness point of view: either an NPE will be thrown, or the null will be treated as 0, giving 0 squared error for that datapoint while the denominator is still incremented for the mean calculation. So MSE will be biased low.
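The bias described above is easy to demonstrate with plain arithmetic (a pure-Python sketch, not Spark; the numbers are illustrative): counting a null row as zero squared error while keeping it in the denominator pulls MSE below the estimate computed over only the rows that were actually scored.

```python
def mse(errors):
    # Mean of squared errors over however many rows we were given.
    return sum(e * e for e in errors) / len(errors)

labels      = [2.0, 3.0, 5.0]
predictions = [2.1, 2.9, None]   # None: e.g. no factors for this user/item

# Null treated as zero squared error, but the row still counts in the mean:
errs_with_null = [(p - l) if p is not None else 0.0
                  for p, l in zip(predictions, labels)]
biased = mse(errs_with_null)     # sums ~0.02 over 3 rows

# Dropping the null row instead averages only over scored rows:
scored = [(p, l) for p, l in zip(predictions, labels) if p is not None]
dropped = mse([p - l for p, l in scored])  # sums ~0.02 over 2 rows

print(biased < dropped)  # True: the null row biases MSE low
```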
From a high level, one concern is that this seems to be a bit of a band-aid fix. If the only scenario where this is a problem is using ALS in cross-validation, then it would seem better to address the problem at its root. I am trying to think of other scenarios where a predictor would output NaN.
@holdenk yes, I think it makes sense to add something to the cross-validation docs to illustrate use cases.
@sethah you're correct, this is a bit of a band-aid fix. However, the real fix is getting CrossValidator to handle cases like this in a principled and generic way (and/or to change the behavior of ALS in predicting missing users). But even if we fix ALS for missing users, I think the issue will still arise for missing items. As for the "ideal" fix in CrossValidator, it seems from your JIRA comment that this will be fairly complex. So until we can fix that, users cannot use ALS in cross-validation in many cases. I've kept it an expertParam and tried to highlight that one should only use it when you know what you're doing. I think we could also deprecate this option once the "real" fix comes in CrossValidator...
@MLnick Good points. In my mind there are two scenarios here: (1) ALS is the only realistic case where a model produces NaN predictions, or (2) other algorithms can produce NaN predictions as well.
If (1) is true then, as you said, we can think about deprecating this in the future, since it may happen that we can think of no specific use case for it once (if?) ALS stops predicting NaNs on new data. If (2) is true, perhaps we should consider adding this to all evaluators? Again, I'd be interested to hear other use cases. One I thought of is a Naive Bayes classifier with no smoothing predicting on unseen words in text classification, but I wasn't able to produce a similar failure in the bit of time I spent on it. Either way, I think this is an improvement; I just wanted to be a bit more explicit on the why, and how it might affect scope.
@sethah @holdenk @jkbradley I thought about this some more. I can't realistically think of a scenario, apart from the ALS one, where handling NaNs in the evaluator is desirable. So I now think this should rather go into ALS itself - I'll call the param something like …
Opened #12896 |
As discussed in SPARK-14489, when using `ALSModel` to predict on a test set, the model returns `NaN` when the user/item is in the test set but not the training set, since the model has not computed factor(s) for that user and/or item.

This PR adds support to `RegressionEvaluator` to drop rows where the value of `predictionCol` is `NaN`. This should not be used in the general case (since a bad regression model may produce a lot of `NaN`s, and one would not want to ignore those but rather fix the underlying issue), but it allows ALS to be used in cross-validation settings even when this situation occurs (which may be quite common on larger, sparser datasets). Thus it is an `expertParam` and the default is `false`.

How was this patch tested?

New unit tests in `RegressionEvaluatorSuite` and doc string tests in `evaluation.py`.

cc @srowen @sethah @jkbradley
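For intuition, here is a minimal pure-Python sketch of the proposed behavior (`rmse` is a hypothetical stand-in, not Spark's `RegressionEvaluator` API): with the drop flag off, a single `NaN` prediction poisons the metric; with it on, that row is excluded before the metric is computed.

```python
import math

def rmse(predictions, labels, drop_nan=False):
    """Hypothetical sketch of the dropNaN behavior: when drop_nan is True,
    rows whose prediction is NaN are excluded before computing RMSE."""
    pairs = list(zip(predictions, labels))
    if drop_nan:
        pairs = [(p, l) for p, l in pairs if not math.isnan(p)]
    return math.sqrt(sum((p - l) ** 2 for p, l in pairs) / len(pairs))

preds  = [2.0, 3.0, float("nan")]   # NaN: e.g. ALS user absent from training
labels = [2.0, 3.0, 4.0]

print(rmse(preds, labels))                 # nan (default: metric is poisoned)
print(rmse(preds, labels, drop_nan=True))  # 0.0 (NaN row dropped first)
```

This mirrors why the param defaults to `false`: silently dropping rows hides genuinely bad models, so the caller must opt in.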