From baba319fae615ffc1ebfe564f9ac520e701fdf20 Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 28 Feb 2017 16:34:40 +0200
Subject: [PATCH 1/4] Initial cold start param doc for user guide

---
 docs/ml-collaborative-filtering.md | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index cfe835172ab4..ee0ff3a11012 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -59,6 +59,31 @@ This approach is named "ALS-WR" and discussed in the paper
 It makes `regParam` less dependent on the scale of the dataset, so we can apply the
 best parameter learned from a sampled subset to the full dataset and expect similar performance.
 
+### Cold-start strategy
+
+When making predictions using an `ALSModel`, it is common to encounter users and/or items in the
+test dataset that were not present during training the model. This typically occurs in two
+scenarios:
+
+1. In production, for new users or items that have no rating history and on which the model has not
+been trained (this is the "cold start problem")
+2. During cross-validation, the data is split between training and evaluation sets. When using
+simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually
+very common to encounter users and/or items in the evaluation set that are not in the training set
+
+By default, Spark assigns `NaN` predictions during `ALSModel.transform` when a user and/or item
+factor is not present in the model. This can be useful in a production system, since it indicates
+a new user or item, and so the system can make a decision on some fallback to use as the prediction.
+
+However, this is undesirable during cross-validation, since any `NaN` predicted values will result
+in `NaN` results for the evaluation metric (for example when using `RegressionEvaluator`).
+This makes model selection impossible.
+
+Spark allows users to set the `coldStartStrategy` parameter
+to `drop` in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
+The resulting evaluation metric will then be computed over the non-`NaN` data and will be valid.
+This is illustrated in the example below.
+
 **Examples**
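The doc text added in this patch describes the behaviour in prose only. The sketch below is not part of the patch; it is a minimal, self-contained Scala illustration of the same point. The toy ratings, the local `SparkSession`, and the column names are illustrative assumptions, while the `ALS`, `RegressionEvaluator`, and `ALSModel.setColdStartStrategy` calls are the same APIs used in the example files touched in the next patch.

```scala
// Minimal sketch of the cold-start behaviour described in the doc text above.
// The tiny in-memory ratings are made up for illustration only.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ColdStartSketch").master("local[*]").getOrCreate()
import spark.implicits._

// User 2 has no ratings in the training data, so it is a "cold start" user at prediction time.
val training = Seq((0, 10, 4.0), (0, 11, 3.0), (1, 10, 5.0), (1, 12, 1.0))
  .toDF("userId", "movieId", "rating")
val test = Seq((0, 12, 2.0), (2, 10, 4.0)).toDF("userId", "movieId", "rating")

val als = new ALS()
  .setMaxIter(5).setRegParam(0.01)
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
val model = als.fit(training)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

// Default strategy ("nan"): user 2 gets a NaN prediction, so the RMSE itself is NaN.
println(evaluator.evaluate(model.transform(test)))

// "drop" strategy: rows with NaN predictions are removed, giving a valid RMSE over the rest.
model.setColdStartStrategy("drop")
println(evaluator.evaluate(model.transform(test)))
```

In a production system the default behaviour can instead be kept, and a `NaN` prediction used as the signal to fall back to some other recommendation, as the doc text notes.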
From cd923e2791692d9dd81d7186033a1bfe22aab80d Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 28 Feb 2017 16:47:03 +0200
Subject: [PATCH 2/4] Update examples

---
 .../java/org/apache/spark/examples/ml/JavaALSExample.java    | 2 ++
 examples/src/main/python/ml/als_example.py                   | 4 +++-
 .../main/scala/org/apache/spark/examples/ml/ALSExample.scala | 2 ++
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
index 33ba668b32fc..81970b7c81f4 100644
--- a/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
+++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaALSExample.java
@@ -103,6 +103,8 @@ public static void main(String[] args) {
     ALSModel model = als.fit(training);
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop");
     Dataset<Row> predictions = model.transform(test);
 
     RegressionEvaluator evaluator = new RegressionEvaluator()
diff --git a/examples/src/main/python/ml/als_example.py b/examples/src/main/python/ml/als_example.py
index 1a979ff5b5be..2e7214ed56f9 100644
--- a/examples/src/main/python/ml/als_example.py
+++ b/examples/src/main/python/ml/als_example.py
@@ -44,7 +44,9 @@
     (training, test) = ratings.randomSplit([0.8, 0.2])
 
     # Build the recommendation model using ALS on the training data
-    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")
+    # Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
+              coldStartStrategy="drop")
     model = als.fit(training)
 
     # Evaluate the model by computing the RMSE on the test data
diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
index bb5d16360849..868f49b16f21 100644
--- a/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
+++ b/examples/src/main/scala/org/apache/spark/examples/ml/ALSExample.scala
@@ -65,6 +65,8 @@ object ALSExample {
     val model = als.fit(training)
 
     // Evaluate the model by computing the RMSE on the test data
+    // Note we set cold start strategy to 'drop' to ensure we don't get NaN evaluation metrics
+    model.setColdStartStrategy("drop")
     val predictions = model.transform(test)
 
     val evaluator = new RegressionEvaluator()

From 4c2c78c82101a2aec8f7f0634781869e1b4d0184 Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Tue, 28 Feb 2017 16:57:28 +0200
Subject: [PATCH 3/4] Clean up doc and add note about future strategies

---
 docs/ml-collaborative-filtering.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index ee0ff3a11012..adc510aa4e94 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -81,8 +81,11 @@ This makes model selection impossible.
 
 Spark allows users to set the `coldStartStrategy` parameter
 to `drop` in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
-The resulting evaluation metric will then be computed over the non-`NaN` data and will be valid.
-This is illustrated in the example below.
+The evaluation metric will then be computed over the non-`NaN` data and will be valid.
+Usage of this parameter is illustrated in the example below.
+
+**Note:** currently the supported cold start strategies are `nan` (the default behavior mentioned
+above) and `drop`. Further strategies may be supported in future versions.
 
 **Examples**

From c422d5892fe3c8ed2fd8c4f3bf4978b9ced2bb02 Mon Sep 17 00:00:00 2001
From: Nick Pentreath
Date: Wed, 1 Mar 2017 20:43:11 +0200
Subject: [PATCH 4/4] Clean up review comments

---
 docs/ml-collaborative-filtering.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
index adc510aa4e94..58f2d4b531e7 100644
--- a/docs/ml-collaborative-filtering.md
+++ b/docs/ml-collaborative-filtering.md
@@ -66,7 +66,7 @@ test dataset that were not present during training the model. This typically occ
 scenarios:
 
 1. In production, for new users or items that have no rating history and on which the model has not
-been trained (this is the "cold start problem")
+been trained (this is the "cold start problem").
 2. During cross-validation, the data is split between training and evaluation sets. When using
 simple random splits as in Spark's `CrossValidator` or `TrainValidationSplit`, it is actually
 very common to encounter users and/or items in the evaluation set that are not in the training set
@@ -80,12 +80,12 @@ in `NaN` results for the evaluation metric (for example when using `RegressionEv
 This makes model selection impossible.
 
 Spark allows users to set the `coldStartStrategy` parameter
-to `drop` in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
+to "drop" in order to drop any rows in the `DataFrame` of predictions that contain `NaN` values.
 The evaluation metric will then be computed over the non-`NaN` data and will be valid.
 Usage of this parameter is illustrated in the example below.
 
-**Note:** currently the supported cold start strategies are `nan` (the default behavior mentioned
-above) and `drop`. Further strategies may be supported in future versions.
+**Note:** currently the supported cold start strategies are "nan" (the default behavior mentioned
+above) and "drop". Further strategies may be supported in future.
 
 **Examples**
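The doc text motivates the "drop" strategy mainly by model selection with `CrossValidator` or `TrainValidationSplit`, but the patch's examples only show a single train/test split. Below is a hedged Scala sketch, not part of the patch, of that tuning use case; the parameter grid values and the `ratings` DataFrame (with the same columns as the examples) are assumptions.

```scala
// Sketch: setting coldStartStrategy to "drop" so cross-validation metrics stay finite.
// With the default "nan" strategy, any fold whose evaluation split contains an unseen
// user or item would score NaN and no best model could be chosen.
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val als = new ALS()
  .setUserCol("userId").setItemCol("movieId").setRatingCol("rating")
  .setColdStartStrategy("drop")  // metrics are computed only over rows with non-NaN predictions

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")

// Illustrative grid; the values are not tuned.
val paramGrid = new ParamGridBuilder()
  .addGrid(als.regParam, Array(0.01, 0.1))
  .addGrid(als.rank, Array(5, 10))
  .build()

val cv = new CrossValidator()
  .setEstimator(als)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// `ratings` is assumed to be a DataFrame of (userId, movieId, rating), as in the examples above.
val cvModel = cv.fit(ratings)
```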