[Spark-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest #18428

ajaysaini725 · 2017-06-26T23:57:00Z

What changes were proposed in this pull request?

Added functionality for CrossValidator and TrainValidationSplit to persist nested estimators such as OneVsRest. Also added CrossValidator and TrainValidationSplit persistence to PySpark.

How was this patch tested?

Performed both cross validation and train validation split with a OneVsRest estimator and tested read/write functionality of the estimator parameter maps required by these meta-algorithms.

ajaysaini725 · 2017-06-26T23:57:47Z

@jkbradley @thunterdb Could you please review this?

SparkQA · 2017-06-27T00:57:39Z

Test build #78664 has finished for PR 18428 at commit 8390437.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the PR! There's one catch we may be able to address later, but overall, I think my comments are all small.

jkbradley · 2017-06-27T01:45:15Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+          .setClassifier(new LogisticRegression)
+    val evaluator = new BinaryClassificationEvaluator()
+      .setMetricName("areaUnderPR")  // not default metric
+


style: remove extra newline

jkbradley · 2017-06-27T01:45:47Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+    val ova = new OneVsRest()
+          .setClassifier(new LogisticRegression)
+    val evaluator = new BinaryClassificationEvaluator()
+      .setMetricName("areaUnderPR")  // not default metric


Is this needed for this unit test?

jkbradley · 2017-06-27T01:45:49Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala


+  test("read/write: CrossValidator with nested estimator") {
+    val ova = new OneVsRest()
+          .setClassifier(new LogisticRegression)


style: fix indentation

jkbradley · 2017-06-27T01:47:16Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+    val classifier1 = new LogisticRegression().setRegParam(2.0)
+    val classifier2 = new LogisticRegression().setRegParam(3.0)
+    val paramMaps = new ParamGridBuilder()
+      .addGrid(ova.classifier, Array(classifier1, classifier2))


Add comment that it is important to test Param values which inherit from Params.

jkbradley · 2017-06-27T02:47:04Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+    cv2.getEstimator match {
+      case ova2: OneVsRest =>
+        assert(ova.uid === ova2.uid)
+        assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter


Check type of classifier before casting

jkbradley · 2017-06-27T06:10:32Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

-          Map("parent" -> p.parent, "name" -> p.name, "value" -> p.jsonEncode(v))
+          v match {
+            case writeableObj: MLWritable =>
+              numParamsNotJson += 1


nit: move this down 1 line to index from 0

jkbradley · 2017-06-27T06:14:11Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+          v match {
+            case writeableObj: MLWritable =>
+              numParamsNotJson += 1
+              val paramPath = new Path(path, "param" + p.name + numParamsNotJson).toString


How about changing the prefix "param" -> "epm_"?

jkbradley · 2017-06-27T06:18:17Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

+              param -> value
+            } else {
+              val path = param.jsonDecode(pInfo("value")).toString
+              val value = DefaultParamsReader.loadParamsInstance[MLWritable](path, sc)


This is OK with me for now since it will address all cases I've seen. In the future, it'd be great to make this more general by allowing it to read any MLReadable type (not just DefaultParamsReadable). I'll comment in the save() section above about this too.

jkbradley · 2017-06-27T06:19:20Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

        paramMap.toSeq.map { case ParamPair(p, v) =>
-          Map("parent" -> p.parent, "name" -> p.name, "value" -> p.jsonEncode(v))
+          v match {
+            case writeableObj: MLWritable =>


Per my comment below in the load() section, this should be restricted to DefaultParamsWritable for now. Could you please do so, but also add a check which throws an error if v is MLWritable but not DefaultParamsWritable?
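The requested check can be sketched in plain Python. This is only an illustration of the dispatch order, not the Scala code in ValidatorParams.scala: the classes `MLWritable` and `DefaultParamsWritable` below are hypothetical stand-ins, and `encode_param`, `base_path`, and `index` are names invented for the example.

```python
import json


class MLWritable:
    """Hypothetical stand-in for a value that knows how to persist itself."""

    def save(self, path):
        raise NotImplementedError


class DefaultParamsWritable(MLWritable):
    """Hypothetical stand-in for a value the loader can read back."""

    def __init__(self, uid):
        self.uid = uid
        self.saved_to = None

    def save(self, path):
        # A real implementation would write metadata; here we only record the path.
        self.saved_to = path


def encode_param(name, value, base_path, index):
    """Return the metadata entry for one (param, value) pair."""
    if isinstance(value, DefaultParamsWritable):
        # Nested estimator: persist it separately and record its location
        # relative to base_path instead of JSON-encoding the value itself.
        relative = "epm_{}{}".format(name, index)
        value.save("{}/{}".format(base_path, relative))
        return {"name": name, "value": json.dumps(relative), "isJson": "false"}
    if isinstance(value, MLWritable):
        # Writable, but not in a form the loader in this sketch can read back.
        raise NotImplementedError(
            "param {!r} is MLWritable but not DefaultParamsWritable".format(name)
        )
    # Simple value: plain JSON encoding.
    return {"name": name, "value": json.dumps(value), "isJson": "true"}
```

The more specific type is tested first, so the error branch only fires for values that are writable but cannot yet be loaded back.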

SparkQA · 2017-06-29T22:39:33Z

Test build #78935 has finished for PR 18428 at commit a0dbf6c.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-06-29T23:46:45Z

Test build #78936 has finished for PR 18428 at commit 253f39e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the update! The fixes for my original comments look good.

I did a pass over the new parts as well. My main question is whether we can eliminate more of the duplicated code.

I may be out of touch for a week, so please ping others as well, e.g. @yinxusen (who worked on this long ago), @thunterdb, or @sueann.

jkbradley · 2017-07-02T06:43:40Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+        classifier match {
+          case lr: LogisticRegression =>
+            assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter
+              === lr.asInstanceOf[LogisticRegression].getMaxIter)


lr is already of type LogisticRegression (no need to cast)

jkbradley · 2017-07-02T06:43:58Z

mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala

+        classifier match {
+          case lr: LogisticRegression =>
+            assert(ova.getClassifier.asInstanceOf[LogisticRegression].getMaxIter
+              === lr.asInstanceOf[LogisticRegression].getMaxIter)


lr is already of type LogisticRegression (no need to cast)

jkbradley · 2017-07-02T06:55:26Z

mllib/src/test/scala/org/apache/spark/ml/tuning/TrainValidationSplitSuite.scala

+   * Assert sequences of estimatorParamMaps are identical.
+   * Params must be simple types comparable with `===`.
+   */
+  def compareParamMaps(pMaps: Array[ParamMap], pMaps2: Array[ParamMap]): Unit = {


If this is the same as in CrossValidatorSuite, then can you please move them to a shared file (maybe ValidatorParamsSuite)?

jkbradley · 2017-07-02T07:00:29Z

mllib/src/test/scala/org/apache/spark/ml/tuning/CrossValidatorSuite.scala

+      .setEstimator(ova)
+      .setEvaluator(evaluator)
+      .setNumFolds(20)
+      .setEstimatorParamMaps(paramMaps)


Please compare the original + the loaded estimatorParamMaps

Same for the TrainValidationSplitSuite

jkbradley · 2017-07-02T07:03:34Z

python/pyspark/ml/classification.py


+    def _make_java_param_pair(self, param, value):
+        """
+        Makes a Java parm pair.


correct typo: parm -> param (and in original please)

jkbradley · 2017-07-02T07:25:48Z

python/pyspark/ml/tests.py

+        loadedModel = CrossValidatorModel.load(cvModelPath)
+        self.assertEqual(loadedModel.bestModel.uid, cvModel.bestModel.uid)
+
+    def test_save_load_nested_stimator(self):


fix typo "stimator"

jkbradley · 2017-07-02T07:28:56Z

python/pyspark/ml/tuning.py

        """
        return self.getOrDefault(self.evaluator)

+    def getEvaluator(self):


duplicate of above method?

jkbradley · 2017-07-02T07:30:51Z

python/pyspark/ml/tuning.py

+        return JavaMLWriter(self)
+
+    @since("2.3.0")
+    def save(self, path):


no need to copy this here; it can use the one in MLWritable

jkbradley · 2017-07-02T07:31:12Z

python/pyspark/ml/tuning.py

+        return JavaMLReader(cls)
+
+    @classmethod
+    @since("2.3.0")


Don't add since annotations to private methods

jkbradley · 2017-07-02T07:32:47Z

python/pyspark/ml/wrapper.py

        param = self._resolveParam(param)
        java_param = self._java_obj.getParam(param.name)
-        java_value = _py2java(sc, value)
+        if isinstance(value, Estimator) or isinstance(value, Model):


check for instances of JavaParams instead?

SparkQA · 2017-07-06T21:10:38Z

Test build #79297 has finished for PR 18428 at commit 44342b2.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class OneVsRest(Estimator, OneVsRestParams, JavaMLReadable, JavaMLWritable):

SparkQA · 2017-07-06T22:07:27Z

Test build #79299 has finished for PR 18428 at commit 823593d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-07-12T01:43:52Z

I couldn't think of a great way to reduce code duplication between JavaWrapper and OneVsRest.

One thing I realized: This may break backwards compatibility. Let's fix that. We unfortunately don't have a good way to test backwards compatibility, so I'd recommend testing manually (saving a model before your patch and loading it back after your patch).

jkbradley · 2017-07-12T01:55:31Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

            val param = est.getParam(pInfo("name"))
-            val value = param.jsonDecode(pInfo("value"))
-            param -> value
+            if (pInfo("isJson").toBoolean.booleanValue()) {


I think fixing backwards compatibility will just mean testing for whether the field "isJson" is present here
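A minimal Python sketch of that presence test follows. It is an illustration under assumptions, not the Scala code from the patch: `decode_param` and the `load_nested` callback are hypothetical names, and the entry is modeled as a plain dict of strings.

```python
import json


def decode_param(p_info, load_nested):
    """Decode one persisted param entry.

    `load_nested` is a hypothetical callback that loads a separately
    persisted value from the location stored in the entry.
    """
    # Entries written before the "isJson" field existed have no such key,
    # so a missing key is treated the same as isJson == "true".
    is_json = p_info.get("isJson", "true") == "true"
    if is_json:
        return json.loads(p_info["value"])
    # Otherwise the stored value is a JSON string naming where the nested
    # object was saved.
    return load_nested(json.loads(p_info["value"]))
```

Defaulting the missing key to the old behavior means files saved before the change load exactly as they did before.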

jkbradley · 2017-07-12T01:57:44Z

Also, can you please add "OneVsRest" to the PR and JIRA titles since this touches that class?

SparkQA · 2017-07-12T19:24:51Z

Test build #79568 has finished for PR 18428 at commit f169aa5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-07-12T20:04:37Z

mllib/src/main/scala/org/apache/spark/ml/tuning/ValidatorParams.scala

-            val value = param.jsonDecode(pInfo("value"))
-            param -> value
+            if (!pInfo.contains("isJson") ||
+               (pInfo.contains("isJson") && pInfo("isJson").toBoolean.booleanValue())) {


style nit: indent line 202 +1 space

Also, could you please add a comment saying that SPARK-21221 introduced the "isJson" field?

jkbradley · 2017-07-12T22:26:13Z

Rats, one more thing: We need to use relative paths, not absolute ones, when we put paths in the persisted file. Could you please add a unit test which checks this, perhaps by saving a model, moving it, and then loading it?
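The save, move, load check described here can be sketched with the Python standard library alone. This is an assumption-laden illustration rather than the Spark test: the file names `metadata.json` and `epm_classifier0` and the helpers `save`, `load`, and `save_move_load` are all invented for the example.

```python
import json
import shutil
import tempfile
from pathlib import Path


def save(root, nested_text):
    """Write a nested file plus metadata that points at it by relative path."""
    root = Path(root)
    root.mkdir(parents=True)
    relative = "epm_classifier0"  # hypothetical name for the nested artifact
    (root / relative).write_text(nested_text)
    (root / "metadata.json").write_text(json.dumps({"value": relative}))


def load(root):
    """Resolve the stored relative path against wherever the model lives now."""
    root = Path(root)
    meta = json.loads((root / "metadata.json").read_text())
    return (root / meta["value"]).read_text()


def save_move_load(nested_text):
    """Save under one directory, move it, then load from the new location."""
    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "model"
        moved = Path(tmp) / "moved-model"
        save(original, nested_text)
        shutil.move(str(original), str(moved))
        return load(moved)
```

Had the metadata stored the absolute path of the original directory, the load after the move would fail, which is the failure the requested unit test is meant to catch.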

SparkQA · 2017-07-12T23:27:33Z

Test build #79574 has finished for PR 18428 at commit 7601df7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-14T02:17:42Z

Test build #79596 has finished for PR 18428 at commit 231ec55.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley

Thanks for the update! Just the style nit remains

jkbradley · 2017-07-14T21:08:03Z

mllib/src/test/scala/org/apache/spark/ml/tuning/ValidatorParamsSuiteHelpers.scala

+    Files.move(subDirWithUid.toPath, newSubdirWithUid.toPath, StandardCopyOption.ATOMIC_MOVE)
+
+    val loader = instance.getClass.getMethod("read")
+                  .invoke(null).asInstanceOf[MLReader[T]]


fix indentation

SparkQA · 2017-07-14T22:12:49Z

Test build #79622 has finished for PR 18428 at commit a6bd197.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-14T22:17:07Z

Test build #79623 has finished for PR 18428 at commit 6a7162d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2017-07-17T17:06:49Z

LGTM
Merging with master
Thanks @ajaysaini725 for the patch!

Added functionality for CrossValidator and TrainValidationSplit to pe…

8390437

…rsist nested estimators such as OneVsRest.

ajaysaini725 changed the title ~~[ML] CrossValidator and TrainValidationSplit Persist Nested Estimators~~ [Spark-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators Jun 27, 2017

jkbradley reviewed Jun 27, 2017

ajaysaini725 added 2 commits June 29, 2017 14:34

Responded to first round of code review on Scala nested estimator per…

76aece5

…sistence. Implemented python persistence for meta-algorithms. OneVsRest overrides necessary persistence functions. Code still has prints and comments that need to be cleaned up.

Cleaned up python meta algorithm persistence code. CrossValidation an…

a0dbf6c

…d TrainValidationSplit now persist estimators in both Scala and Python.

Small style fix.

253f39e

jkbradley reviewed Jul 2, 2017

ajaysaini725 added 3 commits July 6, 2017 13:40

Attempt at removing two of the duplicated persistence functions in On…

89be87e

…eVsRest. Does not work because the make java param pair function in wrapper.py does not recognize the uid set in self._java_obj in the OneVsRest constructor.

Made changes based on pull request comments.

44342b2

Added ValidatorParamsSuiteHelpers file.

823593d

jkbradley reviewed Jul 12, 2017

Fixed backwards compatibility issue

f169aa5

ajaysaini725 changed the title ~~[Spark-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators~~ [Spark-21221][ML] CrossValidator and TrainValidationSplit Persist Nested Estimators such as OneVsRest Jul 12, 2017

jkbradley reviewed Jul 12, 2017

Fixed small style change

7601df7

Changed saving of nested estimators to store the path of the estimato…

231ec55

…r as a relative path instead of absolute path.

jkbradley reviewed Jul 14, 2017

Fixed indentation

a6bd197

Better indentation fix

6a7162d

asfgit closed this in 7047f49 Jul 17, 2017
