From 80872b333fca2542e46b064551eadc5a9a3741b2 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Mon, 27 Jun 2016 15:19:42 +0200 Subject: [PATCH 1/8] Add breaking changes to ML migration guide --- docs/mllib-guide.md | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index c28d13732eed..6610bbe3aa0c 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -104,10 +104,12 @@ and the migration guide below will explain all changes between releases. ## From 1.6 to 2.0 +### Breaking changes The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include: -Deprecations: +There were several breaking changes in Spark 2.0, outlined below. +**Linear algebra classes for DataFrame-based APIs** * [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984): In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated. * [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784): @@ -125,8 +127,31 @@ Deprecations: In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`. * In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`. -Changes of behavior: +Spark's linear algebra dependencies were moved to a new project, `spark-mllib-local` (see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). +As part of this change, the linear algebra classes were moved to a new package, `spark.ml.linalg`. +The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, leading to a few breaking changes, predominantly in various model classes +(see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list). +**Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`. + +**Deprecated methods removed** + +Several deprecated methods were removed. + +In `spark.ml`: + +* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator` +* `weights` in `LinearRegression` and `LogisticRegression` + +In `spark.mllib`: + +* `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`) +* `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`) +* `defaultStategy` in `mllib.tree.configuration.Strategy` +* `build` in `mllib.tree.Node` +* libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils` + +A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). * [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780): `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegresson` for binary classification now. This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`: From 177a9e005a77aa851b6d8b78e98de67a95524257 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Tue, 28 Jun 2016 21:09:32 +0200 Subject: [PATCH 2/8] initial work --- docs/mllib-guide.md | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 6610bbe3aa0c..77daca9d1b42 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -134,6 +134,10 @@ The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` class **Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`. +_Converting vectors and matrices_ + +ghdhgf + **Deprecated methods removed** Several deprecated methods were removed. From 6ef09a31c4e8808277357359ac6d048d866ce9f0 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Tue, 28 Jun 2016 22:29:16 +0200 Subject: [PATCH 3/8] Update guide for vector/matrix conversion and clean up --- docs/mllib-guide.md | 105 ++++++++++++++++++++++++++++++++------------ 1 file changed, 78 insertions(+), 27 deletions(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 77daca9d1b42..89ebfe03fb93 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -105,50 +105,73 @@ and the migration guide below will explain all changes between releases. ## From 1.6 to 2.0 ### Breaking changes -The deprecations and changes of behavior in the `spark.mllib` or `spark.ml` packages include: -There were several breaking changes in Spark 2.0, outlined below. +There were several breaking changes in Spark 2.0, which are outlined below. **Linear algebra classes for DataFrame-based APIs** -* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984): - In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated. -* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784): - In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`, - the `numTrees` parameter has been deprecated in favor of `getNumTrees` method. -* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761): - In `spark.ml.param.Params`, the `validateParams` method has been deprecated. - We move all functionality in overridden methods to the corresponding `transformSchema`. -* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829): - In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated. - We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`. -* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900): - In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`. -* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644): - In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`. -* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`. -Spark's linear algebra dependencies were moved to a new project, `spark-mllib-local` (see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). +Spark's linear algebra dependencies were moved to a new project, `spark-mllib-local` +(see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). As part of this change, the linear algebra classes were moved to a new package, `spark.ml.linalg`. -The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, leading to a few breaking changes, predominantly in various model classes +The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, +leading to a few breaking changes, predominantly in various model classes (see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list). **Note:** the RDD-based APIs in `spark.mllib` continue to depend on the previous package `spark.mllib.linalg`. _Converting vectors and matrices_ -ghdhgf +While most pipeline components support backward compatibility for loading, +some existing `DataFrames` and pipelines in Spark versions prior to 2.0, that contain vector or matrix +columns, may need to be migrated to the new `spark.ml` vector and matrix types. +Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types +(and vice versa) can be found in `spark.mllib.util.MLUtils`. -**Deprecated methods removed** +
+
-Several deprecated methods were removed. +{% highlight scala %} +import org.apache.spark.mllib.util.MLUtils -In `spark.ml`: +val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); +val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); +{% endhighlight %} -* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator` -* `weights` in `LinearRegression` and `LogisticRegression` +Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail. +
+ +
+ +{% highlight java %} +import org.apache.spark.mllib.util.MLUtils; +import org.apache.spark.sql.Dataset; + +Dataset convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); +Dataset convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); +{% endhighlight %} + +Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail. +
+ +
+ +{% highlight python %} +from pyspark.mllib.util import MLUtils + +convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) +convertedMatrxDF = MLUtils.convertMatrixColumnsToML(matrixDF) +{% endhighlight %} + +Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail. +
+
-In `spark.mllib`: +**Deprecated methods removed** + +Several deprecated methods were removed in the `spark.mllib` and `spark.ml` packages: +* `setScoreCol` in `ml.evaluation.BinaryClassificationEvaluator` +* `weights` in `LinearRegression` and `LogisticRegression` in `spark.ml` * `setMaxNumIterations` in `mllib.optimization.LBFGS` (marked as `DeveloperApi`) * `treeReduce` and `treeAggregate` in `mllib.rdd.RDDFunctions` (these functions are available on `RDD`s directly, and were marked as `DeveloperApi`) * `defaultStategy` in `mllib.tree.configuration.Strategy` @@ -156,6 +179,34 @@ In `spark.mllib`: * libsvm loaders for multiclass and load/save labeledData methods in `mllib.util.MLUtils` A full list of breaking changes can be found at [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810). + +### Deprecations and changes of behavior + +**Deprecations** + +Deprecations in the `spark.mllib` and `spark.ml` packages include: + +* [SPARK-14984](https://issues.apache.org/jira/browse/SPARK-14984): + In `spark.ml.regression.LinearRegressionSummary`, the `model` field has been deprecated. +* [SPARK-13784](https://issues.apache.org/jira/browse/SPARK-13784): + In `spark.ml.regression.RandomForestRegressionModel` and `spark.ml.classification.RandomForestClassificationModel`, + the `numTrees` parameter has been deprecated in favor of `getNumTrees` method. +* [SPARK-13761](https://issues.apache.org/jira/browse/SPARK-13761): + In `spark.ml.param.Params`, the `validateParams` method has been deprecated. + We move all functionality in overridden methods to the corresponding `transformSchema`. +* [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829): + In `spark.mllib` package, `LinearRegressionWithSGD`, `LassoWithSGD`, `RidgeRegressionWithSGD` and `LogisticRegressionWithSGD` have been deprecated. + We encourage users to use `spark.ml.regression.LinearRegresson` and `spark.ml.classification.LogisticRegresson`. +* [SPARK-14900](https://issues.apache.org/jira/browse/SPARK-14900): + In `spark.mllib.evaluation.MulticlassMetrics`, the parameters `precision`, `recall` and `fMeasure` have been deprecated in favor of `accuracy`. +* [SPARK-15644](https://issues.apache.org/jira/browse/SPARK-15644): + In `spark.ml.util.MLReader` and `spark.ml.util.MLWriter`, the `context` method has been deprecated in favor of `session`. +* In `spark.ml.feature.ChiSqSelectorModel`, the `setLabelCol` method has been deprecated since it was not used by `ChiSqSelectorModel`. + +**Changes of behavior** + +Changes of behavior in the `spark.mllib` and `spark.ml` packages include: + * [SPARK-7780](https://issues.apache.org/jira/browse/SPARK-7780): `spark.mllib.classification.LogisticRegressionWithLBFGS` directly calls `spark.ml.classification.LogisticRegresson` for binary classification now. This will introduce the following behavior changes for `spark.mllib.classification.LogisticRegressionWithLBFGS`: From ac49f31cd83aa2755e0f1948f981beee66a16527 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Wed, 29 Jun 2016 12:52:17 +0200 Subject: [PATCH 4/8] review comment and add single instance conversion details --- docs/mllib-guide.md | 23 +++++++++++++++++++---- 1 file changed, 19 insertions(+), 4 deletions(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 89ebfe03fb93..2832a43b6459 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -110,9 +110,9 @@ There were several breaking changes in Spark 2.0, which are outlined below. **Linear algebra classes for DataFrame-based APIs** -Spark's linear algebra dependencies were moved to a new project, `spark-mllib-local` +Spark's linear algebra dependencies were moved to a new project, `mllib-local` (see [SPARK-13944](https://issues.apache.org/jira/browse/SPARK-13944)). -As part of this change, the linear algebra classes were moved to a new package, `spark.ml.linalg`. +As part of this change, the linear algebra classes were copied to a new package, `spark.ml.linalg`. The DataFrame-based APIs in `spark.ml` now depend on the `spark.ml.linalg` classes, leading to a few breaking changes, predominantly in various model classes (see [SPARK-14810](https://issues.apache.org/jira/browse/SPARK-14810) for a full list). @@ -127,14 +127,24 @@ columns, may need to be migrated to the new `spark.ml` vector and matrix types. Utilities for converting `DataFrame` columns from `spark.mllib.linalg` to `spark.ml.linalg` types (and vice versa) can be found in `spark.mllib.util.MLUtils`. +There are also utility methods available for converting single instances of +vectors and matrices. Use the `asML` method on a `mllib.linalg.Vector` / `mllib.linalg.Matrix` +for converting to `ml.linalg` types, and +`mllib.linalg.Vectors.fromML` / `mllib.linalg.Matrices.fromML` +for converting to `mllib.linalg` types. +
{% highlight scala %} import org.apache.spark.mllib.util.MLUtils -val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); -val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); +// convert DataFrame columns +val convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) +val convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) +// convert a single vector or matrix +val mlVec: org.apache.spark.ml.linalg.Vector = mllibVec.asML +val mlMat: org.apache.spark.ml.linalg.Matrix = mllibMat.asML {% endhighlight %} Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for further detail. @@ -146,8 +156,12 @@ Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib. import org.apache.spark.mllib.util.MLUtils; import org.apache.spark.sql.Dataset; +// convert DataFrame columns Dataset convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); Dataset convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); +// convert a single vector or matrix +org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML +org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML {% endhighlight %} Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail. @@ -158,6 +172,7 @@ Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils. {% highlight python %} from pyspark.mllib.util import MLUtils +# convert DataFrame columns convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) convertedMatrxDF = MLUtils.convertMatrixColumnsToML(matrixDF) {% endhighlight %} From c48b89418d6374407b0dd4cb33a716bbf0189bf5 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Thu, 30 Jun 2016 15:59:34 +0200 Subject: [PATCH 5/8] Add Python asML example code --- docs/mllib-guide.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 2832a43b6459..3b99ae00330a 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -174,7 +174,10 @@ from pyspark.mllib.util import MLUtils # convert DataFrame columns convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) -convertedMatrxDF = MLUtils.convertMatrixColumnsToML(matrixDF) +convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) +// convert a single vector or matrix +mlVec = mllibVec.asML() +mlMat = mllibMat.asML() {% endhighlight %} Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for further detail. From 91fd24cca2d1a5bb6be79b68d0d7091f64ca4e97 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Thu, 30 Jun 2016 16:00:03 +0200 Subject: [PATCH 6/8] fix comment --- docs/mllib-guide.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 3b99ae00330a..dcbfe11438fa 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -175,7 +175,7 @@ from pyspark.mllib.util import MLUtils # convert DataFrame columns convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF) convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF) -// convert a single vector or matrix +# convert a single vector or matrix mlVec = mllibVec.asML() mlMat = mllibMat.asML() {% endhighlight %} From c2ce7cd9659484cec325ed137271c9b6ed52923d Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Thu, 30 Jun 2016 16:00:33 +0200 Subject: [PATCH 7/8] Fix java ; --- docs/mllib-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index dcbfe11438fa..bf46fd4f03e6 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -160,8 +160,8 @@ import org.apache.spark.sql.Dataset; Dataset convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); Dataset convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); // convert a single vector or matrix -org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML +org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML; +org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML; {% endhighlight %} Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail. From 919bfe9c73fca485ff33a528feffa5b59b0b7e86 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Thu, 30 Jun 2016 16:02:46 +0200 Subject: [PATCH 8/8] Fix Java method invoc --- docs/mllib-guide.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index bf46fd4f03e6..17fd3e1edf4b 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -160,8 +160,8 @@ import org.apache.spark.sql.Dataset; Dataset convertedVecDF = MLUtils.convertVectorColumnsToML(vecDF); Dataset convertedMatrixDF = MLUtils.convertMatrixColumnsToML(matrixDF); // convert a single vector or matrix -org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML; -org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML; +org.apache.spark.ml.linalg.Vector mlVec = mllibVec.asML(); +org.apache.spark.ml.linalg.Matrix mlMat = mllibMat.asML(); {% endhighlight %} Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for further detail.