38 changes: 23 additions & 15 deletions docs/mllib-guide.md
@@ -73,7 +73,7 @@ We list major functionality from both below, with links to detailed guides.
* [Advanced topics](ml-advanced.html)

Some techniques are not available yet in spark.ml, most notably dimensionality reduction.
Users can seemlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.
Users can seamlessly combine the implementation of these techniques found in `spark.mllib` with the rest of the algorithms found in `spark.ml`.

# Dependencies

@@ -100,24 +100,32 @@ MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 1.4 to 1.5
## From 1.5 to 1.6

In the `spark.mllib` package, there are no break API changes but several behavior changes:
There are no breaking API changes in the `spark.mllib` or `spark.ml` packages, but there are
deprecations and changes of behavior.

* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
`RegressionMetrics.explainedVariance` returns the average regression sum of squares.
* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become
sorted.
* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4.
Deprecations:

In the `spark.ml` package, there exists one break API change and one behavior change:
* [SPARK-11358](https://issues.apache.org/jira/browse/SPARK-11358):
In `spark.mllib.clustering.KMeans`, the `runs` parameter has been deprecated.
* [SPARK-10592](https://issues.apache.org/jira/browse/SPARK-10592):
In `spark.ml.classification.LogisticRegressionModel` and
`spark.ml.regression.LinearRegressionModel`, the `weights` field has been deprecated in favor of
the new name `coefficients`. This helps disambiguate from instance (row) "weights" given to
algorithms.

* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
from `Params.setDefault` due to a
[Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
Changes of behavior:

* [SPARK-7770](https://issues.apache.org/jira/browse/SPARK-7770):
`spark.mllib.tree.GradientBoostedTrees`: `validationTol` has changed semantics in 1.6.
Previously, it was a threshold for absolute change in error. Now, it resembles the behavior of
`GradientDescent`'s `convergenceTol`: For large errors, it uses relative error (relative to the
previous error); for small errors (`< 0.01`), it uses absolute error.
* [SPARK-11069](https://issues.apache.org/jira/browse/SPARK-11069):
`spark.ml.feature.RegexTokenizer`: Previously, it did not convert strings to lowercase before
tokenizing. Now, it converts to lowercase by default, with an option not to. This matches the
behavior of the simpler `Tokenizer` transformer.
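
The new `validationTol` rule described above can be sketched in plain Python. This is only an illustration of the wording in the note, not Spark's implementation: the function name, the argument names, and the choice to apply the `0.01` floor with `max()` are assumptions.

```python
def should_stop(previous_error, current_error, validation_tol, floor=0.01):
    """Hypothetical early-stopping check mirroring the 1.6 description.

    The improvement is compared against validation_tol scaled by the previous
    error (relative rule). When the previous error is below `floor`, the scale
    is pinned to `floor`, which makes the threshold absolute.
    """
    improvement = previous_error - current_error
    threshold = validation_tol * max(previous_error, floor)
    return improvement < threshold


# Large error: the threshold is 0.001 * 1.0, so a 0.1 improvement keeps training.
print(should_stop(1.0, 0.9, 0.001))
# Small error: the threshold is pinned to 0.001 * 0.01, so an 8e-6 improvement stops.
print(should_stop(0.005, 0.004992, 0.001))
```

Under a purely relative rule the second call would compare against `0.001 * 0.005` instead, so the floor is what makes training stop there.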

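The `RegexTokenizer` change can be illustrated the same way. The sketch below uses Python's `re` module rather than Spark; the function name, the whitespace split pattern, and the `to_lowercase` argument are assumptions standing in for the option mentioned in the note.

```python
import re


def regex_tokenize(text, pattern=r"\s+", to_lowercase=True):
    """Hypothetical stand-in for a regex tokenizer that splits on `pattern`."""
    if to_lowercase:
        # 1.6 behavior per the note: lowercase before tokenizing.
        text = text.lower()
    # Drop empty strings produced by leading or trailing separators.
    return [token for token in re.split(pattern, text) if token]


print(regex_tokenize("Hello  Spark ML"))
print(regex_tokenize("Hello  Spark ML", to_lowercase=False))
```

Passing `to_lowercase=False` corresponds to the pre-1.6 behavior, so pipelines that depend on case-sensitive tokens would need to opt out explicitly.
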
## Previous Spark versions

19 changes: 19 additions & 0 deletions docs/mllib-migration-guides.md
@@ -7,6 +7,25 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT

The migration guide for the current Spark version is kept on the [MLlib Programming Guide main page](mllib-guide.html#migration-guide).

## From 1.4 to 1.5

In the `spark.mllib` package, there are no breaking API changes but several behavior changes:

* [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005):
`RegressionMetrics.explainedVariance` returns the average regression sum of squares.
* [SPARK-8600](https://issues.apache.org/jira/browse/SPARK-8600): `NaiveBayesModel.labels` become
sorted.
* [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382): `GradientDescent` has a default
convergence tolerance `1e-3`, and hence iterations might end earlier than 1.4.

In the `spark.ml` package, there exists one breaking API change and one behavior change:

* [SPARK-9268](https://issues.apache.org/jira/browse/SPARK-9268): Java's varargs support is removed
from `Params.setDefault` due to a
[Scala compiler bug](https://issues.scala-lang.org/browse/SI-9013).
* [SPARK-10097](https://issues.apache.org/jira/browse/SPARK-10097): `Evaluator.isLargerBetter` is
added to indicate metric ordering. Metrics like RMSE no longer flip signs as in 1.4.
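
As a rough illustration of why an ordering flag removes the need to flip metric signs, the sketch below picks the best candidate from raw metric values. It is plain Python, not Spark code; `pick_best`, the candidate names, and the metric values are hypothetical.

```python
def pick_best(candidates, is_larger_better):
    """Return the (name, metric) pair with the best metric.

    `candidates` is a list of (name, metric) tuples. The flag says whether a
    larger metric is better (e.g. accuracy) or a smaller one (e.g. RMSE).
    """
    chooser = max if is_larger_better else min
    return chooser(candidates, key=lambda pair: pair[1])


rmse_results = [("model_a", 0.8), ("model_b", 0.5)]
# RMSE is reported as-is; the flag tells the selector that smaller is better.
print(pick_best(rmse_results, is_larger_better=False))
```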

## From 1.3 to 1.4

In the `spark.mllib` package, there were several breaking changes, but all in `DeveloperApi` or `Experimental` APIs: