Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 37 additions & 19 deletions docs/ml-guide.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ The primary Machine Learning API for Spark is now the [DataFrame](sql-programmin
* MLlib will still support the RDD-based API in `spark.mllib` with bug fixes.
* MLlib will not add new features to the RDD-based API.
* In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
* After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
* After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
* The RDD-based API is expected to be removed in Spark 3.0.

*Why is MLlib switching to the DataFrame-based API?*
Expand Down Expand Up @@ -66,41 +66,59 @@ To use MLlib in Python, you will need [NumPy](http://www.numpy.org) version 1.4
[^1]: To learn more about the benefits and background of system optimised natives, you may wish to
watch Sam Halliday's ScalaX talk on [High Performance Linear Algebra in Scala](http://fommil.github.io/scalax14/#/).

# Highlights in 2.2

The list below highlights some of the new features and enhancements added to MLlib in the `2.2`
release of Spark:

* `ALS` methods for _top-k_ recommendations for all users or items, matching the functionality
in `mllib` ([SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)). Performance
was also improved for both `ml` and `mllib`
([SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968) and
[SPARK-20587](https://issues.apache.org/jira/browse/SPARK-20587))
* `Correlation` and `ChiSquareTest` stats functions for `DataFrames`
([SPARK-19636](https://issues.apache.org/jira/browse/SPARK-19636) and
[SPARK-19635](https://issues.apache.org/jira/browse/SPARK-19635))
* `FPGrowth` algorithm for frequent pattern mining
([SPARK-14503](https://issues.apache.org/jira/browse/SPARK-14503))
* `GLM` now supports the full `Tweedie` family
([SPARK-18929](https://issues.apache.org/jira/browse/SPARK-18929))
* `Imputer` feature transformer to impute missing values in a dataset
([SPARK-13568](https://issues.apache.org/jira/browse/SPARK-13568))
* `LinearSVC` for linear Support Vector Machine classification
([SPARK-14709](https://issues.apache.org/jira/browse/SPARK-14709))
* Logistic regression now supports constraints on the coefficients during training
([SPARK-20047](https://issues.apache.org/jira/browse/SPARK-20047))

# Migration guide

MLlib is under active development.
The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
and the migration guide below will explain all changes between releases.

## From 2.0 to 2.1
## From 2.1 to 2.2

### Breaking changes

**Deprecated methods removed**

* `setLabelCol` in `feature.ChiSqSelectorModel`
* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
* `validateParams` in `Evaluator`
There are no breaking changes.

### Deprecations and changes of behavior

**Deprecations**

* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`
There are no deprecations.

**Changes of behavior**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we include #17233 in this section?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - didn't catch that one


* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

* [SPARK-19787](https://issues.apache.org/jira/browse/SPARK-19787):
Default value of `regParam` changed from `1.0` to `0.1` for `ALS.train` method (marked `DeveloperApi`).
**Note** this does _not affect_ the `ALS` Estimator or Model, nor MLlib's `ALS` class.
* [SPARK-14772](https://issues.apache.org/jira/browse/SPARK-14772):
Fixed inconsistency between Python and Scala APIs for `Param.copy` method.
* [SPARK-11569](https://issues.apache.org/jira/browse/SPARK-11569):
`StringIndexer` now handles `NULL` values in the same way as unseen values. Previously an exception
would always be thrown regardless of the setting of the `handleInvalid` parameter.

## Previous Spark versions

Earlier migration guides are archived [on this page](ml-migration-guides.html).
Expand Down
29 changes: 29 additions & 0 deletions docs/ml-migration-guides.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,35 @@ description: MLlib migration guides from before Spark SPARK_VERSION_SHORT

The migration guide for the current Spark version is kept on the [MLlib Guide main page](ml-guide.html#migration-guide).

## From 2.0 to 2.1

### Breaking changes

**Deprecated methods removed**

* `setLabelCol` in `feature.ChiSqSelectorModel`
* `numTrees` in `classification.RandomForestClassificationModel` (This now refers to the Param called `numTrees`)
* `numTrees` in `regression.RandomForestRegressionModel` (This now refers to the Param called `numTrees`)
* `model` in `regression.LinearRegressionSummary`
* `validateParams` in `PipelineStage`
* `validateParams` in `Evaluator`

### Deprecations and changes of behavior

**Deprecations**

* [SPARK-18592](https://issues.apache.org/jira/browse/SPARK-18592):
Deprecate all Param setter methods except for input/output column Params for `DecisionTreeClassificationModel`, `GBTClassificationModel`, `RandomForestClassificationModel`, `DecisionTreeRegressionModel`, `GBTRegressionModel` and `RandomForestRegressionModel`

**Changes of behavior**

* [SPARK-17870](https://issues.apache.org/jira/browse/SPARK-17870):
Fix a bug of `ChiSqSelector` which will likely change its result. Now `ChiSquareSelector` use pValue rather than raw statistic to select a fixed number of top features.
* [SPARK-3261](https://issues.apache.org/jira/browse/SPARK-3261):
`KMeans` returns potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
* [SPARK-17389](https://issues.apache.org/jira/browse/SPARK-17389):
`KMeans` reduces the default number of steps from 5 to 2 for the k-means|| initialization mode.

## From 1.6 to 2.0

### Breaking changes
Expand Down