[SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups #8752
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,7 +32,21 @@ See the [algorithm guides](#algorithm-guides) section below for guides on sub-pa | |
| * This will become a table of contents (this text will be scraped). | ||
| {:toc} | ||
|
|
||
| # Main concepts | ||
| # Algorithm guides | ||
|
|
||
| We provide several algorithm guides specific to the Pipelines API. | ||
| Several of these algorithms, such as certain feature transformers, are not in the `spark.mllib` API. | ||
| Also, some algorithms have additional capabilities in the `spark.ml` API; e.g., random forests | ||
| provide class probabilities, and linear models provide model summaries. | ||
|
|
||
| * [Feature extraction, transformation, and selection](ml-features.html) | ||
| * [Decision Trees for classification and regression](ml-decision-tree.html) | ||
| * [Ensembles](ml-ensembles.html) | ||
| * [Linear methods with elastic net regularization](ml-linear-methods.html) | ||
|
Contributor: Not sure if we should include model summaries in this description; I had a mailing list question about where that feature is documented.
Author (Member): Yeah, there's not a great place. I'll try sticking a note here. |
||
| * [Multilayer perceptron classifier](ml-ann.html) | ||
|
|
||
|
|
||
| # Main concepts in Pipelines | ||
|
|
||
| Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple | ||
| algorithms into a single pipeline, or workflow. | ||
|
|
@@ -166,6 +180,11 @@ compile-time type checking. | |
| `Pipeline`s and `PipelineModel`s instead do runtime checking before actually running the `Pipeline`. | ||
| This type checking is done using the `DataFrame` *schema*, a description of the data types of columns in the `DataFrame`. | ||
|
|
||
| *Unique Pipeline stages*: A `Pipeline`'s stages should be unique instances. E.g., the same instance | ||
| `myHashingTF` should not be inserted into the `Pipeline` twice since `Pipeline` stages must have | ||
| unique IDs. However, different instances `myHashingTF1` and `myHashingTF2` (both of type `HashingTF`) | ||
| can be put into the same `Pipeline` since different instances will be created with different IDs. | ||
|
|
||
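The uniqueness requirement above can be illustrated with a minimal Python sketch. The `Stage` and `make_pipeline` names here are hypothetical stand-ins, not Spark's API: each stage instance carries its own generated ID, so inserting the same instance twice is detectable, while two separate instances of the same class are fine.

```python
import uuid

class Stage:
    """Hypothetical stand-in for a PipelineStage: every instance gets a fresh unique ID."""
    def __init__(self, name):
        self.uid = name + "_" + uuid.uuid4().hex[:8]

def make_pipeline(stages):
    """Reject duplicate stage instances, mirroring the unique-ID requirement."""
    uids = [s.uid for s in stages]
    if len(set(uids)) != len(uids):
        raise ValueError("Pipeline stages must have unique IDs")
    return stages

tf1, tf2 = Stage("hashingTF"), Stage("hashingTF")
make_pipeline([tf1, tf2])      # fine: two instances, two distinct IDs
try:
    make_pipeline([tf1, tf1])  # same instance twice: rejected
except ValueError as e:
    print(e)
```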
| ## Parameters | ||
|
|
||
| Spark ML `Estimator`s and `Transformer`s use a uniform API for specifying parameters. | ||
|
|
@@ -184,16 +203,6 @@ Parameters belong to specific instances of `Estimator`s and `Transformer`s. | |
| For example, if we have two `LogisticRegression` instances `lr1` and `lr2`, then we can build a `ParamMap` with both `maxIter` parameters specified: `ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)`. | ||
| This is useful if there are two algorithms with the `maxIter` parameter in a `Pipeline`. | ||
|
|
||
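Conceptually, a `ParamMap` keys each override by the owning instance, which is what lets two `maxIter` settings coexist in one map. A rough Python sketch of that idea (the `Estimator` class and `copy_with` method here are hypothetical stand-ins, not Spark's API):

```python
class Estimator:
    """Hypothetical stand-in for an Estimator exposing a maxIter param."""
    _count = 0
    def __init__(self):
        Estimator._count += 1
        self.uid = "lr_%d" % Estimator._count
        self.params = {"maxIter": 100}   # default value
    def copy_with(self, param_map):
        """Apply only the overrides addressed to this specific instance."""
        out = dict(self.params)
        out.update(param_map.get(self.uid, {}))
        return out

lr1, lr2 = Estimator(), Estimator()
# analogous to ParamMap(lr1.maxIter -> 10, lr2.maxIter -> 20)
param_map = {lr1.uid: {"maxIter": 10}, lr2.uid: {"maxIter": 20}}
print(lr1.copy_with(param_map)["maxIter"])  # 10
print(lr2.copy_with(param_map)["maxIter"])  # 20
```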
| # Algorithm guides | ||
|
|
||
| There are now several algorithms in the Pipelines API which are not in the `spark.mllib` API, so we link to documentation for them here. These algorithms are mostly feature transformers, which fit naturally into the `Transformer` abstraction in Pipelines, and ensembles, which fit naturally into the `Estimator` abstraction in the Pipelines. | ||
|
|
||
| * [Feature extraction, transformation, and selection](ml-features.html) | ||
| * [Decision Trees for classification and regression](ml-decision-tree.html) | ||
| * [Ensembles](ml-ensembles.html) | ||
| * [Linear methods with elastic net regularization](ml-linear-methods.html) | ||
| * [Multilayer perceptron classifier](ml-ann.html) | ||
|
|
||
| # Code examples | ||
|
|
||
| This section gives code examples illustrating the functionality discussed above. | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -380,35 +380,43 @@ data2 = labels.zip(normalizer2.transform(features)) | |
| </div> | ||
| </div> | ||
|
|
||
| ## Feature selection | ||
| [Feature selection](http://en.wikipedia.org/wiki/Feature_selection) allows selecting the most relevant features for use in model construction. Feature selection reduces the size of the vector space and, in turn, the complexity of any subsequent operation with vectors. The number of features to select can be tuned using a held-out validation set. | ||
| ## ChiSqSelector | ||
|
|
||
| ### ChiSqSelector | ||
| [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) stands for Chi-Squared feature selection. It operates on labeled data with categorical features. `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, and then filters (selects) the top features which the class label depends on the most. This is akin to yielding the features with the most predictive power. | ||
| [Feature selection](http://en.wikipedia.org/wiki/Feature_selection) tries to identify relevant | ||
| features for use in model construction. It reduces the size of the feature space, which can improve | ||
| both speed and statistical learning behavior. | ||
|
|
||
| #### Model Fitting | ||
| [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) implements | ||
| Chi-Squared feature selection. It operates on labeled data with categorical features. | ||
| `ChiSqSelector` orders features based on a Chi-Squared test of independence from the class, | ||
| and then filters (selects) the top features which the class label depends on the most. | ||
| This is akin to yielding the features with the most predictive power. | ||
|
|
||
| [`ChiSqSelector`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) has the | ||
| following parameters in the constructor: | ||
| The number of features to select can be tuned using a held-out validation set. | ||
|
|
||
| * `numTopFeatures` number of top features that the selector will select (filter). | ||
| ### Model Fitting | ||
|
|
||
| We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method in | ||
| `ChiSqSelector` which can take an input of `RDD[LabeledPoint]` with categorical features, learn the summary statistics, and then | ||
| return a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space. | ||
| `ChiSqSelector` takes a `numTopFeatures` parameter specifying the number of top features that | ||
| the selector will select. | ||
|
|
||
| This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) | ||
| which can apply the Chi-Squared feature selection on a `Vector` to produce a reduced `Vector` or on | ||
| The [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) method takes | ||
| an input of `RDD[LabeledPoint]` with categorical features, learns the summary statistics, and then | ||
| returns a `ChiSqSelectorModel` which can transform an input dataset into the reduced feature space. | ||
| The `ChiSqSelectorModel` can be applied either to a `Vector` to produce a reduced `Vector`, or to | ||
| an `RDD[Vector]` to produce a reduced `RDD[Vector]`. | ||
|
|
||
| Note that the user can also construct a `ChiSqSelectorModel` by hand by providing an array of selected feature indices (which must be sorted in ascending order). | ||
|
|
||
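Conceptually, the fit step ranks each categorical feature by its chi-squared statistic of independence from the label and keeps the top ones. A plain-Python sketch of that idea (not the Spark API; `chi_sq_select` and `chi2_stat` are hypothetical helpers):

```python
from collections import Counter

def chi2_stat(feature_vals, labels):
    """Chi-squared statistic of independence between one categorical feature and the label."""
    n = len(labels)
    obs = Counter(zip(feature_vals, labels))  # observed contingency counts
    f_tot, l_tot = Counter(feature_vals), Counter(labels)
    stat = 0.0
    for f in f_tot:
        for l in l_tot:
            expected = f_tot[f] * l_tot[l] / n  # expected count under independence
            stat += (obs[(f, l)] - expected) ** 2 / expected
    return stat

def chi_sq_select(points, num_top_features):
    """points: list of (label, features) pairs; returns sorted indices of the top features."""
    labels = [lbl for lbl, _ in points]
    num_feats = len(points[0][1])
    stats = [chi2_stat([ft[i] for _, ft in points], labels) for i in range(num_feats)]
    ranked = sorted(range(num_feats), key=lambda i: stats[i], reverse=True)
    return sorted(ranked[:num_top_features])

# features 0 and 1 perfectly track the label; feature 2 is independent of it
data = [(0, [0, 1, 0]), (0, [0, 1, 1]), (1, [1, 0, 0]), (1, [1, 0, 1])]
print(chi_sq_select(data, 2))  # [0, 1]
```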
| #### Example | ||
| ### Example | ||
|
|
||
| The following example shows the basic use of ChiSqSelector. The data set used has a feature matrix consisting of greyscale values that vary from 0 to 255 for each feature. | ||
|
|
||
| <div class="codetabs"> | ||
| <div data-lang="scala"> | ||
| <div data-lang="scala" markdown="1"> | ||
|
|
||
| Refer to the [`ChiSqSelector` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ChiSqSelector) | ||
| for details on the API. | ||
|
|
||
| {% highlight scala %} | ||
| import org.apache.spark.SparkContext._ | ||
| import org.apache.spark.mllib.linalg.Vectors | ||
|
|
@@ -434,7 +442,11 @@ val filteredData = discretizedData.map { lp => | |
| {% endhighlight %} | ||
| </div> | ||
|
|
||
| <div data-lang="java"> | ||
| <div data-lang="java" markdown="1"> | ||
|
|
||
| Refer to the [`ChiSqSelector` Java docs](api/java/org/apache/spark/mllib/feature/ChiSqSelector.html) | ||
| for details on the API. | ||
|
|
||
| {% highlight java %} | ||
| import org.apache.spark.SparkConf; | ||
| import org.apache.spark.api.java.JavaRDD; | ||
|
|
@@ -486,7 +498,12 @@ sc.stop(); | |
|
|
||
| ## ElementwiseProduct | ||
|
|
||
| ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) between the input vector, `v` and transforming vector, `w`, to yield a result vector. | ||
| `ElementwiseProduct` multiplies each input vector by a provided "weight" vector, using element-wise | ||
| multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This | ||
| represents the [Hadamard product](https://en.wikipedia.org/wiki/Hadamard_product_%28matrices%29) | ||
| between the input vector `v` and the transforming vector `scalingVec`, to yield a result vector. | ||
| Denoting the `scalingVec` as "`w`," this transformation may be written as: | ||
|
|
||
| `\[ \begin{pmatrix} | ||
| v_1 \\ | ||
|
|
@@ -506,7 +523,7 @@ v_N | |
|
|
||
| [`ElementwiseProduct`](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) has the following parameter in the constructor: | ||
|
Contributor: Not part of your PR, but this API doc link would be better if it were inside the codetabs |
||
|
|
||
| * `w`: the transforming vector. | ||
| * `scalingVec`: the transforming vector. | ||
|
|
||
| `ElementwiseProduct` implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer) which can apply the weighting on a `Vector` to produce a transformed `Vector` or on an `RDD[Vector]` to produce a transformed `RDD[Vector]`. | ||
|
|
||
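The transformation itself is simple enough to sketch in a few lines of plain Python (not the Spark API); applying it to an `RDD[Vector]` amounts to mapping the same per-vector transform over the collection:

```python
def elementwise_product(scaling_vec, v):
    """Hadamard product: scale each component of v by the matching weight."""
    assert len(scaling_vec) == len(v), "vectors must have the same length"
    return [w * x for w, x in zip(scaling_vec, v)]

scaling_vec = [0.0, 1.0, 2.0]
print(elementwise_product(scaling_vec, [1.0, 1.0, 1.0]))  # [0.0, 1.0, 2.0]

# applying it across a dataset is just a map of the same transform
data = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
print([elementwise_product(scaling_vec, v) for v in data])
```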
|
|
||
The classname is backticked in ChiSqSelector but not here or in Binarizer; we should choose one and be consistent. I would vote for backticking everything, since that's what I've been doing.