4 changes: 2 additions & 2 deletions docs/ml-features.md
@@ -1426,9 +1426,9 @@ categorical features. ChiSqSelector uses the
features to choose. It supports five selection methods: `numTopFeatures`, `percentile`, `fpr`, `fdr`, `fwe`:
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.
By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.
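As a rough illustration (plain Python, not Spark's implementation; the `select` helper and its signature are invented for this sketch), the five rules reduce to simple filters over per-feature p-values. The p-values below are the ones worked out for the six-feature toy dataset in the test suite later in this change:

```python
def select(p_values, method, param):
    """Return indices of features kept under the given selection rule."""
    n = len(p_values)
    ranked = sorted(range(n), key=lambda i: p_values[i])  # smallest p first
    if method == "numTopFeatures":
        return sorted(ranked[:param])
    if method == "percentile":
        return sorted(ranked[:int(param * n)])
    if method == "fpr":
        return [i for i in range(n) if p_values[i] < param]
    if method == "fdr":
        # Benjamini-Hochberg: largest k with p_(k) <= k/n * alpha,
        # then keep the k smallest p-values.
        k = max((j + 1 for j in range(n)
                 if p_values[ranked[j]] <= (j + 1) / n * param), default=0)
        return sorted(ranked[:k])
    if method == "fwe":
        # Threshold scaled by 1/numFeatures (Bonferroni-style).
        return [i for i in range(n) if p_values[i] < param / n]
    raise ValueError(method)

# Per-feature chi-squared p-values for the six-feature toy dataset.
pvals = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]
print(select(pvals, "numTopFeatures", 1))  # [0]
print(select(pvals, "fpr", 0.05))          # [0, 1]
print(select(pvals, "fdr", 0.12))          # [0]
print(select(pvals, "fwe", 0.12))          # [0]
```

Note how `fwe` with threshold 0.12 over six features compares each p-value against 0.12/6 = 0.02, so only the 0.017 feature survives.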

4 changes: 2 additions & 2 deletions docs/mllib-feature-extraction.md
@@ -231,9 +231,9 @@ features to choose. It supports five selection methods: `numTopFeatures`, `perce

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fpr` chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features set to 50.
The user can choose a selection method using `setSelectorType`.
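The 1/numFeatures scaling behind `fwe` is a Bonferroni-style correction: testing each of n features at level alpha/n bounds the chance of any false selection by roughly alpha. A small stdlib-only simulation (illustrative only, not part of Spark; the trial count and seed are arbitrary) shows the effect under the null hypothesis, where every p-value is uniform:

```python
import random

random.seed(42)
n_features, alpha, trials = 6, 0.12, 20000

# Under the null, each feature's p-value is Uniform(0, 1). Count trials
# in which at least one feature slips past the scaled threshold alpha/n.
false_hits = sum(
    any(random.random() < alpha / n_features for _ in range(n_features))
    for _ in range(trials)
)
fwer = false_hits / trials
print(round(fwer, 3))  # close to 1 - (1 - 0.02)**6, about 0.114, under alpha = 0.12
```

The observed family-wise error rate stays below the nominal alpha, which is exactly the guarantee `fwe` is after.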
@@ -143,13 +143,13 @@ private[feature] trait ChiSqSelectorParams extends Params
* `fdr`, `fwe`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
* positive rate of selection.
* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose p-values is below a threshold,
* thus controlling the family-wise error rate of selection.
* - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
* 1/numFeatures, thus controlling the family-wise error rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@@ -175,13 +175,13 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
* `fdr`, `fwe`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* - `fpr` chooses all features whose p-values are below a threshold, thus controlling the false
* positive rate of selection.
* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose p-values is below a threshold,
* thus controlling the family-wise error rate of selection.
* - `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
* 1/numFeatures, thus controlling the family-wise error rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@@ -35,22 +35,77 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext

// Toy dataset, including the top feature for a chi-squared test.
// These data are chosen such that each feature's test has a distinct p-value.
/* To verify the results with R, run:
library(stats)
x1 <- c(8.0, 0.0, 0.0, 7.0, 8.0)
x2 <- c(7.0, 9.0, 9.0, 9.0, 7.0)
x3 <- c(0.0, 6.0, 8.0, 5.0, 3.0)
y <- c(0.0, 1.0, 1.0, 2.0, 2.0)
chisq.test(x1,y)
chisq.test(x2,y)
chisq.test(x3,y)
/*
* Contingency tables
[Reviewer comment] It's kind of nice seeing these tables, but for validation, it's much easier for users to check using R. Could you please include a section for verifying the results with R, like there was before?
* feature1 = {6.0, 0.0, 8.0}
* class 0 1 2
* 6.0||1|0|0|
* 0.0||0|3|0|
* 8.0||0|0|2|
* degrees of freedom = 4, statistic = 12, pValue = 0.017
*
* feature2 = {7.0, 9.0}
* class 0 1 2
* 7.0||1|0|0|
* 9.0||0|3|2|
* degrees of freedom = 2, statistic = 6, pValue = 0.049
*
* feature3 = {0.0, 6.0, 3.0, 8.0}
* class 0 1 2
* 0.0||1|0|0|
* 6.0||0|1|2|
* 3.0||0|1|0|
* 8.0||0|1|0|
* degrees of freedom = 6, statistic = 8.66, pValue = 0.193
*
* feature4 = {7.0, 0.0, 5.0, 4.0}
* class 0 1 2
* 7.0||1|0|0|
* 0.0||0|2|0|
* 5.0||0|1|1|
* 4.0||0|0|1|
* degrees of freedom = 6, statistic = 9.5, pValue = 0.147
*
* feature5 = {6.0, 5.0, 4.0, 0.0}
* class 0 1 2
* 6.0||1|1|0|
* 5.0||0|2|0|
* 4.0||0|0|1|
* 0.0||0|0|1|
* degrees of freedom = 6, statistic = 8.0, pValue = 0.238
*
* feature6 = {0.0, 9.0, 5.0, 4.0}
* class 0 1 2
* 0.0||1|0|1|
* 9.0||0|1|0|
* 5.0||0|1|0|
* 4.0||0|1|1|
* degrees of freedom = 6, statistic = 5, pValue = 0.54
*
* To verify the results with R, run:
* library(stats)
* x1 <- c(6.0, 0.0, 0.0, 0.0, 8.0, 8.0)
* x2 <- c(7.0, 9.0, 9.0, 9.0, 9.0, 9.0)
* x3 <- c(0.0, 6.0, 3.0, 8.0, 6.0, 6.0)
* x4 <- c(7.0, 0.0, 0.0, 5.0, 5.0, 4.0)
* x5 <- c(6.0, 5.0, 5.0, 6.0, 4.0, 0.0)
* x6 <- c(0.0, 9.0, 5.0, 4.0, 4.0, 0.0)
* y <- c(0.0, 1.0, 1.0, 1.0, 2.0, 2.0)
* chisq.test(x1,y)
* chisq.test(x2,y)
* chisq.test(x3,y)
* chisq.test(x4,y)
* chisq.test(x5,y)
* chisq.test(x6,y)
*/
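The figures in the comment above are easy to reproduce. For example, feature1's table gives statistic = 12 with 4 degrees of freedom; a short stdlib-only Python check (illustrative, not part of the suite) computes the Pearson statistic and, because the degrees of freedom are even, the exact chi-squared tail probability in closed form:

```python
import math

# feature1's contingency table: values {6.0, 0.0, 8.0} against classes {0, 1, 2}.
# Rows are feature values, columns are class labels.
observed = [
    [1, 0, 0],  # 6.0
    [0, 3, 0],  # 0.0
    [0, 0, 2],  # 8.0
]

n = sum(map(sum, observed))
row_tot = [sum(row) for row in observed]
col_tot = [sum(col) for col in zip(*observed)]

# Pearson chi-squared statistic: sum over cells of (O - E)^2 / E,
# with expected count E = row_total * col_total / n.
stat = sum((observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
           / (row_tot[i] * col_tot[j] / n)
           for i in range(len(row_tot)) for j in range(len(col_tot)))
df = (len(row_tot) - 1) * (len(col_tot) - 1)

# Chi-squared survival function for even df:
# P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
p_value = math.exp(-stat / 2) * sum((stat / 2) ** i / math.factorial(i)
                                    for i in range(df // 2))

print(round(stat, 6), df, round(p_value, 3))  # 12.0 4 0.017
```

Swapping in the other five tables reproduces the remaining statistics and p-values the same way.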

dataset = spark.createDataFrame(Seq(
(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0))), Vectors.dense(8.0)),
(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0))), Vectors.dense(0.0)),
(1.0, Vectors.dense(Array(0.0, 9.0, 8.0)), Vectors.dense(0.0)),
(2.0, Vectors.dense(Array(7.0, 9.0, 5.0)), Vectors.dense(7.0)),
(2.0, Vectors.dense(Array(8.0, 7.0, 3.0)), Vectors.dense(8.0))
(0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0))), Vectors.dense(6.0)),
(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0))), Vectors.dense(0.0)),
(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0))), Vectors.dense(0.0)),
(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0)), Vectors.dense(0.0)),
(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0)), Vectors.dense(8.0)),
(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)), Vectors.dense(8.0))
)).toDF("label", "features", "topFeature")
}

@@ -69,19 +124,25 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext

test("Test Chi-Square selector: percentile") {
val selector = new ChiSqSelector()
.setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.34)
.setOutputCol("filtered").setSelectorType("percentile").setPercentile(0.17)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

test("Test Chi-Square selector: fpr") {
val selector = new ChiSqSelector()
.setOutputCol("filtered").setSelectorType("fpr").setFpr(0.2)
.setOutputCol("filtered").setSelectorType("fpr").setFpr(0.02)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

test("Test Chi-Square selector: fdr") {
val selector = new ChiSqSelector()
.setOutputCol("filtered").setSelectorType("fdr").setFdr(0.12)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

test("Test Chi-Square selector: fwe") {
val selector = new ChiSqSelector()
.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.12)
ChiSqSelectorSuite.testSelector(selector, dataset)
}
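The updated thresholds can be cross-checked against the six p-values listed in the dataset comment (0.017, 0.049, 0.193, 0.147, 0.238, 0.54): each setting should keep exactly one feature, the one in the `topFeature` column. A quick arithmetic check in plain Python (illustrative only, not part of the suite):

```python
# P-values of the six features in the toy dataset (from the comment above).
pvals = [0.017, 0.049, 0.193, 0.147, 0.238, 0.54]
n = len(pvals)

fpr_keep = [p for p in pvals if p < 0.02]       # setFpr(0.02)
fwe_keep = [p for p in pvals if p < 0.12 / n]   # setFwe(0.12) scales to 0.02
ranked = sorted(pvals)
fdr_k = max((k for k in range(1, n + 1)         # setFdr(0.12), Benjamini-Hochberg
             if ranked[k - 1] <= k / n * 0.12), default=0)
pct_k = int(0.17 * n)                           # setPercentile(0.17) of 6 features

print(fpr_keep, fwe_keep, fdr_k, pct_k)  # [0.017] [0.017] 1 1
```

Every rule keeps exactly one feature, the one with p = 0.017, matching what `testSelector` expects.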

9 changes: 5 additions & 4 deletions python/pyspark/ml/feature.py
@@ -2629,7 +2629,8 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"""
.. note:: Experimental

Creates a ChiSquared feature selector.
Chi-Squared feature selection, which selects categorical features to use for predicting a
categorical label.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.

@@ -2638,15 +2639,15 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
* `percentile` is similar but chooses a fraction of all features
instead of a fixed number.

* `fpr` chooses all features whose p-value is below a threshold,
* `fpr` chooses all features whose p-values are below a threshold,
thus controlling the false positive rate of selection.

* `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
to choose all features whose false discovery rate is below a threshold.

* `fwe` chooses all features whose p-values is below a threshold,
thus controlling the family-wise error rate of selection.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features
set to 50.
6 changes: 3 additions & 3 deletions python/pyspark/mllib/feature.py
@@ -282,15 +282,15 @@ class ChiSqSelector(object):
* `percentile` is similar but chooses a fraction of all features
instead of a fixed number.

* `fpr` chooses all features whose p-value is below a threshold,
* `fpr` chooses all features whose p-values are below a threshold,
thus controlling the false positive rate of selection.

* `fdr` uses the `Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/
False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
to choose all features whose false discovery rate is below a threshold.

* `fwe` chooses all features whose p-values is below a threshold,
thus controlling the family-wise error rate of selection.
* `fwe` chooses all features whose p-values are below a threshold. The threshold is scaled by
1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is `numTopFeatures`, with the default number of top features
set to 50.