
Conversation

@mpjlu

@mpjlu mpjlu commented Sep 23, 2016

What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. FDR in this PR uses the Benjamini-Hochberg procedure: https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries (type I errors) among all the hypotheses when performing multiple hypothesis tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn:
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
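For context on how the two new rules operate, here is a minimal plain-Python sketch (not the Spark implementation; the function names are hypothetical), given per-feature p-values from the chi-squared test:

```python
def select_fdr(p_values, alpha=0.05):
    """Benjamini-Hochberg procedure: control the false discovery rate at alpha.

    Sort p-values ascending, find the largest rank k (1-based) with
    p_(k) <= alpha * k / n, and keep the k features with the smallest p-values.
    """
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / n:
            max_k = rank
    return sorted(order[:max_k])


def select_fwe(p_values, alpha=0.05):
    """Bonferroni-style control of the family-wise error rate:
    keep features whose p-value is below alpha scaled by 1/n."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]


p_values = [0.001, 0.012, 0.02, 0.9]
print(select_fdr(p_values))  # BH keeps the first three features here
print(select_fwe(p_values))  # the stricter Bonferroni rule keeps only two
```

The FDR rule is less conservative than the FWE rule: on the same p-values it typically keeps at least as many features.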

How was this patch tested?

Unit tests will be added soon.


@mpjlu mpjlu changed the title [MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Sep 23, 2016
@SparkQA

SparkQA commented Sep 23, 2016

Test build #65817 has finished for PR 15212 at commit 2c07179.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65972 has finished for PR 15212 at commit 27aead9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // ('scaled' = +Infinity). However in the case that this class also has
    • // 0 probability, the class will not be selected ('scaled' is NaN).
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold", (t: Array[Double]) => t.forall(_ >= 0) && t.count(_ == 0) <= 1)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
    • case class SortOrder(child: Expression, direction: SortDirection, nullOrdering: NullOrdering)
    • trait Offset extends Serializable

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65974 has finished for PR 15212 at commit 9c7fae3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Sep 27, 2016

Hi @srowen @yanboliang, I have updated this PR. Thanks.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65975 has finished for PR 15212 at commit 2e97c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

@mpjlu I made some changes to improve ChiSqSelector performance at #15277. Let's work together to get that in first, and then we can work on this. Thanks!

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66633 has finished for PR 15212 at commit e141c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66634 has finished for PR 15212 at commit d05d7de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2016

Test build #67045 has finished for PR 15212 at commit f4a0a14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Oct 17, 2016

Hi @yanboliang @srowen, these are the last two feature selection methods based on the chi-squared test, similar to the methods in scikit-learn. But there is a bug in SelectFdr in scikit-learn; I have submitted a PR to scikit-learn: scikit-learn/scikit-learn#7490.
Thanks very much.

case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Add docs to clarify that this uses the Benjamini-Hochberg procedure.

case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Irrelevant to this PR, but we can eliminate the zipWithIndex at L235, since no one uses it. Would you mind doing this cleanup in your PR?

Member

It's definitely used -- you have to keep the original index in order to pass them to the model.

.zipWithIndex
.filter { case ((res, _), index) =>
res.pValue <= alpha * (index + 1) / chiSqTestResult.length }
.map { case (_, index) => index}
Contributor

`index}` to `index }`

chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
val tempRDD = chiSqTestResult
Contributor

Use another name, since it's not an RDD.

/**
* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
Contributor

We should keep Only applicable when selectorType = "fpr", "fdr" or "fwe"., since it's not applicable to other selector types such as "kbest" and "percentile".

Author

updated, thanks

* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
* alpha means the highest uncorrected p-value for features to be kept when select type
Contributor

, or the highest ...

Author

updated, thanks

final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.",
final val alpha = new DoubleParam(this, "alpha",
"alpha means the highest p-value for features to be kept when select type is fpr, " +
"alpha means the highest uncorrected p-value for features to be kept when select type " +
Contributor

, or the highest ...

.select("filtered", "preFilteredData").collect().foreach {
case Row(vec1: Vector, vec2: Vector) =>
assert(vec1 ~== vec2 absTol 1e-1)
}
Contributor

Extend this test case to higher-dimensional data, and output different selected features for each selector type as far as possible.


alpha = Param(Params._dummy(), "alpha", "The highest p-value for features to be kept.",
alpha = Param(Params._dummy(), "alpha", "alpha means the highest p-value for features " +
"to be kept when select type is fpr, alpha means the highest uncorrected " +
Contributor

Ditto.

def setAlpha(self, value):
"""
Sets the value of :py:attr:`alpha`.
Only applicable when selectorType = "fpr".
Contributor

Ditto.

* `kbest` chooses the `k` top features according to a chi-squared test.
* `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose false positive rate meets some threshold.
* `fpr` select features based on a false positive rate test.
Member

This has two lines for fpr. The existing text is more descriptive.

Author

thanks, updated

/** @group setParam */
@Since("2.1.0")
def setAlpha(value: Double): this.type = set(alpha, value)
def setAlpha(value: Double): this.type = {
Member

Why this change?

Author

Changed it back, thanks.


val tempRDD = chiSqTestResult
.sortBy { case (res, _) => res.pValue }
val maxIndex = tempRDD
.zipWithIndex
Member

This zipWithIndex, however, does not seem correct.

Author

I will validate the results carefully; I can compare the results with sklearn.

Author

I have added a large data sample in the test suite, and updated the contingency tables in the test suite comments. The degrees of freedom, statistic, and pValue for each feature are also included, so it is easy to validate the result.

SelectFdr in sklearn is not exact. My PR to fix the bug was merged today.
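For reference, the ordering subtlety in the Benjamini-Hochberg rule: it keeps every feature up to the largest qualifying rank, not only the ranks that individually satisfy the bound. A minimal plain-Python sketch (hypothetical names, not the Spark code) of the difference:

```python
def bh_filter_only(p_values, alpha):
    # A tempting but incorrect variant: keep only the ranks that
    # individually satisfy p_(k) <= alpha * k / n.
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    return sorted(i for k, i in enumerate(order, 1) if p_values[i] <= alpha * k / n)


def bh_correct(p_values, alpha):
    # Benjamini-Hochberg: find the largest qualifying rank k,
    # then keep all k features with the smallest p-values.
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    ranks = [k for k, i in enumerate(order, 1) if p_values[i] <= alpha * k / n]
    return sorted(order[:max(ranks)]) if ranks else []


p_values = [0.031, 0.030]  # rank 1 fails its bound (0.025), rank 2 passes (0.05)
# bh_filter_only keeps one feature; bh_correct keeps both
```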

private[spark] val FPR: String = "fpr"

/** String name for `fdr` selector type. */
private[spark] val FDR: String = "fdr"
Member

I know this applies to the existing line above too, but this comment isn't descriptive. You can spell out what all of these mean if there's javadoc at all here.

Author

How about "Selector type name of False Discovery Rate, which chooses all features whose false discovery rate meets some threshold. "

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67249 has finished for PR 15212 at commit d51b78b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67261 has finished for PR 15212 at commit 92530ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Oct 20, 2016
@mpjlu
Author

mpjlu commented Oct 24, 2016

Hi @yanboliang and @srowen, could you please check whether this PR addresses all your comments? Thanks.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68999 has finished for PR 15212 at commit 2625208.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Nov 22, 2016

Hi @yanboliang, @srowen, @jkbradley, I have updated this PR. Thanks.

@yanboliang
Contributor

@mpjlu Sorry for the late response, I just finished QA work for 2.1 and am starting ordinary reviews. Could you resolve the merge conflicts first? I will take a look tomorrow. Thanks.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #70454 has finished for PR 15212 at commit 83a429e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

Only some docs need to be improved, otherwise, looks good. Thanks.

* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

`fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold should be better?

* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

whose p-values is below a threshold, thus controlling the family-wise error rate of selection

* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

Update according to the above suggestion.

@Since("2.1.0")
def getFpr: Double = $(fpr)

/**
Contributor

I think the doc is incorrect even though it's consistent with sklearn; actually we don't compare the fdr value with the p-value directly. I'd prefer to change it to The upper bound of the expected false discovery rate, which is more accurate and easier to understand.

* Only applicable when selectorType = "fdr".
* Default value is 0.05.
* @group param
*/
Contributor

Update version to 2.2.0.

* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
* - `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

Ditto

chiSqTestResult
.filter { case (res, _) => res.pValue < fpr }
case ChiSqSelector.FDR =>
// This uses the Benjamini-Hochberg procedure.
Contributor

* String name for `numTopFeatures` selector type.
*/
/** String name for `numTopFeatures` selector type. */
val NumTopFeatures: String = "numTopFeatures"
Contributor

private[spark]

* String name for `percentile` selector type.
*/
/** String name for `percentile` selector type. */
val Percentile: String = "percentile"
Contributor

private[spark]


test("ChiSqSelector transform test (sparse & dense vector)") {
test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
val labeledDiscreteData = sc.parallelize(
Contributor

Many test functions need labeledDiscreteData; we can refactor it out of the functions and have them share the same dataset instance.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70536 has finished for PR 15212 at commit 5a7cc2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

LGTM except some very minor comments.
cc @srowen for another pass if you have time. Thanks.

* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.

* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.

* positive rate of selection.
* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
Contributor

Remove duplicated whose.

* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2)

test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
Contributor

by numTopFeatures

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

Contributor

Add a simple test for fdr selector.

Member

Pinging on @yanboliang 's comment about adding a test for FDR

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70549 has finished for PR 15212 at commit aa5f2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
}.collect().toSet
assert(filteredData == preFilteredData)
Contributor

Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
Contributor

Organize the following items into a list in the generated API docs; please refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1300

`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

`Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_

This is the Python style of using doc links. You can refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1308

Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
Contributor

Ditto, organize the following items into a list in the generated API docs.

`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

Ditto.

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70635 has finished for PR 15212 at commit da6ac35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

Merged into master, thanks!

@asfgit asfgit closed this in 79ff853 Dec 28, 2016
Member

@jkbradley jkbradley left a comment

@mpjlu @yanboliang Thanks for the PR! I know this is late, but I had a couple of small comments. Could you please send a little follow-up PR against the same JIRA?

* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Member

This doc sounds the same as fpr. Could you state that the threshold is scaled by 1/numFeatures to clarify?

Also fix: "p-values is" -> "p-values are"
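To illustrate the clarification being requested (a plain-Python sketch under assumed inputs, not Spark's code): fwe applies the same p-value comparison as fpr, but with the threshold scaled by 1/numFeatures:

```python
def select_fpr(p_values, threshold):
    # fpr: compare each p-value against the threshold directly
    return [i for i, p in enumerate(p_values) if p < threshold]


def select_fwe(p_values, threshold):
    # fwe: scale the threshold by 1/numFeatures (Bonferroni correction)
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold / n]


p_values = [0.01, 0.04, 0.4]
# with threshold 0.05: fpr keeps two features, fwe keeps only the first
```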


Chi-Squared feature selection, which selects categorical features to use for predicting a
categorical label.
Creates a ChiSquared feature selector.
Member

Please add back in the fact that this selects categorical features for predicting a categorical label. It's useful to state expected input data types.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
…Discovery Rate (FDR) and Family wise error rate (FWE)

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes apache#15212 from mpjlu/fdr_fwe.
@mpjlu
Author

mpjlu commented Dec 29, 2016

Thanks @jkbradley , I will send a follow-up PR for your comments.

@mpjlu
Author

mpjlu commented Dec 29, 2016

Hi @jkbradley @yanboliang, I have created a follow-up PR for this PR: #16434
I have not added an FDR test in the ML suite. The main reason is that the current dataset is not a good case for an FDR test (you can only select 0 features or all 3 features by FDR). Do you think I should write a new dataset, like in MLlib, to test FDR in ML?

@yanboliang
Contributor

@mpjlu Thanks for the follow-up PR. @jkbradley Please feel free to shepherd that PR, since I'm on travel these days. Thanks.

@jkbradley
Member

Will do!

asfgit pushed a commit that referenced this pull request Jan 10, 2017
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up pr for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…Discovery Rate (FDR) and Family wise error rate (FWE)

Closes apache#15212 from mpjlu/fdr_fwe.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017

Closes apache#16434 from mpjlu/fdr_fwe_update.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017

Closes apache#16434 from mpjlu/fdr_fwe_update.
