-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) #15212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
2c07179
27aead9
9c7fae3
2e97c55
e141c68
d05d7de
f4a0a14
d51b78b
92530ab
2625208
83a429e
5a7cc2c
aa5f2cc
da6ac35
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -171,11 +171,17 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] { | |
|
|
||
| /** | ||
| * Creates a ChiSquared feature selector. | ||
| * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`. | ||
| * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`, | ||
| * `fdr`, `fwe`. | ||
| * - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. | ||
| * - `percentile` is similar but chooses a fraction of all features instead of a fixed number. | ||
| * - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false | ||
| * positive rate of selection. | ||
| * - `fdr` uses the [Benjamini-Hochberg procedure] | ||
| * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) | ||
| * to choose all features whose false discovery rate is below a threshold. | ||
| * - `fwe` chooses all features whose p-values is below a threshold, | ||
| * thus controlling the family-wise error rate of selection. | ||
| * By default, the selection method is `numTopFeatures`, with the default number of top features | ||
| * set to 50. | ||
| */ | ||
|
|
@@ -184,6 +190,8 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { | |
| var numTopFeatures: Int = 50 | ||
| var percentile: Double = 0.1 | ||
| var fpr: Double = 0.05 | ||
| var fdr: Double = 0.05 | ||
| var fwe: Double = 0.05 | ||
| var selectorType = ChiSqSelector.NumTopFeatures | ||
|
|
||
| /** | ||
|
|
@@ -215,6 +223,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { | |
| this | ||
| } | ||
|
|
||
| @Since("2.2.0") | ||
| def setFdr(value: Double): this.type = { | ||
| require(0.0 <= value && value <= 1.0, "FDR must be in [0,1]") | ||
| fdr = value | ||
| this | ||
| } | ||
|
|
||
| @Since("2.2.0") | ||
| def setFwe(value: Double): this.type = { | ||
| require(0.0 <= value && value <= 1.0, "FWE must be in [0,1]") | ||
| fwe = value | ||
| this | ||
| } | ||
|
|
||
| @Since("2.1.0") | ||
| def setSelectorType(value: String): this.type = { | ||
| require(ChiSqSelector.supportedSelectorTypes.contains(value), | ||
|
|
@@ -245,6 +267,21 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { | |
| case ChiSqSelector.FPR => | ||
| chiSqTestResult | ||
| .filter { case (res, _) => res.pValue < fpr } | ||
| case ChiSqSelector.FDR => | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add docs to clarify
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Irrelevent to this PR, we can eliminate
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It's definitely used -- you have to keep the original index in order to pass them to the model. |
||
| // This uses the Benjamini-Hochberg procedure. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add link to explain |
||
| // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure | ||
| val tempRes = chiSqTestResult | ||
| .sortBy { case (res, _) => res.pValue } | ||
| val maxIndex = tempRes | ||
| .zipWithIndex | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This zipWithIndex is however not correct it seems.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I will validate the results carefully, can compare the results with sklearn.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have added a large size data sample in the test Suite, and updated the Contingency tables in the test Suite comments. The value of degree of freedom, statistic and pValue for each feature is also added. so it is easy to validate the result. SelectFDR in sklearn is not exact. My PR to fix the bug is merged today. |
||
| .filter { case ((res, _), index) => | ||
| res.pValue <= fdr * (index + 1) / chiSqTestResult.length } | ||
| .map { case (_, index) => index } | ||
| .max | ||
| tempRes.take(maxIndex + 1) | ||
| case ChiSqSelector.FWE => | ||
| chiSqTestResult | ||
| .filter { case (res, _) => res.pValue < fwe / chiSqTestResult.length } | ||
| case errorType => | ||
| throw new IllegalStateException(s"Unknown ChiSqSelector Type: $errorType") | ||
| } | ||
|
|
@@ -255,19 +292,22 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable { | |
|
|
||
| private[spark] object ChiSqSelector { | ||
|
|
||
| /** | ||
| * String name for `numTopFeatures` selector type. | ||
| */ | ||
| val NumTopFeatures: String = "numTopFeatures" | ||
| /** String name for `numTopFeatures` selector type. */ | ||
| private[spark] val NumTopFeatures: String = "numTopFeatures" | ||
|
|
||
| /** | ||
| * String name for `percentile` selector type. | ||
| */ | ||
| val Percentile: String = "percentile" | ||
| /** String name for `percentile` selector type. */ | ||
| private[spark] val Percentile: String = "percentile" | ||
|
|
||
| /** String name for `fpr` selector type. */ | ||
| val FPR: String = "fpr" | ||
| private[spark] val FPR: String = "fpr" | ||
|
|
||
| /** String name for `fdr` selector type. */ | ||
| private[spark] val FDR: String = "fdr" | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I know this applies to the existing line above too, but, this comment isn't descriptive. You can spell out what all of these mean if there's javadoc at all here.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about "Selector type name of False Discovery Rate, which chooses all features whose false discovery rate meets some threshold. " |
||
|
|
||
| /** String name for `fwe` selector type. */ | ||
| private[spark] val FWE: String = "fwe" | ||
|
|
||
|
|
||
| /** Set of selector types that ChiSqSelector supports. */ | ||
| val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR) | ||
| val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR, FDR, FWE) | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -79,6 +79,12 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext | |
| ChiSqSelectorSuite.testSelector(selector, dataset) | ||
| } | ||
|
|
||
| test("Test Chi-Square selector: fwe") { | ||
| val selector = new ChiSqSelector() | ||
| .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6) | ||
| ChiSqSelectorSuite.testSelector(selector, dataset) | ||
| } | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a simple test for
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Pinging on @yanboliang 's comment about adding a test for FDR |
||
| test("read/write") { | ||
| def checkModelData(model: ChiSqSelectorModel, model2: ChiSqSelectorModel): Unit = { | ||
| assert(model.selectedFeatures === model2.selectedFeatures) | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doc sounds the same as fpr. Could you state that the threshold is scaled by 1/numFeatures to clarify?
Also fix: "p-values is" -> "p-values are"