Currently random_forest_model takes the covariates argument, but it is ignored.
The argument should be a list of expressions of type float64 (the same as in logistic_regression_rows). These expressions should be evaluated and included as continuous variables in random forest model building, alongside the ordered factor genotypic variables.
Because we want to keep the variant importance output table as is (that is, indexed by locus and list of alleles), we cannot include the importances of covariates in it. Instead we should add a method covariates_importance() on RandomForestModel that returns a table with the following schema:
'covariate': str - the name of the covariate (as in the input expression)
'importance': float64 - the importance
'splitCount': int64 - the split count
The table is indexed by 'covariate'.
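The schema above can be sketched in plain Python, without Hail, as a row type plus an index keyed by covariate name. This is only an illustration of the intended shape: the names CovariateImportance and index_by_covariate are hypothetical and are not part of VariantSpark or Hail, and the two rows reuse the example values from this issue.

```python
from typing import Dict, Iterable, NamedTuple


class CovariateImportance(NamedTuple):
    """One row of the proposed covariate importance table (hypothetical type)."""
    covariate: str     # name of the covariate, as in the input expression
    importance: float  # the importance (float64 in the proposed schema)
    splitCount: int    # the split count (int64 in the proposed schema)


def index_by_covariate(rows: Iterable[CovariateImportance]) -> Dict[str, CovariateImportance]:
    """Key the rows by covariate name, rejecting duplicate keys."""
    table: Dict[str, CovariateImportance] = {}
    for row in rows:
        if row.covariate in table:
            raise ValueError(f"duplicate covariate: {row.covariate}")
        table[row.covariate] = row
    return table


# Example rows taken from the sample output in this issue.
rows = [
    CovariateImportance("age", 10.0, 200),
    CovariateImportance("PC0", 1.0, 4),
]
table = index_by_covariate(rows)
```

In the real implementation the result would be a Hail table keyed by 'covariate'. The dictionary here only stands in for that keying.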
So, for example, given a phenotype file in csv format (the modified hipster_labels.txt), the code using the covariates age, PC0 and PC1 should produce output like this:
covariate  importance  splitCount
age        10.0        200
PC0        1.0         4
...
Things to consider:
In the first instance all covariates can be treated as continuous variables. In the future we could try a more fine-grained mapping from the hail schema type to VariantSpark feature types, which would, for example, allow some covariates to be included as OrderedFactor or Factor (not implemented yet) variables.
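One possible shape for such a mapping is sketched below. Everything in it is an assumption for illustration: the function name, the string labels for the feature types, and the choice of which hail type maps to which feature type are hypothetical and would need to be decided as part of the design.

```python
# Hypothetical labels for VariantSpark feature types (not actual API constants).
CONTINUOUS = "Continuous"
ORDERED_FACTOR = "OrderedFactor"
FACTOR = "Factor"  # not implemented yet, per the issue

# Assumed fine-grained mapping from hail type names to feature types.
FINE_GRAINED_MAPPING = {
    "float32": CONTINUOUS,
    "float64": CONTINUOUS,
    "int32": ORDERED_FACTOR,
    "int64": ORDERED_FACTOR,
    "bool": ORDERED_FACTOR,
    "str": FACTOR,
}


def feature_type_for(hail_type: str, fine_grained: bool = False) -> str:
    """Pick a feature type for a covariate given its hail type name.

    With fine_grained=False (the first-instance behaviour described above),
    every covariate is treated as continuous.
    """
    if not fine_grained:
        return CONTINUOUS
    try:
        return FINE_GRAINED_MAPPING[hail_type]
    except KeyError:
        raise ValueError(f"unsupported covariate type: {hail_type}") from None
```

Keeping the default at "everything is continuous" means the first version can ship without the mapping, and the fine-grained path can be switched on later without changing callers.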
Implementation notes
Examples in the 'python' directory, and possibly a notebook example, should be created to demonstrate this new functionality.
The python example can be based on examples/local_run-importance-ch22_with_pheno.sh. A transposed version of data/chr22_1000_pheno-wide.csv may need to be created to support this (it should also include the classification response variable).
For the notebook, a more biologically relevant dataset should be used, possibly one that includes principal component analysis factors. Maybe the hipster index dataset can be adapted for this (maybe with PC factors or some other random covariates that do not need to be associated with the response).
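The transposed phenotype file mentioned above could be produced with a few lines of standard-library Python. This is a minimal sketch under an assumption: that the wide file is a rectangular CSV with one row per variable and one column per sample. The actual layout of data/chr22_1000_pheno-wide.csv is not shown in this issue, and the sample data below is made up.

```python
import csv
import io


def transpose_csv(text: str) -> str:
    """Swap rows and columns of a rectangular CSV given as a string."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("CSV is not rectangular; cannot transpose")
    out = io.StringIO()
    # zip(*rows) turns each input column into an output row.
    csv.writer(out, lineterminator="\n").writerows(zip(*rows))
    return out.getvalue()


# Made-up wide layout: variables in rows, samples in columns.
wide = "sample,S1,S2\nage,30,40\nlabel,0,1\n"
long_form = transpose_csv(wide)
```

The result has one row per sample, with the covariates and the classification response as columns, which is the shape a per-sample phenotype import expects.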
Hi @lm-fsng, what do you think about the spec above? In particular, how would you see getting the importances of the covariates back (and is this even necessary)?
Hi @lm-fsng, I have included the changes we discussed. Please confirm that you are happy with the current spec.
And also please follow up with Rob on whether the covariates need to be included in the p-value calculation.
Hi @piotrszul and @rocreguant, I just spoke to Rob about the covariates and the p-value calculation. He said that the continuous covariates would be systematically biased towards higher importance, which would 'push' the genotypes down on the variable importance scale. This would result in less significant SNPs if we included the covariates in the p-value calculation. The workaround is to ignore the covariates and use only the variable importance of the SNPs in the p-value method.
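The workaround amounts to filtering the covariates out before the importances reach the p-value step. A minimal sketch of that filter, assuming the importances are available as a mapping from feature name to importance; the function name, the mapping layout, and the SNP identifiers and their values are hypothetical, and the p-value computation itself is not shown:

```python
from typing import Dict, Iterable


def drop_covariates(importances: Dict[str, float],
                    covariate_names: Iterable[str]) -> Dict[str, float]:
    """Return only the SNP importances, leaving out any named covariate."""
    excluded = set(covariate_names)
    return {name: value for name, value in importances.items()
            if name not in excluded}


# Illustrative values only: SNP keys and their importances are made up.
combined = {
    "22:16050408:T:C": 0.8,
    "22:16050612:C:G": 0.3,
    "age": 10.0,
    "PC0": 1.0,
}
snp_only = drop_covariates(combined, ["age", "PC0", "PC1"])
```

Only snp_only would then be passed to the p-value method, so the larger covariate importances cannot shift the distribution the SNPs are compared against.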
Implement covariates in the VariantSpark Hail interface function varspark.hail.random_forest_model, with an interface analogous to Hail's logistic_regression_rows() method (see: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows).