Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make the covariate calculation part of RF #193

Closed
BauerLab opened this issue Feb 22, 2021 · 3 comments · Fixed by #209
Closed

Make the covariate calculation part of RF #193

BauerLab opened this issue Feb 22, 2021 · 3 comments · Fixed by #209
Assignees

Comments

@BauerLab
Copy link
Contributor

BauerLab commented Feb 22, 2021

Implement covariates in vs hail interface function varspark.hail.random_forest_model with the interface analogous to hail's logistic_regression_rows() methos (see: https://hail.is/docs/0.2/methods/stats.html#hail.methods.logistic_regression_rows)

Currently random_forest_model takes the covariates argument but it is ignored.

The argument should be a list of expression of type float64 (same as logistic_regression_rows) and these expressions should be evaluated and included as continuous variables for random forest model building along the ordered factor genotypic variables.

Because we want to keep the variant importance output table as is (that is indexed by locus and list of alleles) we cannot include the importances of covariates in it. Instead we should a method covariates_importance() on RandomForestModel that returns a table with the following schema:

  • 'covariate': str - the name of the covariate (as in the input expression)
  • 'importance':float64 - the importance
  • 'splitCount':int64 - the split count

indexed with the 'covariate'.

So for example we can have a phenotype file in csv format like this (this is the modified hipster_lableles.txt):

samples,score,label, age, PC0,PC1, PC2
HG00096,6,0,64,0.332,0.444,0.45
HG00097,15.5,1,32,0.444,0.334,0.128 
HG00099,8,0,17,0.983,0.344,0.4343
...

Then the code with the use of covariates age, PC0 and PC1 should look like this:

    data = hl.import_vcf(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster.vcf.bgz'))
    labels = hl.import_table(os.path.join(PROJECT_DIR, 'data/hipsterIndex/hipster_labels.txt'),
                             delimiter=',',
                             types=dict(label='float64', score='float64', age='float64', PC0='float64',
                                 PC1='float64',, PC2='float64'
                             )).key_by('samples')

    mt = data.annotate_cols(hipster=labels[data.s])
    print(mt.count())

    with vshl.random_forest_model(y=mt.hipster.label,
                                  x=mt.GT.n_alt_alleles(), 
                                  covariates = [mt.hipster.age, mt.hipster.PC0, mt.hipster.PC1],
                                  seed=13, mtry_fraction=0.05,
                                  min_node_size=5, max_depth=10) as rf_model:
        rf_model.fit_trees(100, 50)

        ...
        rf_model.covriate_importance().show() # display the covariate importance table

Output:

covariate importance splitCount
age         10.0          200
PC0         1.0             4
...

Things to consider:

  • in the first instance all covariates can be treated as continuous variables but in the future we could try to use more fine-grained mapping of the hail schema type to VariantSpark feature types which would for example allow to include some covariates as OrderedFactor or Factor(not implemented yet) variables.

Implementation notes

The examples in 'python` directory and possibly a notebook example should be created to demonstrate this new functionality.

The python example can be based on the examples/local_run-importance-ch22_with_pheno.sh. The transposed version of the data/chr22_1000_pheno-wide.csv may need to be created to support this (that should also include the classification response variable).

For the notebook datasets a more biologically relevant dataset should be used that possibly includes principal component analysis factors. Maybe hipster index dataset can be adapted for this (maybe with PC factors or some other random covariates that do not need to associated with the response)

@piotrszul
Copy link
Collaborator

Hi @lm-fsng what do you think about the spec above. In particular how would you see getting the importances of the covariates back (and is this even necessary)

@piotrszul
Copy link
Collaborator

Hi @lm-fsng I have included the changes we discussed. Please confirm that you are happy with the current spec.
And also please follow up with Rob on whether the covariates need to be included in p-value caluclation.

@lm-fsng
Copy link
Collaborator

lm-fsng commented Feb 23, 2022

Hi @piotrszul and @rocreguant , just spoke to Rob about the covariates and the p-value calculation. He said that the continuous covariates would be systematically biased to be more important, which would 'push' the genotypes down on the variable importance scale. This would result in less significant SNPs if we include the covariates in the p-value calculation. The workaround that is to ignore the covariates and just use the variable importance of the SNPs in the p-value method.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants