
SVFX: Using Random Forests to Prioritize Structural Variants

SVFX is a machine learning-based tool that assigns pathogenicity scores to large deletions and duplications.

Setup

Install conda by following the official conda installation instructions.

On macOS, you can create and activate the provided environment with conda env create -f SVFXenv.yml && conda activate SVFXenv

Otherwise, run conda env create -f SVFXenv_all_platforms.yml && conda activate SVFXenv

The SVFX pipeline requires Python 3.6+.

Configuring the pipeline

The pipeline is built using Snakemake; to configure elements of the pipeline, modify the config.yaml file in the root directory. An illustrative sketch of a complete config.yaml appears after the list of configurable elements below.

Configurable elements

run_id: A string to identify this run; this string will be prepended to the files generated by this run so that multiple runs can be executed without cleanup.

coordinates: control and disease should have the file paths for the control and disease coordinates, respectively.

bigwig_files: A YAML list (hyphen-prefixed entries) of paths to BigWig files to be used in feature computation.

overlap_coordinate_files: A YAML list (hyphen-prefixed entries) of paths to coordinate files for computing overlap as a feature (e.g. coordinates of certain genes).

normalized: Boolean flag for whether to length-normalize the feature values.

variant_type: What kind of variants are in the input files (either DEL or DUP).

ref_genome: Index file (*.fai) for a reference genome.

randomized_num: If normalized is True, the number of pseudo-matrices of variants to normalize with.

output_matrix: control and disease should have the root output names for the control feature matrix and the disease feature matrix, respectively (these will be combined into a file with the root name combined_matrix).

columns_to_remove: File with indices for the columns in the feature matrix that should not be considered in the model (e.g. the indices of the feature coordinates). The default file should be fine for the matrices generated from the scripts we provide.

class_label_index: Column with the class label. Again, the default value should be fine for matrices generated from our scripts.

num_trees: Number of trees in the random forest.

max_depth: The maximum depth of a tree in the random forest.

min_samples_split: The minimum number of samples to split an internal node in a tree in the random forest.
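
As a rough illustration, a complete config.yaml might look like the sketch below. The field names follow the list above; the file paths and numeric values are placeholders, and the exact nesting (e.g. under coordinates and output_matrix) is an assumption rather than a guaranteed match to the repository's shipped config.yaml.

    # Illustrative sketch only -- paths, values, and nesting are placeholders.
    run_id: "example_run"

    coordinates:
      control: "data/control_coordinates.txt"   # placeholder path
      disease: "data/disease_coordinates.txt"   # placeholder path

    bigwig_files:
      - "data/gcPercentage.bw"

    overlap_coordinate_files:
      - "data/gc19_pc.prom.nr.bed"
      - "data/gc19_pc.cds.nr.bed"

    normalized: True
    variant_type: "DEL"
    ref_genome: "data/hg19.fa.fai"
    randomized_num: 1

    output_matrix:
      control: "control_matrix"   # root output name, placeholder
      disease: "disease_matrix"   # root output name, placeholder

    columns_to_remove: "remove_indices.txt"
    class_label_index: 0

    num_trees: 150
    max_depth: 10          # placeholder value
    min_samples_split: 2   # placeholder value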

Running the entire pipeline

Simply run snakemake --cores=8 from the root directory.

Running parts of the pipeline

To just run a part of the pipeline, modify config.yaml and Snakefile with the configuration and input/output you want, then run snakemake --cores=8 -R --until (rule name).

For example, if you have a list of coordinates and just want to generate the feature matrix for it (which is useful if you want to use a preexisting model to score additional variants), you can modify Snakefile to set the control or disease input to your file (for the other input, you can just put an arbitrary coordinates file, then delete the output matrix corresponding to that input). You can then run snakemake --cores=8 -R --until generate_feature_matrix, and the matrix will be generated.
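
Sketched as a shell session (the output file name below is a placeholder; the rule name generate_feature_matrix comes from the Snakefile as described above):

    # After pointing the disease (or control) input in Snakefile at your coordinates file:
    snakemake --cores=8 -R --until generate_feature_matrix

    # The feature matrices are written with the run_id prefix from config.yaml;
    # if you used an arbitrary coordinates file for the other input, delete its matrix, e.g.:
    rm example_run_control_matrix.tsv   # placeholder file name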

Using procedures outside of the pipeline

Usage of the individual programs in the pipeline is explained below.

Necessary Python package installations:

pip3 install scikit-allel
pip3 install pyBigWig
pip3 install argparse
pip3 install sklearn

generate_feature_matrix.py

Usage: python3 src/generate_feature_matrix.py
  -c [SV files in tab-separated coordinate format (each line should start with chr)]
  -b [Feature files in BigWig format]
  -g [Coordinates for interval overlap-based features]
  -o [File root for output]
  -f [Present if class label 1, False otherwise]
  -t [SV type (e.g. DEL, DUP, INV)]
  -z [Present if matrix should be Z-score normalized (length feature will not be normalized)]
  -r [Number of shuffled coordinates to generate for the randomization process]
  -rg [Reference genome index file - necessary for the -r flag]
  -l [If present, length will be included as a feature (and will be ignored during normalization)]

Sample command (run from the root directory): python3 src/generate_feature_matrix.py -c data/sample_input_SV.txt -b data/gcPercentage.bw -g data/gc19_pc.prom.nr.bed data/gc19_pc.3utr.nr.bed data/gc19_pc.5utr.nr.bed data/gc19_pc.cds.nr.bed data/gc19_pc.ss.nr.bed data/sensitive.nc.bed data/ultra.conserved.hg19.bed data/wgEncodeBroadHmmGm12878HMM.Heterochrom.bed data/H1-ESC_Dixon2015-raw_TADs.bed -o sample_output_matrix -t DEL -z -r 1 -rg data/hg19.fa.fai

Given an input list of structural variants, this script generates a matrix of feature values, calculated from the given BigWig and coordinate-overlap files, for input to a model.
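
For reference, the -c input is a tab-separated coordinate file in which each line starts with chr. A minimal illustrative example follows; only a chromosome/start/end layout is assumed here, and real inputs may carry additional columns.

    chr1	1000000	1050000
    chr2	2500000	2620000
    chrX	500000	540000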

rf_model.py

Usage: python3 src/rf_model.py
  -i [Input feature matrix]
  -d [File of indices of features to ignore (one per line)]
  -t [Index of the class label in the feature matrix]
  -n [Number of trees in the forest]
  -m [Maximum depth of a tree in the forest]
  -s [Minimum number of samples required to split an internal node]
  -c [Number of pathogenic SVs in the matrix]
  -l [Total number of SVs in the matrix]
  -o [Output root filename]

Sample command: python3 src/rf_model.py -i input_del_matrix.ZscoreNormalized.training.tsv -d remove_indices.txt -t 0 -n 150 -c 3613 -l 7225 -o output_model_

Given an input feature matrix, trains a set of 10 random forest classifiers on disjoint tenths of the dataset, then predicts class labels for each member of the dataset with the models that did not train on them. Saves the created models, as a Python list of trained sklearn models, to [file root]_ten_models.pkl. Saves the indices used to split the diseased data into tenths in [file root]_disease_indices.pkl, and those used to split the control data in [file root]_kg_indices.pkl.

Predictions on the training data (from the models that did not train on each specific SV) are written in dictionary format to [file root]_predictions.pkl. Finally, AU(ROC/PRC) values for these predictions are calculated and written to [file root]_ROC_PRC.txt.
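
If you want to inspect these pickled outputs yourself, a minimal sketch is shown below. It assumes only that the files can be read with joblib (see the note about installing joblib under load_model.py); the file names are placeholders based on the [file root] naming above, and the internal structure of the prediction dictionary is not assumed beyond what is stated here.

    import joblib  # pip3 install joblib

    # List of the 10 trained sklearn random forest models ([file root]_ten_models.pkl).
    models = joblib.load("output_model_ten_models.pkl")  # placeholder file name
    print(type(models), len(models))

    # Per-SV predictions from the models that did not train on each SV
    # ([file root]_predictions.pkl); stored in dictionary format.
    predictions = joblib.load("output_model_predictions.pkl")  # placeholder file name
    print(type(predictions))
    print(list(predictions)[:5])  # peek at a few keys before relying on the structure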

Note: The input matrix must be formatted so that all control-group rows follow all non-control rows (in other words, all rows with label 0 follow all rows with label 1). The easiest way to do this is to use generate_feature_matrix.py to generate the 0-label and 1-label matrices separately, then append the 0-label matrix (without its header row) to the 1-label matrix.
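
A minimal way to build such a combined matrix from the two separate matrices (the file names here are placeholders for your generate_feature_matrix.py outputs) is:

    # Keep the 1-label (disease) matrix, header included, then append the
    # 0-label (control) matrix without its header row, so that all label-0
    # rows follow all label-1 rows.
    cp disease_matrix.tsv combined_matrix.tsv
    tail -n +2 control_matrix.tsv >> combined_matrix.tsv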

load_model.py

Tests the model on a matrix. Note that the matrix must have the same format as the matrices used to train the model (the same features in the same order as the training data, with no additional features). To generate a matrix from a set of coordinates, see Running parts of the pipeline above.

You will need to pip3 install joblib before this can be run. If any dependencies are still missing, install them with pip3.

Usage: python3 src/load_model.py
  -i [Input matrix]
  -d [File of columns to ignore for model training; the same as columns_to_remove in the config file. remove_inds.txt should be fine for matrices generated from our scripts.]
  -m [The .pkl file with all 10 models; this is output by the Snakemake pipeline]
  -t [Index of the class label in the matrix - use 0 if the matrix was generated using our code]
  -o [Filename to write the predictions to]
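
In the same spirit as the sample commands above, a hypothetical invocation (the matrix, model, and output file names are placeholders) might be:

    python3 src/load_model.py -i new_variants_matrix.tsv -d remove_inds.txt -m output_model_ten_models.pkl -t 0 -o new_variants_predictions.txt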

roc_prc_gen.py

Usage: python3 roc_prc_gen.py [input file] [output file]

Given a tab-separated, two-column input file of predicted scores and true labels (with a header row, as in the following example):

Predicted	True
0.755	1
...
0.211	0

Outputs AU(ROC/PRC) information, along with a plot of the ROC, saved to the specified output file.
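
As a rough illustration of the reported metrics (a sketch using scikit-learn, not the script's actual implementation; the input file name is a placeholder), the AUROC and AUPRC for such a file can be computed as follows:

    import csv

    from sklearn.metrics import average_precision_score, roc_auc_score

    scores, labels = [], []
    with open("predictions.tsv") as f:  # placeholder name for the two-column input file
        reader = csv.reader(f, delimiter="\t")
        next(reader)  # skip the "Predicted / True" header row
        for row in reader:
            if not row:
                continue
            scores.append(float(row[0]))  # predicted score
            labels.append(int(row[1]))    # true label (0 or 1)

    print("AUROC:", roc_auc_score(labels, scores))
    print("AUPRC:", average_precision_score(labels, scores))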
