SVR_modelmaker_README.md


Copyright (c) 2014 Joshua L. Schipper

Help Info

optional arguments:

    -h, --help                  show this help message and exit
    -i PBMFile                  The results of a custom PBM experiment with
                                sequences centered by binding site (i.e. using
                                PWM)
    -o OutFilePrefix            Optional, the prefix that all output files will
                                be based on (do not include file extension).
                                See program notes for proper format of this file
    -g, --gridsearch            Flag for running a grid search, if optimal cost
                                and epsilon values are not known
    --seqlength SequenceLength  Change the length of the PBM sequence (Default
                                is 36, new sequence will remain centered
                                according to original 36mer from PBM data)
    --feature FeatureType       Define the type of features, i.e. 2 for 2mers,
                                123 for 1, 2, and 3-mers, etc; default is 3 for
                                3mers
    --extrafiles                Print extra files: including all matrix files,
                                feature definitions (sequence and position),
                                and model sequence files
    -c SVR_cost                 The cost value input for LibSVM. If running a
                                grid search, this should be a string of numbers
                                in quotes, i.e. "0.05 0.1 0.5".
    -p SVR_epsilon              The epsilon value input for LibSVM. If running
                                a grid search, this should be a string of
                                numbers in quotes, i.e. "0.001 0.01 0.1".
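To make the --feature values concrete, here is a minimal sketch of how a value such as "3" or "123" might expand into features. It assumes one feature per (k-mer, start position) pair, which is only a guess based on the "feature definitions (sequence and position)" wording; the function names are hypothetical and are not taken from the program.

```python
from itertools import product


def kmer_sizes(feature_type):
    """Read a --feature value such as "3" or "123" as a list of k-mer sizes."""
    return [int(ch) for ch in str(feature_type)]


def positional_kmer_features(seq_length, feature_type, alphabet="ACGT"):
    """List every (k-mer, start position) pair for a sequence of seq_length.

    Assumption: each pair is one feature. The real program may order or
    encode its features differently.
    """
    features = []
    for k in kmer_sizes(feature_type):
        for start in range(seq_length - k + 1):
            for letters in product(alphabet, repeat=k):
                features.append(("".join(letters), start))
    return features


# Default settings from the help text: 36-mers with 3-mer features.
features = positional_kmer_features(36, "3")
```

Under this assumption the feature count grows quickly with k, which is worth keeping in mind when choosing between "3" and "123".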

General Information and Usage

LibSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) is required.

This program was written for Python 2.7 and tested with LIBSVM version 3.1.

The PBM file contains the intensities from a protein binding microarray (PBM) experiment. This file should contain the following tab-separated columns: Name, ID, Sequence, Orientation1, Orientation2, Best-orientation, Replicate_intensity_difference. Typically, the values in Orientation1, Orientation2, and Best-orientation are natural-log intensities, where Best-orientation is simply the maximum of the two orientations. The program normalizes these values to the range 0 to 1.
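The column layout and the 0-to-1 normalization can be sketched as follows. This is an illustration only: it assumes min-max scaling and a header row, neither of which is confirmed here, and the sample rows are made-up values rather than real PBM data.

```python
def read_best_orientation(lines, has_header=True):
    """Return (sequence, best-orientation) pairs from tab-separated PBM rows.

    Column order follows the README: Name, ID, Sequence, Orientation1,
    Orientation2, Best-orientation, Replicate_intensity_difference.
    The header row is an assumption.
    """
    rows = lines[1:] if has_header else lines
    pairs = []
    for line in rows:
        fields = line.rstrip("\n").split("\t")
        pairs.append((fields[2], float(fields[5])))
    return pairs


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    span = float(hi - lo)
    return [(v - lo) / span for v in values]


# Made-up example rows, not real measurements.
example = [
    "Name\tID\tSequence\tOrientation1\tOrientation2\tBest-orientation\tReplicate_intensity_difference",
    "probe1\t1\tACGTACGT\t7.5\t8.0\t8.0\t0.1",
    "probe2\t2\tTTGCAAGC\t10.0\t9.0\t10.0\t0.2",
    "probe3\t3\tGGATCCAA\t9.0\t8.5\t9.0\t0.1",
]
pairs = read_best_orientation(example)
normalized = min_max_normalize([score for _, score in pairs])
```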

A grid search can be performed using the --gridsearch flag if the optimum epsilon and cost values are not already known. For a grid search, the epsilon and cost values should each be given as a set of numbers, separated by spaces and enclosed in quotes (e.g. -c "0.001 0.01 0.1"). LIBSVM is then run with 5-fold cross validation on the full set of sequences.
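The grid described above amounts to trying every pairing of the supplied cost and epsilon values. The sketch below parses the two quoted strings and builds the matching LIBSVM option strings. The flags used (-s 3 for epsilon-SVR, -c, -p, and -v for n-fold cross validation) are standard svm-train options, but how the program actually invokes LIBSVM, and which kernel it selects, is not stated here, so treat this as one plausible reading.

```python
from itertools import product


def parse_values(text):
    """Split a quoted string such as "0.05 0.1 0.5" into floats."""
    return [float(token) for token in text.split()]


def grid_options(cost_text, epsilon_text, folds=5):
    """Build one svm-train option string per (cost, epsilon) combination."""
    options = []
    for cost, eps in product(parse_values(cost_text), parse_values(epsilon_text)):
        # -s 3: epsilon-SVR, -c: cost, -p: epsilon, -v: cross-validation folds
        options.append("-s 3 -c {0} -p {1} -v {2}".format(cost, eps, folds))
    return options


# The example values from the help text give a 3 x 3 grid.
grid = grid_options("0.05 0.1 0.5", "0.001 0.01 0.1")
```

With -v set, svm-train reports cross-validation statistics rather than writing a model file, which fits a search whose only goal is to compare parameter pairs.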

If a grid search is not being performed, the full set of sequences is split into 5 sets (this number can be altered). Four sets are combined for training while the fifth is held back for testing the model, and this is repeated so that each set is held back once. The model generated by the best of the 5 runs is saved, while the results of the other runs are printed.
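The hold-one-set-out scheme described above can be sketched as below. The interleaved assignment of sequences to sets is an assumption made for brevity; the program may shuffle or partition the sequences differently, and the function name is hypothetical.

```python
def k_fold_splits(items, k=5):
    """Yield (train, test) pairs so that each of k sets is held back once.

    Assumption: items are dealt into sets round-robin (item i goes to
    set i % k).
    """
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


# Ten placeholder items stand in for PBM sequences.
splits = list(k_fold_splits(list(range(10)), k=5))
```

Every item appears in exactly one test set, so each of the 5 models is evaluated on sequences it was not trained on.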

The file ending in .model can then be used by LIBSVM to predict the binding scores for a set of sequences.
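As a rough sketch of that prediction step: LIBSVM's svm-predict tool takes a feature file, a model file, and an output file, and the feature file uses the sparse "label index:value" format with ascending indices. The helpers below only build such a line and the command; the file names are hypothetical, and the feature indices must match the ones used when the model was trained (the feature definitions written by --extrafiles should help with that mapping).

```python
def to_libsvm_line(active_indices, label=0):
    """Format one sequence as a LIBSVM sparse line with binary features.

    active_indices are 1-based feature indices that are "on" for the
    sequence. The label is a placeholder when only predictions are wanted.
    """
    pairs = " ".join("{0}:1".format(i) for i in sorted(active_indices))
    return "{0} {1}".format(label, pairs)


def predict_command(feature_file, model_file, output_file):
    """Argument list for: svm-predict test_file model_file output_file."""
    return ["svm-predict", feature_file, model_file, output_file]


line = to_libsvm_line([7, 3, 12])
# Hypothetical file names, for illustration only.
command = predict_command("new_sequences.libsvm", "myprefix.model", "predictions.txt")
```

The command list can be passed to subprocess.call once svm-predict is on the PATH; the predictions are written one per line to the output file.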