At this stage, you should have completed the MoVRs installation steps documented in the INSTALL document. Here we explain basic use of MoVRs with the test (toy) data provided in the test directory. For research applications, please see our publication.
Finding motifs in DNA sequences breaks down to a few fundamental concepts:
- define regions of interest (roi), i.e. genomic sequences that may contain specific DNA motifs to be discovered
- define background sequences, i.e. genomic (or "random") sequences not expected to contain the specific DNA motifs
- identify over-represented patterns in roi versus background as candidate (biologically meaningful) motifs
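As a toy illustration of the third concept, the sketch below counts how many sequences in a roi set and in a background set contain a fixed pattern. The sequences, the pattern GATA, and the helper name `fraction_with_pattern` are all made up for illustration. Real motif finders such as HOMER use far more sophisticated models and statistics.

```shell
# Toy over-representation check; NOT how HOMER scores motifs.
# All sequences and the pattern are invented for illustration.

roi_seqs="ACGATAGC
TTGATACC
GGGATATT
ACGTACGT"

bg_seqs="ACGTACGT
TTTTCCCC
GGGATATT
CCCCAAAA"

# Read sequences (one per line) on stdin and print "<hits>/<total>",
# where hits is the number of sequences containing the pattern $1.
fraction_with_pattern() {
  awk -v p="$1" 'index($0, p) { hits++ } END { printf "%d/%d\n", hits, NR }'
}

echo "roi:        $(printf '%s\n' "$roi_seqs" | fraction_with_pattern GATA)"
echo "background: $(printf '%s\n' "$bg_seqs" | fraction_with_pattern GATA)"
```

Here GATA occurs in 3 of 4 roi sequences but in only 1 of 4 background sequences, which is the kind of imbalance a motif finder looks for.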
There are nice programs available to do just that. MoVRs is built around the HOMER software. You can easily adapt MoVRs to work with other programs; after all, MoVRs is just a Linux bash script wrapped around existing motif-finding software, with a few useful additions. The goal of MoVRs is to reduce the set of candidate motifs reported by a genome-wide application of a program like HOMER to a largely non-overlapping and statistically cross-validated set of motifs. In contrast to a one-pass genome-wide application of DNA motif finding, MoVRs implements a multi-step workflow. Justification, details, and research examples are provided in our publication. This document reviews the mechanics of running MoVRs.
To get going, simply type MoVRs on your command line, which should produce the following:
MoVRs
Usage: ./MoVRs <input with either -a, -f, or -i> <genome and background settings> [options]
Examples:
./MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 >& errTEST1
./MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 --minpresence 7 --startfromstep step5 >& errTEST1AGAIN
./MoVRs -a testpeakfile -G hg19 -o TEST2 --size [-60,40] -k 5 -p 10 >& errTEST2
./MoVRs -f testfastafile -b Background/testbackground -S 10 -k 3 -p 9 --outputdir TEST3 >& errTEST3
./MoVRs -i testidfile -r human -G hg19 --size [-200,100] -S 15 -k 4 -p 8 --outputdir TEST4 >& errTEST4
MoVRs will determine candidate motifs in a several-steps workflow. You can specify
partial workflow execution by specifying "stepX" [X = 1, 2, ..., 7] as argument to
--startfromstep, --stopatstep, or --runonlystep.
Step 1 Setting up training and validation sets
Step 2 HOMER de novo motif finding in training sets
Step 3 Motif extraction, filtering, and comparison
Step 4 Generating motif clusters
Step 5 Derivation of consensus motifs
Step 6 MoVRs motif presentation and annotation
Step 7 Find MoVRs consensus motifs in regions of interest
To see more help, type "./MoVRs -h" or "./MoVRs --help".
MoVRs -h
MoVRs --help
Following that hint produces the additional messages:
You must specify input in one of three forms:
1a) -a/-g or -a/-G combination (peak or BED file and genome file or identifier)
1b) -f/-b combination (FASTA target and background files)
1c) -i/-r/-G combination (GeneID file and preprocessed promoter set as well as genome identifier)
Other options default to the specified values if not set.
Details:
1a) peak or BED file input to HOMER:
-a|--annotation <peak|BED file> Input file specifying genomic regions of interest
-b|--background <peak|BED file> (Non-mandatory) file specifying background regions
-g|--genome <path> Path to chromosome files
-G|--Genome <identifier> Identifier of HOMER-preprocessed genome
1b) FASTA file input to HOMER:
-f|--fasta <path> FASTA-formatted input file with regions of interest
-b|--background <path> (Mandatory) FASTA-formatted file with background regions
1c) Gene identifier and promoter input to HOMER:
-i|--geneID <file> List file of gene identifiers
-r|--promoter <file> (Mandatory) Corresponding HOMER-supported promoter set identifier
[Choice: human, mouse, rat, fly, worm, zebrafish, or yeast]
2) Window and motif length and number input to HOMER
-s|--size <string> HOMER size argument (<#> or <[#,#]> or "given") [Default: 200]
-l|--length <string> HOMER motif length argument (<#> or <#>,<#>,...) [Default: 8,10,12]
-S|--nummotifs <#> HOMER argument for the number of motifs of each length to find [Default: 25]
3) MoVRs-specific options:
-k <#> Conduct <#>-fold cross-validation [Default: 10]
-p|--numproc <#> Use <#> processors during execution [Default: 1]; ideally a multiple of the -k argument.
-m|--mmquality <1e-#> Minimum motif quality for motif to be considered [Default: 1e-3]
-t|--ttthreshold <1e-#> Threshold for tomtom motif similarity [Default: 1e-3]
--minpresence <#> Minimal number of training sets in which a MoVRs motif must occur [Default: -k argument minus 1]
-o|--outputdir <path> Put output into directory <path> [Default: ./]
-c|--configfile <path> Configuration file [Default: /home/vbrendel/gitwd/MoVRs/scripts/MoVRs.conf]
4) MoVRs workflow settings; <step> below must be one of (step1, step2, ..., step7)
--startfromstep <string> Starting step; previous steps must have run successfully before.
--stopatstep <string> Last step to execute
--runonlystep <string> Workflow step to execute; previous steps must have run successfully before.
5) Else:
-h|--help Show this usage information
MoVRs takes roi and background in the three flavors supported by HOMER:
- Peak or BED file (specifying the roi in terms of genomic ranges)
- FASTA file (directly supplying the sequences of the roi)
- geneID file (specifying the roi with Gene Identifiers)
Please refer to the extensive and excellent HOMER documentation to review formats and specifics.
MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 >& errTEST1
takes the testpeakfile and TestGenome files as input and finds motifs in the range -60 to +40 relative to the annotated peaks. The program will use 10 processors.
MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 --minpresence 7 --startfromstep step5 >& errTEST1AGAIN
reruns the previous example, but with the relaxed criterion that a motif must occur in only 7 (instead of the default 9) of the training sets generated in the 10-fold cross-validation to be considered a validated candidate motif. The first four steps of the workflow are not rerun.
MoVRs -a testpeakfile -G hg19 -o TEST2 --size [-60,40] -k 5 -p 10 >& errTEST2
uses the HOMER-preprocessed human genome hg19 (option -G) rather than a path to chromosome files. The -k argument specifies 5-fold cross-validation.
MoVRs -f testfastafile -b Background/testbackground -S 10 -k 3 -p 9 --outputdir TEST3 >& errTEST3
takes the specified FASTA-formatted roi and background sequences and restricts [HOMER](http://homer.salk.edu/homer/) to report only the best 10 motifs in each run.
MoVRs -i testidfile -r human -G hg19 --size [-200,100] -S 15 -k 4 -p 8 --outputdir TEST4 >& errTEST4
looks for motifs in the -200 to +100 range in the promoters of the human genes specified in testidfile.
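The -k, -p, and --minpresence settings used in the examples above can be sanity-checked before launching a long run. The helper below is a hypothetical sketch, not part of MoVRs. It derives the default --minpresence (the -k argument minus 1, per the usage message) and reports whether the processor count is a multiple of the number of folds, as the usage message recommends. The range check on minpresence is our own assumption about what makes sense with k training sets.

```shell
# Hypothetical pre-flight check; check_cv_settings is not part of MoVRs.
# Usage: check_cv_settings <k> <numproc> [minpresence]
check_cv_settings() (
  k=$1
  numproc=$2
  minpresence=${3:-$((k - 1))}   # default documented as "-k argument minus 1"

  # Assumption: with k training sets, values outside 1..k make no sense.
  if [ "$minpresence" -lt 1 ] || [ "$minpresence" -gt "$k" ]; then
    echo "minpresence=$minpresence is outside 1..$k" >&2
    return 1
  fi
  if [ $((numproc % k)) -eq 0 ]; then
    note="numproc is a multiple of k"
  else
    note="numproc is NOT a multiple of k"
  fi
  echo "k=$k numproc=$numproc minpresence=$minpresence ($note)"
)

check_cv_settings 10 10      # settings of the TEST1 example (default -k 10)
check_cv_settings 10 10 7    # TEST1 rerun with --minpresence 7
check_cv_settings 3 9        # settings of the TEST3 example
```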
A great way of learning what the MoVRs workflow entails is to run an example in stepwise fashion. Just add the option --runonlystep step1 to your favorite example. That will stop the workflow after the first step (summarized below). Look at the program logfiles and output, follow up on the program documentation, and take a mental snapshot of what this step accomplished. Then replace --runonlystep step1 by --runonlystep step2 and continue in similar fashion until the final step.
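One way to script this stepwise exploration is a loop over the seven steps that pauses after each one. The sketch below reuses the arguments of the TEST1 example. The `DRYRUN` variable, the `run_stepwise` function, and the per-step log file names are illustrative choices, not MoVRs features. With `DRYRUN=1` (the default here) the commands are only printed, so you can check them before running anything.

```shell
# Illustrative helper; run_stepwise and DRYRUN are not part of MoVRs.
DRYRUN=${DRYRUN:-1}

run_stepwise() (
  for step in step1 step2 step3 step4 step5 step6 step7; do
    # Arguments taken from the TEST1 example; adjust to your own data.
    # '[-60,40]' is quoted so the shell cannot treat it as a glob pattern.
    set -- ./MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 \
           --size '[-60,40]' -p 10 --runonlystep "$step"
    if [ "$DRYRUN" = 1 ]; then
      echo "$*"
    else
      # Keep one log per step and stop at the first failure.
      "$@" > "errTEST1.$step" 2>&1 || { echo "$step failed" >&2; return 1; }
      printf '%s done; inspect the output, then press Enter ' "$step"
      read -r _reply
    fi
  done
)

run_stepwise
```

Running the steps in order matters, because --runonlystep requires that all previous steps have completed successfully.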
Step 1: Setting up training and validation sets
This step creates the training and validation sets in the subdirectories tmpTrainingDir and tmpValidationDir of the specified output directory.
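To make the idea of training and validation sets concrete, the sketch below splits a list of regions into k folds: each fold serves once as the validation set, while the remaining lines form the matching training set. This only illustrates k-fold partitioning under assumed names (`make_folds`, `train_<i>.txt`, `valid_<i>.txt`) and an assumed round-robin assignment. It is not the MoVRs implementation, which may, for example, shuffle the input or name its files differently.

```shell
# Illustration of k-fold partitioning; NOT the MoVRs implementation.
# Usage: make_folds <input file> <k> <output dir>
make_folds() (
  input=$1
  k=$2
  outdir=$3
  mkdir -p "$outdir"
  i=1
  while [ "$i" -le "$k" ]; do
    # Line n goes to validation set i when (n - 1) mod k == i - 1,
    # and to training set i otherwise.
    awk -v k="$k" -v i="$i" -v d="$outdir" '
      { if ((NR - 1) % k == i - 1) print > (d "/valid_" i ".txt")
        else                       print > (d "/train_" i ".txt") }
    ' "$input"
    i=$((i + 1))
  done
)

# Toy input: ten made-up region identifiers, one per line.
tmp=$(mktemp -d)
seq 1 10 | sed 's/^/region/' > "$tmp/regions.txt"
make_folds "$tmp/regions.txt" 5 "$tmp/folds"
wc -l "$tmp"/folds/valid_1.txt "$tmp"/folds/train_1.txt
```

With 10 regions and 5 folds, every validation file holds 2 regions and every training file the other 8, and each region is used for validation exactly once.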
Step 2: HOMER de novo motif finding in training sets
This step runs the appropriate HOMER motif finder on each of the training sets. Records of this step are in tmpTrainingDir, and final output is deposited in the output subdirectory tmpMotifDir.
Step 3: Motif extraction, filtering, and comparison
Run in tmpMotifDir, this step processes the motifs produced by the HOMER run. Motifs that pass the quality threshold specified by option -m are compared pairwise using the MEME suite tomtom tool and labeled similar if they pass the similarity threshold specified by option -t. Results are deposited into output subdirectory tmpMotifDir/TOMTOMresults.
Step 4: Generating motif clusters
This step invokes the MoVRs_GetCluster.py script, which generates motif clusters based on the tomtom similarity values obtained in the previous step. Results are deposited into output subdirectory tmpMotifDir/MCLUSTERresults.
Step 5: Derivation of consensus motifs
This step invokes the MoVRs_MotifSetReduce.pl script to generate a consensus motif for each motif cluster obtained in the previous step. The consensus motifs are recorded in files tmpMotifDir/MCLUSTERresults/mcluster*.cmotif.
Step 6: MoVRs motif presentation and annotation
In this step, the mcluster*.cmotif files are further processed to show seqLogos and similarity with known motifs. Results are deposited into output subdirectory MoVRs_OutputDir.
Step 7: Find MoVRs consensus motifs in regions of interest
In the final step, the appropriate HOMER tools are used to tabulate occurrences of the MoVRs consensus motifs in the roi. See files MoVRs_OutputDir/mcluster*.tab.
We are still working on nice summaries of results. Until then, you are on your own ...
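As a small starting point, the sketch below counts the lines in each table written by the final step. It assumes only what is stated above, namely that the tables match MoVRs_OutputDir/mcluster*.tab under the output directory. It makes no assumption about the column layout, so the counts include any header lines. The function name `count_tab_lines` is illustrative and not part of MoVRs.

```shell
# Illustrative helper, not part of MoVRs.
# Usage: count_tab_lines <MoVRs output directory, e.g. TEST1>
count_tab_lines() (
  dir=$1/MoVRs_OutputDir
  found=0
  for f in "$dir"/mcluster*.tab; do
    [ -e "$f" ] || continue          # the pattern matched nothing
    found=1
    # Line counts include any header lines; the column layout is not assumed.
    printf '%s\t%s\n' "$(basename "$f")" "$(wc -l < "$f" | tr -d ' ')"
  done
  [ "$found" -eq 1 ] || { echo "no mcluster*.tab files in $dir" >&2; return 1; }
)
```

After a completed run of the first example, `count_tab_lines TEST1` would print one line per consensus motif table.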