MoVRs HOWTO - examples for how to use the software

Preparation

At this stage, you should have completed the MoVRs installation steps documented in the INSTALL document. Here we explain basic use of MoVRs with the test (toy) data provided in the test directory. For research applications, please see our publication.

Overview

Finding motifs in DNA sequences comes down to a few fundamental concepts:

  • define regions of interest (roi), i.e. genomic sequences that may contain specific DNA motifs to be discovered
  • define background sequences, i.e. genomic (or "random") sequences not expected to contain the specific DNA motifs
  • identify over-represented patterns in roi versus background as candidate (biologically meaningful) motifs
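As a rough illustration of what "over-represented in roi versus background" can mean, the following sketch scores one exact-match pattern with a hypergeometric tail probability. It is a toy under stated assumptions, not HOMER's or MoVRs's implementation: the function names, the exact-string matching, and the example sequences are all invented for illustration.

```python
from math import comb

def count_with_motif(seqs, motif):
    """Number of sequences that contain the exact string `motif`."""
    return sum(motif in s for s in seqs)

def hypergeom_enrichment_p(roi, background, motif):
    """Probability of seeing at least the observed number of roi hits
    when len(roi) sequences are drawn at random from roi + background."""
    n_roi = len(roi)
    total = n_roi + len(background)
    hits_roi = count_with_motif(roi, motif)
    hits_all = hits_roi + count_with_motif(background, motif)
    upper = min(n_roi, hits_all)
    tail = sum(
        comb(hits_all, x) * comb(total - hits_all, n_roi - x)
        for x in range(hits_roi, upper + 1)
    )
    return tail / comb(total, n_roi)

# Hypothetical toy sequences (not part of the MoVRs test data).
roi = ["ACGTATAAAG", "TTTATAAACC", "GGTATAAATT", "CCCCGGGGAA"]
background = ["ACGTACGTAC", "GGGGCCCCAA", "TTGGCCAATT", "CATATAAAGG"]
p = hypergeom_enrichment_p(roi, background, "TATAAA")
print(f"enrichment p-value: {p:.4f}")
```

Smaller values indicate stronger over-representation. Real motif finders work with degenerate motifs and far larger sequence sets, so this only conveys the idea.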

There are nice programs available to do just that. MoVRs is built around the HOMER software. You can easily adapt MoVRs to work with other programs; after all, MoVRs is just a Linux bash script wrapped around existing motif-finding software, with a few useful additions. The goal of MoVRs is to reduce the set of candidate motifs reported by a genome-wide application of a program like HOMER to a largely non-overlapping and statistically cross-validated set of motifs. Compared to a one-pass genome-wide application of DNA motif-finding, MoVRs implements a multi-step workflow. Justification, details, and research examples are provided in our publication. This document reviews the mechanics of running MoVRs.

Input

To get going, simply type MoVRs on your command line, which should result in the following:

MoVRs

    Usage: ./MoVRs <input with either -a, -f, or -i> <genome and background settings> [options]

    Examples:
      ./MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 >& errTEST1
      ./MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 --minpresence 7 --startfromstep step5 >& errTEST1AGAIN
      ./MoVRs -a testpeakfile -G hg19 -o TEST2 --size [-60,40] -k 5 -p 10 >& errTEST2
      ./MoVRs -f testfastafile -b Background/testbackground -S 10 -k 3 -p 9 --outputdir TEST3 >& errTEST3
      ./MoVRs -i testidfile -r human -G hg19 --size [-200,100] -S 15 -k 4 -p 8 --outputdir TEST4 >& errTEST4

    MoVRs will determine candidate motifs in a several-steps workflow.  You can specify
    partial workflow execution by specifying "stepX" [X = 1, 2, ..., 7] as argument to
    --startfromstep, --stopatstep, or --runonlystep.

    Step 1  Setting up training and validation sets
    Step 2  HOMER de novo motif finding in training sets
    Step 3  Motif extraction, filtering, and comparison
    Step 4  Generating motif clusters
    Step 5  Derivation of consensus motifs
    Step 6  MoVRs motif presentation and annotation
    Step 7  Find MoVRs consensus motifs in regions of interest

    To see more help, type "./MoVRs -h" or "./MoVRs --help".


MoVRs -h
MoVRs --help

Following the hint to see more help produces these additional messages:

      You must specify input in one of three forms:
        1a) -a/-g or -a/-G combination    (peak or BED file and genome file or identifier)
        1b) -f/-b combination             (FASTA target and background files)
        1c) -i/-r/-G combination          (GeneID file and preprocessed promoter set as well as genome identifier)

      Other options default to the specified values if not set.
      Details:

      1a) peak or BED file input to HOMER:
        -a|--annotation <peak|BED file>   Input file specifying genomic regions of interest
        -b|--background <peak|BED file>   (Non-mandatory) file specifying background regions
        -g|--genome <path>                Path to chromosome files
        -G|--Genome <identifier>          Identifier of HOMER-preprocessed genome
      1b) FASTA file input to HOMER:
        -f|--fasta <path>                 FASTA-formatted input file with regions of interest
        -b|--background <path>            (Mandatory) FASTA-formatted file with background regions
      1c) Gene identifier and promoter input to HOMER:
        -i|--geneID <file>                List file of gene identifiers
        -r|--promoter <file>              (Mandatory) Corresponding HOMER-supported promoter set identifier
                                            [Choice: human, mouse, rat, fly, worm, zebrafish, or yeast]
      2) Window and motif length and number input to HOMER
        -s|--size <string>                HOMER size argument (<#> or <[#,#]> or "given") [Default: 200]
        -l|--length <string>              HOMER motif length argument (<#> or <#>,<#>,...) [Default: 8,10,12]
        -S|--nummotifs <#>                HOMER argument for the number of motifs of each length to find) [Default: 25]
      3) MoVRs-specific options:
        -k <#>                            Conduct <#>-fold crossv-validation [Default: 10]
        -p|--numproc <#>                  Use <#> processors during execution [Default: 1]; ideally a multipe of the -k argument.
        -m|--mmquality <1e-#>             Minimum motif quality for motif to be considered [Default: 1e-3]
        -t|--ttthreshold <1e-#>           Threshold for tomtom motif similarity [Default: 1e-3]
        --minpresence <#>                 Minimal number of training sets in which a MoVRs motif must occur [Default: -k argument minus 1]
        -o|--outputdir <path>             Put output into directory <path> [Default: ./]
        -c|--configfile <path>            Configuration file [Default: /home/vbrendel/gitwd/MoVRs/scripts/MoVRs.conf]
      4) MoVRs workflow settings; <step> below must be one of (step1, step2, ..., step7)
         --startfromstep <string>         Starting step; previous steps must have run successfully before.
         --stopatstep <string>            Last step to execute
         --runonlystep <string>           Workflow step to execute; previous steps must have run successfully before.
      5) Else:
        -h|--help                         Show this usage information

MoVRs takes roi and background in the three flavors supported by HOMER:

  • Peak or BED file (specifying the roi in terms of genomic ranges)
  • FASTA file (directly supplying the sequences of the roi)
  • geneID file (specifying the roi with Gene Identifiers)

Please refer to the extensive and excellent HOMER documentation to review formats and specifics.
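The three input combinations listed in the help text can be summarized in a small check. This is only a sketch of the rules as stated above, not MoVRs's actual argument parsing; the function name `input_mode` is hypothetical, and the long option names are taken from the usage message.

```python
# Map long option names from the usage message to their short forms.
LONG_TO_SHORT = {
    "--annotation": "-a", "--genome": "-g", "--Genome": "-G",
    "--fasta": "-f", "--background": "-b",
    "--geneID": "-i", "--promoter": "-r",
}

def input_mode(options):
    """Return '1a', '1b', or '1c' for a collection of option names,
    following the combinations described in the MoVRs help text."""
    given = {LONG_TO_SHORT.get(opt, opt) for opt in options}
    if "-a" in given and ({"-g", "-G"} & given):
        return "1a"  # peak or BED file plus genome path or identifier
    if {"-f", "-b"} <= given:
        return "1b"  # FASTA target plus FASTA background
    if {"-i", "-r", "-G"} <= given:
        return "1c"  # gene IDs, promoter set, genome identifier
    raise ValueError("no complete -a, -f, or -i input combination given")

print(input_mode(["-a", "--genome", "-o", "-p"]))
print(input_mode(["-f", "-b", "-S", "-k"]))
print(input_mode(["-i", "-r", "-G", "--size"]))
```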

Sample MoVRs invocations

MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 >& errTEST1

takes the testpeakfile and TestGenome files as input and finds motifs in the range -60 to +40 relative to the annotated peaks. The program will use 10 processors.

MoVRs -a testpeakfile --genome ./TestGenome -o TEST1 --size [-60,40] -p 10 --minpresence 7 --startfromstep step5 >& errTEST1AGAIN

reruns the previous example, but with the relaxed criterion that a motif must occur in only 7 (instead of the default 9) of the training sets generated in the 10-fold cross-validation to be considered a validated candidate motif. The first four steps of the workflow are not rerun.
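The effect of --minpresence can be pictured as a simple filter over motif clusters. The sketch below assumes we already know, for each cluster, which training sets contributed a motif to it; the data structure, the names `default_minpresence` and `validated`, and the example clusters are all hypothetical.

```python
def default_minpresence(k):
    """Default from the usage message: the -k argument minus 1."""
    return k - 1

def validated(cluster_presence, k, minpresence=None):
    """Names of clusters found in at least `minpresence` training sets."""
    threshold = default_minpresence(k) if minpresence is None else minpresence
    return sorted(
        name for name, sets in cluster_presence.items()
        if len(sets) >= threshold
    )

# Hypothetical clusters and the training sets (1..10) they occur in.
presence = {
    "motifA": set(range(1, 11)),         # present in all 10 training sets
    "motifB": {1, 2, 3, 4, 5, 6, 7, 8},  # present in 8
    "motifC": {1, 2, 3},                 # present in 3
}
print(validated(presence, k=10))                 # default threshold of 9
print(validated(presence, k=10, minpresence=7))  # relaxed threshold
```

With the default threshold only motifA passes; lowering the threshold to 7 also admits motifB.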

MoVRs -a testpeakfile -G hg19 -o TEST2 --size [-60,40] -k 5 -p 10 >& errTEST2

runs the analysis against the HOMER-preprocessed human genome hg19 (option -G) instead of local chromosome files. The -k argument specifies 5-fold cross-validation.

MoVRs -f testfastafile -b Background/testbackground -S 10 -k 3 -p 9 --outputdir TEST3 >& errTEST3

takes the specified FASTA-formatted roi and background sequences and restricts HOMER (http://homer.salk.edu/homer/) to report only the best 10 motifs in each run.

MoVRs -i testidfile -r human -G hg19 --size [-200,100] -S 15 -k 4 -p 8 --outputdir TEST4 >& errTEST4

looks for motifs in the -200 to +100 range in the promoters of the human genes listed in testidfile.

Steps in the workflow

A great way of learning what the MoVRs workflow entails is to run an example in stepwise fashion. Just add the option --runonlystep step1 to your favorite example. That will stop the workflow after the first step (summarized below). Look at the program logfiles and output, follow up on the program documentation, and take a mental snapshot of what this step accomplished. Then replace --runonlystep step1 by --runonlystep step2 and continue in similar fashion until the final step.
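To avoid retyping the command for each step, the seven invocations can be generated programmatically. The sketch below only builds and prints the command lines; it uses the options of the first sample invocation, and the helper name `stepwise_commands` is hypothetical. Running each command (for example with `subprocess.run`) and inspecting the output between steps is left to the reader.

```python
import shlex

def stepwise_commands(base_args, steps=range(1, 8)):
    """One MoVRs command line per workflow step, using --runonlystep."""
    return [
        ["./MoVRs", *base_args, "--runonlystep", f"step{i}"]
        for i in steps
    ]

# Options taken from the first sample invocation above.
base = ["-a", "testpeakfile", "--genome", "./TestGenome",
        "-o", "TEST1", "--size", "[-60,40]", "-p", "10"]

cmds = stepwise_commands(base)
for cmd in cmds:
    # shlex.join quotes arguments such as [-60,40] for safe pasting into a shell.
    print(shlex.join(cmd))
```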

Step 1: Setting up training and validation sets

This step will create the training and validation sets in subdirectories tmpTrainingDir and tmpValidationDir of the specified output directory.
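For readers unfamiliar with k-fold cross-validation, the following sketch shows one generic way to derive k training/validation pairs from a list of regions. How MoVRs itself partitions the input is not described here, so the shuffling, the seed, and the function name `kfold_sets` are assumptions made for illustration.

```python
import random

def kfold_sets(regions, k, seed=0):
    """Split regions into k (training, validation) pairs.
    Each region appears in exactly one validation set."""
    items = list(regions)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        validation = folds[i]
        training = [r for j, fold in enumerate(folds) if j != i for r in fold]
        splits.append((training, validation))
    return splits

# Hypothetical region names standing in for the lines of a peak file.
regions = [f"peak{i}" for i in range(20)]
splits = kfold_sets(regions, k=5)
for n, (training, validation) in enumerate(splits, start=1):
    print(f"set {n}: {len(training)} training, {len(validation)} validation")
```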

Step 2: HOMER de novo motif finding in training sets

This step will run the appropriate HOMER motif finder on each of the training sets. Records of this step are in tmpTrainingDir, and final output is deposited in the output subdirectory tmpMotifDir.

Step 3: Motif extraction, filtering, and comparison

Run in tmpMotifDir, this step processes the motifs produced by the HOMER runs. Motifs that meet the quality threshold set by option -m are compared pairwise using the tomtom tool from the MEME suite, and a pair is labeled similar if it meets the threshold set by option -t. Results are deposited into output subdirectory tmpMotifDir/TOMTOMresults.

Step 4: Generating motif clusters

This step invokes the MoVRs_GetCluster.py script that generates motif clusters based on the tomtom similarity values obtained in the previous step. Results are deposited into output subdirectory tmpMotifDir/MCLUSTERresults.
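One simple way to turn pairwise similarity calls into clusters is to treat motifs as nodes, similar pairs as edges, and take the connected components. The sketch below does exactly that with a small union-find; it is an assumption about what such clustering could look like, not a description of what MoVRs_GetCluster.py actually does, and the motif names are invented.

```python
def cluster_motifs(motifs, similar_pairs):
    """Group motifs into connected components of the similarity graph."""
    parent = {m: m for m in motifs}

    def find(m):
        # Follow parent links to the root, shortening the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for m in motifs:
        clusters.setdefault(find(m), []).append(m)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical motif names: training set number and motif number.
motifs = ["set1_m1", "set2_m3", "set3_m2", "set1_m4", "set2_m5"]
pairs = [("set1_m1", "set2_m3"), ("set2_m3", "set3_m2"), ("set1_m4", "set2_m5")]
clusters = cluster_motifs(motifs, pairs)
for members in clusters:
    print(members)
```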

Step 5: Derivation of consensus motifs

This step invokes the MoVRs_MotifSetReduce.pl script to generate consensus motifs for each motif cluster obtained in the previous step. The consensus motifs are recorded in files tmpMotifDir/MCLUSTERresults/mcluster*.cmotif.
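To give a feel for what a consensus motif is, the sketch below averages position frequency matrices that are assumed to be of equal length and already aligned. This is a deliberate simplification and not the method of MoVRs_MotifSetReduce.pl, whose details are not covered here; the function names and matrices are hypothetical.

```python
def consensus_matrix(matrices):
    """Average equal-length frequency matrices position by position.
    Each matrix is a list of [A, C, G, T] frequency rows."""
    length = len(matrices[0])
    if any(len(m) != length for m in matrices):
        raise ValueError("matrices must have equal length (pre-aligned)")
    n = len(matrices)
    return [
        [sum(m[pos][base] for m in matrices) / n for base in range(4)]
        for pos in range(length)
    ]

def consensus_string(matrix, alphabet="ACGT"):
    """Most frequent base at each position."""
    return "".join(alphabet[row.index(max(row))] for row in matrix)

# Two hypothetical, already aligned motifs from one cluster.
m1 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
m2 = [[0.9, 0.0, 0.1, 0.0], [0.2, 0.0, 0.8, 0.0], [0.0, 0.2, 0.0, 0.8]]
avg = consensus_matrix([m1, m2])
print(consensus_string(avg))
```

In practice, motifs in a cluster may differ in length, offset, or strand, so an alignment step would be needed before any averaging.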

Step 6: MoVRs motif presentation and annotation

In this step, the mcluster*.cmotif files are further processed to show seqLogos and similarity with known motifs. Results are deposited into output subdirectory MoVRs_OutputDir.

Step 7: Find MoVRs consensus motifs in regions of interest

In the final step, the appropriate HOMER tools are used to tabulate occurrences of the MoVRs consensus motifs in the roi. See files MoVRs_OutputDir/mcluster*.tab.

Output and examination of results

We are still working on nice summaries of results. Until then, you are on your own ...