Bayesian Markov Model motif discovery software.
To compile from source you need a C++ compiler (e.g. gcc) and CMake.
To plot BaMM logos you need
- R 2.14.1 or later
mkdir build
cd build
cmake ..
make
OS X ships clang instead of gcc. We recommend using Homebrew to install gcc.
Once Homebrew is installed, all required dependencies can be installed using the brew
command:
brew tap homebrew/versions
brew tap homebrew/science
brew install gcc5 cmake R
Finally, compile BaMM!motif:
export CXX=g++-5
export CC=gcc-5
export LDFLAGS="-static-libgcc -static-libstdc++"
mkdir build
cd build
cmake ..
make
SYNOPSIS
BaMMmotif DIRPATH FILEPATH [OPTIONS]
DESCRIPTION
Bayesian Markov Model motif discovery software.
DIRPATH
Output directory for the results.
FILEPATH
FASTA file with positive sequences of equal length.
OPTIONS
Sequence options
--negSequenceSet <FILEPATH>
FASTA file with negative/background sequences used to learn the
(homogeneous) background BaMM. If not specified, the background BaMM
is learned from the positive sequences.
--reverseComp
Search motifs on both strands (positive sequences and reverse
complements). This option is recommended, for example, when using
sequences derived from ChIP-seq experiments.
Options to initialize a single BaMM from file
--bindingSiteFile <FILEPATH>
File with binding sites of equal length (one per line).
--markovModelFile <FILEPATH>
File with BaMM probabilities as obtained from BaMM!motif (omit
filename extension).
Options to initialize one or more BaMMs from XXmotif PWMs
--minPWMs <INTEGER>
Minimum number of PWMs. The filters --maxPValue and --minOccurrence
are not applied to these top-ranked PWMs. The default is 1.
--maxPWMs <INTEGER>
Maximum number of PWMs.
--maxPValue <FLOAT>
Maximum p-value of PWMs. This filter is not applied to the top-ranked
PWMs retained by --minPWMs. The default is 1.0.
--minOccurrence <FLOAT>
Minimum fraction of sequences that contain the motif. This filter is
not applied to the top-ranked PWMs retained by --minPWMs. The
default is 0.05.
--rankPWMs <INTEGER> [<INTEGER>...]
PWM ranks in XXmotif results. When this option is given, the preceding
options to initialize BaMMs from PWMs are ignored.
Options for (inhomogeneous) motif BaMMs
-k <INTEGER>
Order. The default is 2.
-a|--alpha <FLOAT> [<FLOAT>...]
Order-specific prior strength. The default is 1.0 (for k = 0) and
20 x 3^(k-1) (for k > 0). When this option is given, the options -b
and -g are ignored.
-b|--beta <FLOAT>
Calculate order-specific alphas according to beta x gamma^(k-1) (for
k > 0). The default is 20.0.
-g|--gamma <FLOAT>
Calculate order-specific alphas according to beta x gamma^(k-1) (for
k > 0). The default is 3.0.
--extend <INTEGER>{1,2}
Extend BaMMs by adding uniformly initialized positions to the left
and/or right of initial BaMMs. For example, --extend 0 2 adds two
positions to the right of initial BaMMs, whereas --extend 2 adds two
positions to both sides of initial BaMMs. By default, BaMMs are not
extended.
Options for the (homogeneous) background BaMM
-K <INTEGER>
Order. The default is 2.
-A|--Alpha <FLOAT>
Prior strength. The default is 10.0.
EM options
-q <FLOAT>
Prior probability for a positive sequence to contain a motif. The
default is 0.9.
-e|--epsilon <FLOAT>
The EM algorithm is deemed to be converged when the sum over the
absolute differences in BaMM probabilities from successive EM rounds
is smaller than epsilon. The default is 0.001.
XXmotif options
--XX-ZOOPS
Use the zero-or-one-occurrence-per-sequence model (default).
--XX-MOPS
Use the multiple-occurrence-per-sequence model.
--XX-OOPS
Use the one-occurrence-per-sequence model.
--XX-seeds ALL|FIVEMERS|PALINDROME|TANDEM|NOPALINDROME|NOTANDEM
Define the nature of seed patterns. The default is to start using ALL
seed pattern variants.
--XX-gaps 0|1|2|3
Maximum number of gaps used for seed patterns. The default is 0.
--XX-pseudoCounts <FLOAT>
Percentage of pseudocounts. The default is 10.0.
--XX-mergeMotifsThreshold LOW|MEDIUM|HIGH
Define the similarity threshold used to merge PWMs. The default is to
merge PWMs with LOW similarity in order to reduce runtime.
--XX-maxPositions <INTEGER>
Limit the number of motif positions to reduce runtime. The default is
17.
--XX-noLengthOptimPWMs
Omit the length optimization of PWMs.
--XX-K <INTEGER>
Order of the (homogeneous) background BaMM. The default is either 2
(when learned on positive sequences) or 8 (when learned on background
sequences).
--XX-A <FLOAT>
Prior strength of the (homogeneous) background BaMM. The default is
10.0.
--XX-jumpStartPatternStage <STRING>
Jump-start pattern stage using an IUPAC pattern string.
--XX-jumpStartPWMStage <FILEPATH>
Jump-start PWM stage reading in a PWM from file.
--XX-localization
Calculate p-values for positional clustering of motif occurrences in
positive sequences of equal length. Improves the sensitivity for
finding weak but positioned motifs.
--XX-localizationRanking
Rank motifs according to localization statistics.
--XX-downstreamPositions <INTEGER>
Distance between the anchor position (e.g. the transcription start
site) and the last positive sequence nucleotide. Corrects motif
positions in result plots. The default is 0.
--XX-batch
Suppress progress bars.
Options to score sequences
--scorePosSequenceSet
Score positive (training) sequences with optimized BaMMs.
--scoreNegSequenceSet
Score background (training) sequences with optimized BaMMs.
--scoreTestSequenceSet <FILEPATH> [<FILEPATH>...]
Score test sequences with optimized BaMMs. Test sequences can be
provided in a single or multiple FASTA files.
Output options
--saveInitBaMMs
Write initialized BaMM(s) to disk.
--saveBaMMs
Write optimized BaMM(s) to disk.
--verbose
Verbose terminal printouts.
-h, --help
Print this help.
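The rule for the order-specific prior strengths given under -a, -b and -g can be illustrated with a short Python sketch. It only restates the formula from the help text (alpha = beta x gamma^(k-1) for k > 0) and is not taken from the BaMM!motif sources. The function name `order_alphas` is hypothetical, and using 1.0 for order 0 when beta and gamma are changed is an assumption based on the stated default.

```python
def order_alphas(k, beta=20.0, gamma=3.0, alpha0=1.0):
    """Return prior strengths [alpha_0, ..., alpha_k] for a motif BaMM of order k.

    alpha0 = 1.0 for order 0 is an assumption taken from the documented default;
    higher orders follow beta x gamma^(order - 1) as stated in the help text.
    """
    if k < 0:
        raise ValueError("order k must be non-negative")
    return [alpha0] + [beta * gamma ** (order - 1) for order in range(1, k + 1)]


if __name__ == "__main__":
    # Documented defaults: -k 2, -b 20.0, -g 3.0
    print(order_alphas(2))
```

Passing explicit values via -a corresponds to supplying this list directly, in which case beta and gamma play no role.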
BaMMs are written to flat files when BaMM!motif is invoked with the output option --saveInitBaMMs and/or --saveBaMMs. In this case, BaMM!motif generates three files for each (inhomogeneous) BaMM: one containing the probabilities (filename extension: probs), one containing the conditional probabilities (filename extension: conds), and one containing the background frequencies of mononucleotides in the positive sequences (filename extension: freqs). The first two files share the same format: blank lines separate BaMM positions, and lines 1 to k+1 of each BaMM position contain the (conditional) probabilities for order 0 to order k. For instance, the format for a BaMM of order 2 and length W is as follows:
Filename extension: probs
P1(A) P1(C) P1(G) P1(T)
P1(AA) P1(AC) P1(AG) P1(AT) P1(CA) P1(CC) P1(CG) ... P1(TT)
P1(AAA) P1(AAC) P1(AAG) P1(AAT) P1(ACA) P1(ACC) P1(ACG) ... P1(TTT)

P2(A) P2(C) P2(G) P2(T)
P2(AA) P2(AC) P2(AG) P2(AT) P2(CA) P2(CC) P2(CG) ... P2(TT)
P2(AAA) P2(AAC) P2(AAG) P2(AAT) P2(ACA) P2(ACC) P2(ACG) ... P2(TTT)

...

PW(A) PW(C) PW(G) PW(T)
PW(AA) PW(AC) PW(AG) PW(AT) PW(CA) PW(CC) PW(CG) ... PW(TT)
PW(AAA) PW(AAC) PW(AAG) PW(AAT) PW(ACA) PW(ACC) PW(ACG) ... PW(TTT)
Filename extension: conds
P1(A) P1(C) P1(G) P1(T)
P1(A|A) P1(C|A) P1(G|A) P1(T|A) P1(A|C) P1(C|C) P1(G|C) ... P1(T|T)
P1(A|AA) P1(C|AA) P1(G|AA) P1(T|AA) P1(A|AC) P1(C|AC) P1(G|AC) ... P1(T|TT)

P2(A) P2(C) P2(G) P2(T)
P2(A|A) P2(C|A) P2(G|A) P2(T|A) P2(A|C) P2(C|C) P2(G|C) ... P2(T|T)
P2(A|AA) P2(C|AA) P2(G|AA) P2(T|AA) P2(A|AC) P2(C|AC) P2(G|AC) ... P2(T|TT)

...

PW(A) PW(C) PW(G) PW(T)
PW(A|A) PW(C|A) PW(G|A) PW(T|A) PW(A|C) PW(C|C) PW(G|C) ... PW(T|T)
PW(A|AA) PW(C|AA) PW(G|AA) PW(T|AA) PW(A|AC) PW(C|AC) PW(G|AC) ... PW(T|TT)
Filename extension: freqs
P(A) P(C) P(G) P(T)
Note that contexts are restricted to the binding site. For instance, P1(G|AC) and P2(G|AC) are defined as P1(G) and P2(G|C), respectively.
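The probs and conds layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of BaMM!motif. The names `parse_bamm` and `kmer_index` are hypothetical, the demo values are made up, and the column lookup assumes the alphabetical A, C, G, T ordering shown in the listings (for conds files, the k-mer is the context followed by the nucleotide).

```python
NUCLEOTIDES = "ACGT"


def parse_bamm(text):
    """Parse probs/conds-style text into a list of positions.

    Each position is a list of rows; row j holds the 4**(j+1) values of order j.
    Blank lines separate positions.
    """
    positions, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # a blank line closes the current position
            if current:
                positions.append(current)
                current = []
            continue
        expected = 4 ** (len(current) + 1)
        if len(fields) != expected:
            raise ValueError(f"expected {expected} values, got {len(fields)}")
        current.append([float(x) for x in fields])
    if current:
        positions.append(current)
    return positions


def kmer_index(kmer):
    """Column of a k-mer within its row, assuming A, C, G, T ordering."""
    index = 0
    for base in kmer:
        index = 4 * index + NUCLEOTIDES.index(base)
    return index


# Made-up order-1 BaMM of length 2 (uniform values, not real BaMM!motif output).
DEMO = "\n".join([
    "0.25 0.25 0.25 0.25",
    " ".join(["0.0625"] * 16),
    "",
    "0.25 0.25 0.25 0.25",
    " ".join(["0.0625"] * 16),
])

if __name__ == "__main__":
    model = parse_bamm(DEMO)
    # Entry for P2(CG) in a probs file, or P2(G|C) in a conds file.
    print(model[1][1][kmer_index("CG")])
```

Indexing is `model[position][order][column]`, all zero-based, so position 2 of the listing is `model[1]`.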
In addition, BaMM!motif generates three files for the (homogeneous) background BaMM: one containing the probabilities (filename extension: probsBg), one containing the conditional probabilities (filename extension: condsBg), and one containing the background frequencies of mononucleotides (filename extension: freqsBg). For instance, the format for a background BaMM of order 2 is as follows:
Filename extension: probsBg
P(A) P(C) P(G) P(T)
P(AA) P(AC) P(AG) P(AT) P(CA) P(CC) P(CG) ... P(TT)
P(AAA) P(AAC) P(AAG) P(AAT) P(ACA) P(ACC) P(ACG) ... P(TTT)
Filename extension: condsBg
P(A) P(C) P(G) P(T)
P(A|A) P(C|A) P(G|A) P(T|A) P(A|C) P(C|C) P(G|C) ... P(T|T)
P(A|AA) P(C|AA) P(G|AA) P(T|AA) P(A|AC) P(C|AC) P(G|AC) ... P(T|TT)
Filename extension: freqsBg
P(A) P(C) P(G) P(T)
Note that the background frequencies of mononucleotides are identical to the probabilities of mononucleotides in the other two files.
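This relationship gives a simple sanity check on a set of background files. The sketch below is an illustration only: the function names are hypothetical, the tolerance is an arbitrary assumption, and the file contents are passed in as strings with made-up values rather than read from actual BaMM!motif output.

```python
import math


def first_row(text):
    """Return the first non-blank line of a flat file as a list of floats."""
    for line in text.splitlines():
        if line.strip():
            return [float(x) for x in line.split()]
    raise ValueError("no data found")


def background_is_consistent(probs_bg, conds_bg, freqs_bg, tol=1e-6):
    """Check that the order-0 row of probsBg and condsBg matches freqsBg.

    tol is an assumed absolute tolerance for rounding in the written files.
    """
    reference = first_row(freqs_bg)
    for text in (probs_bg, conds_bg):
        row = first_row(text)
        if len(row) != len(reference):
            return False
        if not all(math.isclose(a, b, abs_tol=tol) for a, b in zip(row, reference)):
            return False
    return True


if __name__ == "__main__":
    # Made-up mononucleotide frequencies; higher-order rows are omitted here
    # because only the first row of each file is compared.
    freqs = "0.3 0.2 0.2 0.3\n"
    print(background_is_consistent(freqs, freqs, freqs))
```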
R scripts to plot the BaMM logo from a BaMM flat file are provided in the directory R. To create a BaMM logo, edit the parameter settings in plotBaMM.wrapper.R and source the code in an R session using
source( "plotBaMM.wrapper.R" )
Please find comments on available plotting options in the wrapper.
BaMM!motif is released under the GNU General Public License v3 or later. See LICENSE for more details.