GitHub - UTNK/PRESUME: preparation of many diversified Sequences by simulation of evolutionary process

PRESUME Installation and User Manual

Overview of PRESUME

PRESUME is a software tool that simulates cell division or speciation together with the diversification of DNA sequences in a growing population. Using a distributed computing platform, PRESUME progressively generates a large number of sequences that accumulate substitutions, along with their lineage history. In this process, each daughter sequence duplicates at a speed that is incompletely inherited from that of its maternal sequence under a stochastic model (Figure 1). The substitution probability at each position in a sequence is either defined in a time-dependent manner using the GTR-Gamma model or set to a constant rate. By varying the input parameters, the user can simulate various types of sequence diversification processes.

Figure 1. Schematic diagram of PRESUME. PRESUME simulates the propagation and diversification of sequences that accumulate substitutions and generates a large set of descendant sequences with lineage information. m refers to a maternal sequence, and d₁ and d₂ refer to the two daughter sequences derived from m. In this simulation, the doubling times of the two daughter sequences (t_d1 and t_d2) are incompletely inherited from the doubling time of the mother sequence (t_m). This occurs under a stochastic model in which 1/t_d1 and 1/t_d2 follow a normal distribution whose mean and variance are 1/t_m and σ², respectively. Additionally, sequence extinction occurs at a random rate (ε) and also whenever the sequence doubling speed reaches a negative value. The substitution probabilities at different positions in each sequence of length L are defined in a time-dependent manner using the GTR-Gamma model with parameters Q, α and μ, or set to a constant rate φ (see SubstitutionModelDetails.PRESUME.pdf).
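
The inheritance rule described in the caption can be sketched in a few lines of Python. This is an illustrative sketch only, not PRESUME's actual code: the function name `sample_daughter_doubling_time` and the order of the two extinction checks are assumptions, and a speed of exactly zero is treated as extinction here simply to avoid a division by zero.

```python
import random


def sample_daughter_doubling_time(t_m, sigma, epsilon, rng):
    """Return one daughter's doubling time, or None if the daughter goes extinct.

    Hypothetical helper for illustration; not taken from PRESUME.py.
    """
    # Random extinction with probability epsilon.
    if rng.random() < epsilon:
        return None
    # The doubling speed 1/t_d is drawn from a normal distribution
    # with mean 1/t_m and standard deviation sigma (variance sigma**2).
    speed = rng.gauss(1.0 / t_m, sigma)
    # A negative speed means extinction; zero is also rejected (assumption).
    if speed <= 0:
        return None
    return 1.0 / speed


rng = random.Random(0)
# Two daughters derived from a mother whose doubling time is 1.0.
daughters = [sample_daughter_doubling_time(1.0, 0.5, 0.0, rng) for _ in range(2)]
```

With `sigma` set to 0 every daughter inherits the mother's doubling time exactly, which corresponds to a perfectly balanced lineage; larger values of `sigma` make the doubling times, and hence the tree, increasingly unbalanced.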

Supported Environment

PRESUME can be executed on macOS or Linux.
The distributed computing mode of PRESUME requires UGE (Univa Grid Engine).

Software Dependency

Python 3 (newer than 3.7.0) with the Biopython module (required) and the tqdm module (optional; needed only if you want to visualize simulation progress)
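
One way to confirm these dependencies before running PRESUME is a small check script. The sketch below is not part of PRESUME: `check_presume_requirements` is a hypothetical helper, and the version test accepts any interpreter at Python 3.7 or later, which only approximates "newer than 3.7.0".

```python
import importlib.util
import sys


def check_presume_requirements(min_python=(3, 7)):
    """Report whether the interpreter and modules listed above are available.

    Hypothetical helper for illustration; not shipped with PRESUME.
    """
    return {
        # Compare (major, minor) of the running interpreter to the minimum.
        "python_ok": sys.version_info[:2] >= min_python,
        # Biopython is imported under the package name "Bio".
        "biopython": importlib.util.find_spec("Bio") is not None,
        # tqdm is optional and only used for progress monitoring.
        "tqdm": importlib.util.find_spec("tqdm") is not None,
    }


report = check_presume_requirements()
for name, available in report.items():
    print(f"{name}: {'found' if available else 'missing'}")
```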

Software Installation

Installation of PRESUME

Each step of installation takes less than 1 min.

Download PRESUME by

git clone https://github.com/yachielab/PRESUME

Add the absolute path of PRESUME directory to $PATH
Make PRESUME executable
```
chmod u+x PRESUME.py
```

Installation of Anaconda (required)

Execute the following commands

wget https://repo.anaconda.com/archive/Anaconda3-2018.12-Linux-x86_64.sh
bash Anaconda3-2018.12-Linux-x86_64.sh

Add anaconda3/bin to $PATH

Installation of Biopython 1.76 (required)

Install Biopython by
```
conda install -c anaconda biopython
```

Installation of tqdm 4.43.0 (optional)

Install tqdm by
```
conda install -c conda-forge tqdm
```

Sample Codes

The software functions can be tested by the following example commands:

Example 1

Generation of ~100 sequences using the GTR-Gamma model with the default parameter set, without distributed computing. The computation will take several minutes.

PRESUME.py -n 100 --gtrgamma default --save

Output: a directory PRESUMEout containing the following files will be created in your working directory:

PRESUMEout.fa: FASTA file for generated descendant sequences
root.fa: FASTA file describing the root sequence used for the simulation
PRESUMEout.nwk: Newick format file for the lineage history of the generated sequences
args.csv: CSV file containing basic parameters used for the simulation (enabled by --save)
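
After a run, it can be useful to check that the FASTA and Newick outputs describe the same set of sequences. The following is a minimal standard-library sketch under stated assumptions: `parse_fasta` and `newick_tip_labels` are hypothetical helpers, the Newick parsing handles only simple unquoted labels, and the assumption that tip labels match FASTA record names is not confirmed by this manual. For real analyses, Biopython's `Bio.SeqIO.parse` and `Bio.Phylo.read` are the more robust choice.

```python
import re


def parse_fasta(lines):
    """Parse FASTA-formatted lines into a {record_name: sequence} dict."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the record name.
            name = (line[1:].split() or [""])[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def newick_tip_labels(newick_text):
    """Return tip labels of a simple Newick string (unquoted labels only)."""
    # A tip label directly follows "(" or ","; internal labels follow ")".
    return re.findall(r"[(,]\s*([^\s(),:;]+)", newick_text)


# Small inline demonstration with made-up data.
demo_fasta = [">seq1", "ACGT", "AC", ">seq2", "GGCA"]
demo_tree = "(seq1:1.0,seq2:1.5);"
sequences = parse_fasta(demo_fasta)
tips = newick_tip_labels(demo_tree)
same_names = set(sequences) == set(tips)
```

To apply this to actual output, pass an open file handle for PRESUMEout/PRESUMEout.fa to `parse_fasta` and the contents of PRESUMEout/PRESUMEout.nwk to `newick_tip_labels`.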

Example 2

Generation of ~100 sequences using a time-independent model with the substitution frequency of 5% per site per generation along with a highly unbalanced lineage trajectory (σ of 10). The computation will take several minutes.

PRESUME.py -n 100 --constant 0.05 -s 10

Output data: a directory PRESUMEout will be created in your working directory.
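
To make the time-independent model of Example 2 concrete, the sketch below applies one generation of substitutions to a sequence, where each site changes independently with probability φ (0.05 in the example). It is an illustration rather than PRESUME's implementation: the function name `substitute_once` is hypothetical, and drawing the replacement base uniformly from the three other bases is an assumption not stated in this manual.

```python
import random

BASES = "ACGT"


def substitute_once(seq, phi, rng):
    """Apply one generation of substitutions with per-site probability phi.

    Hypothetical helper; the uniform choice of replacement base is assumed.
    """
    out = []
    for base in seq:
        if rng.random() < phi:
            # Replace the base with one of the other bases, chosen uniformly.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)
root = "".join(rng.choice(BASES) for _ in range(1000))
child = substitute_once(root, 0.05, rng)
n_changed = sum(a != b for a, b in zip(root, child))
```

With the default sequence length of 1000 and φ = 0.05, the expected number of substituted sites per sequence per generation is 1000 × 0.05 = 50.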

Example 3

Generation of ~10,000 sequences using the GTR-Gamma model with a defined parameter set, with distributed computing. The computation will take several minutes.

PRESUME.py -n 10000 --gtrgamma GTR{0.927000/2.219783/1.575175/0.861651/4.748809/1.000000}+FU{0.298/0.215/0.304/0.183}+G{0.553549} --qsub

Output data: a directory PRESUMEout will be created in your working directory.
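
The --gtrgamma string in Example 3 is easy to mistype, so it may help to assemble it programmatically. The sketch below follows the GTR{A-C/A-G/A-T/C-G/C-T/G-T}+FU{piA/piC/piG/piT}+G{alpha} format given under PRESUME Usage. `format_gtrgamma` is a hypothetical helper, the number of decimal places is chosen to reproduce Example 3, and the check that base frequencies sum to 1 is an assumption about valid input rather than a documented PRESUME requirement.

```python
def format_gtrgamma(rates, freqs, alpha):
    """Build a --gtrgamma argument string from numeric parameters.

    rates: six exchangeabilities in the order A-C, A-G, A-T, C-G, C-T, G-T
    freqs: four base frequencies in the order A, C, G, T
    alpha: shape parameter of the gamma distribution
    """
    if len(rates) != 6:
        raise ValueError("expected six substitution rates")
    if len(freqs) != 4:
        raise ValueError("expected four base frequencies")
    # Assumption: base frequencies should sum to 1.
    if abs(sum(freqs) - 1.0) > 1e-6:
        raise ValueError("base frequencies should sum to 1")
    rate_part = "/".join(f"{r:.6f}" for r in rates)
    freq_part = "/".join(f"{p:.3f}" for p in freqs)
    return "GTR{" + rate_part + "}+FU{" + freq_part + "}+G{" + f"{alpha:.6f}" + "}"


model = format_gtrgamma(
    rates=[0.927, 2.219783, 1.575175, 0.861651, 4.748809, 1.0],
    freqs=[0.298, 0.215, 0.304, 0.183],
    alpha=0.553549,
)
print(model)
```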

Example 4

Generation of ~100 sequences using the GTR-Gamma model and an original indel model with a defined parameter set, without distributed computing. The computation will take ~1 minute.

PRESUME.py -n 100 --gtrgamma default --inprob prob.txt --inlength length.txt --delprob prob.txt --dellength length.txt

Input data: prob.txt defines the indel probability per generation for each initial sequence position, and length.txt defines the length distribution of indels.

Output data: a directory PRESUMEout will be created in your working directory.

See SubstitutionModelDetails.PRESUME.pdf for more details of how to specify the GTR-Gamma model parameters.

Note that because the number of sequences is monitored only sporadically during the simulation, the number of generated descendant sequences can fluctuate and may differ from the number of sequences N requested with -n.

In the distributed computing mode, the number of jobs will be around √N: PRESUME first generates ~√N sequences on a single node, and each of these is then subjected to further downstream processing on a distributed computing node.
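
As a rough guide to how many grid jobs to expect, the √N rule can be written out as below. This is back-of-the-envelope arithmetic, not PRESUME's scheduling code: `distributed_plan` is a hypothetical helper, and rounding √N to the nearest integer is an assumption, since the manual only says "around √N".

```python
import math


def distributed_plan(n_sequences):
    """Estimate (number of jobs, sequences per job) for the sqrt(N) scheme."""
    # About sqrt(N) sequences are generated first, one job per sequence.
    n_jobs = max(1, round(math.sqrt(n_sequences)))
    # Each job then needs to produce roughly N / sqrt(N) descendants.
    per_job = math.ceil(n_sequences / n_jobs)
    return n_jobs, per_job


jobs, per_job = distributed_plan(10000)
print(f"about {jobs} jobs, each generating about {per_job} sequences")
```

For Example 3 (-n 10000) this gives roughly 100 jobs of about 100 sequences each.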

PRESUME Usage

Usage:
    PRESUME.py 
    [-v] [--version] [-h] [--help] [-n sequence_number] [-L sequence_length] [-s standard_deviation]
    [-e extinction_probability] [--gtrgamma model_parameters] [-m mean_substitution_rate]
    [--constant substitution_probability] [--qsub] [--output directory_path] [-f input_file]
    [--load file_name] [-u sequences_number] [--debug] [--bar] [--save] 
    [-r max_retrial_number] [--seed random_seed] [--limit time_limit] 

Options:
    -v --version
      Print PRESUME version; ignore all of the other parameters
    -h --help
      Print the usage of PRESUME; ignore all of the other parameters
    -n <Integer>
      Number of sequences to be generated. Default: 100
    -L <Integer>
      Length of sequences to be generated. Default: 1000
    -s <Float>
      Standard deviation of propagation speed. Default: 0
    -e <Float>
      Probability of extinction. Default: 0
    --gtrgamma <String>
      GTR-Gamma model parameters
        Format: --gtrgamma GTR{A-C/A-G/A-T/C-G/C-T/G-T}+FU{piA/piC/piG/piT}+G{alpha}
        For more details, see https://github.com/yachielab/PRESUME/blob/master/SubstitutionModelDetails.PRESUME.pdf
        or you can use the default parameter set by 
          --gtrgamma default
        which is equivalent to
          --gtrgamma GTR{0.03333/0.03333/0.03333/0.03333/0.03333/0.03333}+FU{0.25/0.25/0.25/0.25}+G4{10000}
    -m <Float>
      Mean of the gamma distribution for relative substitution rates of different sequence
        positions. Default: 1
    --constant <Float>
      Execute time-independent model to simulate sequence diversification with a parameter of
        constant substitution probability per generation of every sequence position
    --qsub
      Execute the distributed computing mode
    --output <String>
      Output directory path. PRESUME creates the directory if it does not exist. Default: current directory
    -f <String>
      Input FASTA file name for the root sequence. A random sequence will be generated unless specified
    -u <Integer>
      Maximum number of sequences to be generated. Default: 1000000000
    --debug
      Output intermediate files
    --bar
      Activate the monitoring of simulation progress with the Python tqdm module
    --save
      Output a CSV file for parameter values used for the simulation
    --param <String>
      CSV file for parameter values.
      This file can be obtained from a previous simulation run executed with the --save option.
    -r <Integer>
      Maximum number of retrials of simulation when all sequences are extinct
    --seed <Integer>
      Seed value for generation of random values. Default: 0
    --monitor <Float>
      Stepper size parameter for monitoring of lineage generation. Default: 1
    --tree <String>
      Input Newick format file name if a template tree is given.
        The following parameters will be ignored:
          -L -s -e -f -u -r --constant --qsub --load --debug --bar --save --seed --limit
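
When running many simulations with different settings, it can be convenient to assemble the command line from Python. The sketch below only builds the argument list from options documented above; `build_presume_command` is a hypothetical helper, not part of PRESUME, and it assumes PRESUME.py is on $PATH as described in the installation steps.

```python
import shlex


def build_presume_command(n=100, length=1000, sigma=0.0, gtrgamma=None,
                          constant=None, output=None, seed=None,
                          qsub=False, save=False):
    """Return a PRESUME.py argument list using documented options only."""
    cmd = ["PRESUME.py", "-n", str(n), "-L", str(length), "-s", str(sigma)]
    if gtrgamma is not None:
        cmd += ["--gtrgamma", gtrgamma]
    if constant is not None:
        cmd += ["--constant", str(constant)]
    if output is not None:
        cmd += ["--output", output]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    if qsub:
        cmd.append("--qsub")
    if save:
        cmd.append("--save")
    return cmd


cmd = build_presume_command(n=100, gtrgamma="default", save=True)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to launch the simulation; passing a list rather than a single string avoids shell quoting issues with the braces in --gtrgamma model strings.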

Contact

Keito Watano (The University of Tokyo) watano.k10.yachielab@gmail.com
Naoki Konno (The University of Tokyo) naoki@bs.s.u-tokyo.ac.jp
Nozomu Yachie (The University of Tokyo) nzmyachie@gmail.com
