migrau/phylopipe

Phylogeny pipeline

Description

Snakemake scripts used to build a phylogeny from ortholog groups.

Contents

  • src/snk_orthomcl.py. Run orthomcl using a batch of protein fasta files.
  • src/snk_groups2fasta.py. Create a fasta file for each ortholog group.
  • src/snk_phylo.py. Obtain raxml tree from a batch of unaligned fasta files, using codon mode.
  • data. Required fasta files to complete an example run.

snk_orthomcl

It runs all the steps of OrthoMCL automatically, parallelizing blast (based on http://orthomcl.org/common/downloads/software/v2.0/UserGuide.txt).

  • Input files: CDS proteins (obtained from transdecoder, for example).
  • Output file: groups.txt, containing all the ortholog clusters.

Considerations

  • An empty mysql database is required.
  • FASTA files should be renamed using a code of only four letters: XXXX.fasta
  • orthomcl.config file is also required (attached).
  • The blast step is parallelized; by default it uses 50 threads.
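The naming rule above can be checked before launching the run. The following is a minimal sketch, not part of the pipeline: the helper name `bad_names` is hypothetical, and it assumes that "four letters" means exactly four ASCII letters followed by `.fasta`.

```python
import re

# Assumption: a valid name is exactly four ASCII letters plus ".fasta".
NAME_PATTERN = re.compile(r"[A-Za-z]{4}\.fasta")


def bad_names(filenames):
    """Return the file names that do not follow the XXXX.fasta convention."""
    return [name for name in filenames if not NAME_PATTERN.fullmatch(name)]


if __name__ == "__main__":
    import os
    import sys

    # Hypothetical usage: python check_names.py ../data/
    fastadir = sys.argv[1] if len(sys.argv) > 1 else "."
    for name in bad_names(sorted(os.listdir(fastadir))):
        print("Unexpected file name:", name)
```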

Prerequisites

Possible error

  1. For rule adjustFASTA, the user has to indicate the position of the field that holds the unique ID in the FASTA header. By default it is 2; change it if necessary.
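To see what the ID position means, the sketch below approximates how a FASTA header is rewritten into the `>XXXX|id` form. It is an illustration only, not the code used by the rule. It assumes that header fields are separated by spaces or `|` and are numbered from 1, as described in the OrthoMCL user guide; the function name `adjust_header` is hypothetical.

```python
import re


def adjust_header(header, taxon_code, id_field=2):
    """Rewrite a FASTA definition line as '>XXXX|unique_id'.

    Assumption: fields are separated by spaces or '|', first field is 1.
    """
    # Drop the leading '>' and surrounding whitespace.
    body = header.lstrip(">").strip()
    fields = [field for field in re.split(r"[ |]+", body) if field]
    if id_field < 1 or id_field > len(fields):
        raise ValueError(
            "id_field {} is out of range for header: {}".format(id_field, header)
        )
    return ">{}|{}".format(taxon_code, fields[id_field - 1])


if __name__ == "__main__":
    # With the default position (2) the second field is taken as the ID.
    print(adjust_header(">gi|12345|ref|NP_000001.1| some protein", "hsap"))
    # Headers whose ID comes first need id_field=1.
    print(adjust_header(">TRINITY_DN1_c0_g1_i1.p1 len=300", "hsap", id_field=1))
```

If the headers carry the ID in the first field, as in the second call, leaving the position at 2 would pick up the wrong field, which is the error this note warns about.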

Usage

$ snakemake --snakefile snk_orthomcl.py --config fastadir=../data/ -j 24 (-np)

snk_groups2fasta

Generate the fasta files containing the gene families from the ortholog groups. FASTA files are created ONLY for clusters that contain at least one sequence from every species included in the study.

  • Input files: groups.txt from orthomcl output AND fasta CDS files for each sample.
  • Output file: fasta files with the ortholog RNA sequences.
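The filtering rule (keep only clusters that include every species) can be sketched as follows. This is a minimal illustration, not the script's actual code. It assumes the usual groups.txt layout, `group_name: XXXX|id1 YYYY|id2 ...`, and the function names are hypothetical.

```python
def parse_groups(lines):
    """Parse groups.txt lines into {group_name: [member, ...]}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _, members = line.partition(":")
        groups[name.strip()] = members.split()
    return groups


def complete_groups(groups, species):
    """Keep only groups with at least one member from every species code."""
    wanted = set(species)
    kept = {}
    for name, members in groups.items():
        # The species code is the part before the first '|'.
        present = {member.split("|", 1)[0] for member in members}
        if wanted <= present:
            kept[name] = members
    return kept


if __name__ == "__main__":
    example = [
        "g1: aaaa|1 bbbb|2 cccc|3",
        "g2: aaaa|4 aaaa|5 bbbb|6",  # no cccc member, so it is dropped
    ]
    print(complete_groups(parse_groups(example), ["aaaa", "bbbb", "cccc"]))
```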

Considerations

  • The fasta files should be renamed with the same four-letter codes used in the previous orthomcl run.

Usage

$ snakemake --snakefile snk_groups2fasta.py --config fastadir=../data/ -j 24 (-np)

snk_phylo

It builds a phylogenetic tree from a batch of fasta files of gene families (the orthomcl groups output).

  • Input files: Fasta files, containing ortholog groups (output of snk_groups2fasta).
  • Output file: RAxML tree.

Rules description

  1. Remove duplicate records. One record from each sample is kept (longest or random).
  2. Translate clusters to protein and do alignment with muscle.
  3. Arrange alignments by ID.
  4. Concatenate and create partitions file with length coordinates.
  5. Convert fasta concatenated file to phy.
  6. Build phylogenetic tree: raxml.
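Steps 4 and 5 can be sketched as below: per-gene alignments are joined end to end, the coordinates of each gene are recorded as a partition line, and the result is written in relaxed PHYLIP layout. This is an illustrative sketch, not the pipeline's code. The function names are hypothetical, and the `model` label on each partition line (here `DNA`) is an assumption that should match the data type actually passed to raxml.

```python
def concatenate(alignments, model="DNA"):
    """Concatenate alignments and build partition lines.

    alignments: list of (gene_name, {sample_id: aligned_sequence}).
    Returns ({sample_id: concatenated_sequence}, [partition_line, ...]).
    """
    samples = sorted(alignments[0][1])
    pieces = {sample: [] for sample in samples}
    partitions = []
    start = 1
    for gene, aln in alignments:
        if sorted(aln) != samples:
            raise ValueError("{}: sample set differs".format(gene))
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("{}: sequences are not aligned".format(gene))
        end = start + lengths.pop() - 1
        partitions.append("{}, {} = {}-{}".format(model, gene, start, end))
        for sample in samples:
            pieces[sample].append(aln[sample])
        start = end + 1
    return {s: "".join(parts) for s, parts in pieces.items()}, partitions


def to_phylip(seqs):
    """Format concatenated sequences as relaxed PHYLIP text."""
    n_chars = len(next(iter(seqs.values())))
    lines = ["{} {}".format(len(seqs), n_chars)]
    for name in sorted(seqs):
        lines.append("{} {}".format(name, seqs[name]))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    alns = [
        ("g1", {"aaaa": "ATG---", "bbbb": "ATGCCC"}),
        ("g2", {"aaaa": "TTT", "bbbb": "T-T"}),
    ]
    seqs, parts = concatenate(alns)
    print("\n".join(parts))
    print(to_phylip(seqs), end="")
```

The check on the sample set reflects the requirement that every fasta file contains all samples: a missing sample would shift that sample's sequence relative to the partition coordinates.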

Considerations

  • All the fasta files should contain records (at least one) from ALL the samples in our study.
  • The IDs in the fasta files should start with a code of only 4 letters, as orthomcl requires (>XXXX|).
  • For translation to protein, transeq or biopython are available (default biopython).
  • For removing duplicates, there are two options: keep the longest record and remove the shorter ones, or remove records at random (default random).
  • Default raxml bootstrap: 100.
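The two duplicate-removal options can be illustrated with a short sketch. It is not the rule's actual implementation: the function name `deduplicate`, the `mode` values, and the `seed` parameter are hypothetical, and records are assumed to carry IDs of the form `XXXX|id`.

```python
import random


def deduplicate(records, mode="random", seed=None):
    """Keep one record per sample.

    records: list of (record_id, sequence), with record_id like 'XXXX|id'.
    mode: 'longest' keeps the longest sequence, 'random' picks one at random.
    """
    rng = random.Random(seed)
    by_sample = {}
    for record_id, seq in records:
        sample = record_id.split("|", 1)[0]
        by_sample.setdefault(sample, []).append((record_id, seq))
    kept = []
    for sample in sorted(by_sample):
        candidates = by_sample[sample]
        if mode == "longest":
            kept.append(max(candidates, key=lambda rec: len(rec[1])))
        elif mode == "random":
            kept.append(rng.choice(candidates))
        else:
            raise ValueError("unknown mode: {}".format(mode))
    return kept


if __name__ == "__main__":
    recs = [("aaaa|1", "ATG"), ("aaaa|2", "ATGCCC"), ("bbbb|3", "ATGA")]
    print(deduplicate(recs, mode="longest"))
```

Passing a fixed `seed` makes the random choice repeatable between runs, which can help when comparing trees.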

Prerequisites

  • transeq from EMBOSS, in case the user prefers to use it (instead of biopython) to perform the protein translation.
  • Biopython and bioawk are required in any case.

Usage (slurm example)

$ srun --partition=compute --time 7-0 --mem=10G --cpus-per-task=24 --ntasks=1 --pty bash 
$ module load python/3.5.0 
(dry run) $ snakemake --snakefile snk_phylo.py --config fastadir=CDSgroups/ -j 24 -np 
$ snakemake --snakefile snk_phylo.py --config fastadir=CDSgroups/ -j 24

Steps

Obtain ortholog groups with snk_orthomcl, extract sequences using snk_groups2fasta, and finally run snk_phylo to build the phylogenetic tree.

  • Input files: Fasta files. Proteins and CDS
  • Output file: RAxML tree.

Coming Soon

  • Run MEGA calibration with raxml results.
  • Add different aligner (prank) in snk_phylo.
  • Run paml analysis with snk_groups2fasta output.
