A fork of Pharokka to handle enVhogs
- megapharokka
- Table of Contents
- Outline
- Benchmarking
- Installing megapharokka
- Running megapharokka
- Database
- Megapharokka will Run MMSeqs2 (default of
--mmseqs2
), PyHMMER (--pyhmmer
) or HHsuite (--hhsuite
) against appropriately formatted ENVHOG databases. - Documentation in Database.
- I would recommend only using
--mmseqs2
.--hhsuite
takes too long for large datasets. - Need to do some testing for optimal
--sensitivity
. - Gene predictor defaults to
-g pyrodigal-gv
. - Massive datasets recommended to be run with
--skip_mash
.
- All benchmarking was conducted on a Intel® Core™ i7-10700K CPU @ 3.80GHz on a machine running Ubuntu 20.04.6 LTS with
-t 8
threads. - The 673 crAss-like contig dataset from Yutin et al was used as a test set. Yutin, N., Benler, S., Shmakov, S.A. et al. Analysis of metagenome-assembled viral genomes from the human gut reveals diverse putative CrAss-like phages with unique genomic features. Nat Commun 12, 1044 (2021) https://doi.org/10.1038/s41467-021-21350-w.
-g prodigal-gv
was used as gene predictor for all options.megapharokka
= 2.2M ENVHOGs consensus sequences.pharokka
= 38880 PHROGs.- By default,
pharokka
uses MMSeqs2 sensitivity of 8.5. --hhsuite
on a small subset took >20 hours, so rule that out as a viable option for metagenomes.
Overall, it is clear that megapharokka
with --mmseqs2
oon ENVHOGs is more sensitive than regular -pharokka
on PHROGs even with --fast
(pyhmmer).
673 crAss-like genomes from Yutin et al., 2021 | megapharokka --mmseqs2 --sensitivity 4 |
megapharokka --mmseqs2 --sensitivity 3 |
megapharokka --mmseqs2 --sensitivity 2 |
pharokka v1.5.0 mmseqs2 |
pharokka v1.5.0 --fast (pyhmmer) |
---|---|---|---|---|---|
Time (min) | 25.21 | 16.73 | 8.23 | 17.43 | 61.73 |
CDS | 74980 | 74980 | 74980 | 74980 | 74980 |
Annotated Function CDS | 17724 | 17540 | 17267 | 10131 | 16357 |
Unknown Function CDS | 57256 | 57440 | 57713 | 64849 | 58623 |
- Clone
git clone https://github.com/gbouras13/megapharokka
cd megapharokka
- Create conda env with external dependencies installed
mamba env create -f environment.yml
conda activate megapharokka_env
- Install
megapharokka
and most python dependecies withpip
- Note: The inimitable Martin Larralde @althonos, the creator and maintainer of these tools at EMBL Heidelberg just made production releases:
pyrodigal
v3 is required to be compatible withpyrodigal-gv
v0.1.0
pip install .
- Megapharokka will Run MMSeqs2 (default of
--mmseqs2
) or PyHMMER (--pyhmmer
) or HHsuite (--hhsuite
) against appropriately formatted ENVHOG databases. --skip_extra_annotations
will skip tRNAscan-SE, MINced and Aragorn--skip_mash
will skip Mash sketch against INPHARED.- Run
-m
meta mode (now irrelevant as we are not splitting files but indicate anyway). -s
creates split directories for your contig fastas, gffs and gbks.- You can also use
--dnaapler
to reorient contigs to begin with the large terminase subunit. Wouldn't recommend unless you know the contigs are complete.
threads=8
# mmseqs2
megapharokka.py -i input.fasta -o output_dir -t $threads -d envhogs_db -g 'prodigal-gv' --mmseqs2 -m -s -f --skip_extra_annotations
# pyhmmer
megapharokka.py -i input.fasta -o output_dir -t $threads -d envhogs_db -g 'prodigal-gv' --pyhmmer -m -s -f --skip_extra_annotations
# hhsuite
megapharokka.py -i input.fasta -o output_dir -t $threads -d envhogs_db -g 'prodigal-gv' --hhsuite -m -s -f --skip_extra_annotations
Database directory requires the usual VFDB, CARD and INPHARED files from Pharokka v1.4.0.
You can install this with
install_databases.py -d megapharokka_db
It requires ENVHOG database annotation and MMSeqs2 database:
envhogs_annot_140923.tsv
for annotation.EnVhog_consensus
MMSeqs2 database with--mmseqs2
It optionally takens the ENVHOG database files:
enVhogs.h3m
with--pyhmmer
EnVhog_hmm.ffdata
,EnVhog_hmm.ffindex
,EnVhog_a3m.ffdata
,EnVhog_a3m.ffindex
,EnVhog_cs219.ffdata
,EnVhog_cs219.ffindex
, hhsuite database with--hhsuite
usage: megapharokka.py [-h] [-i INFILE] [-o OUTDIR] [-d DATABASE] [-t THREADS] [-f] [-p PREFIX] [-l LOCUSTAG] [-g GENE_PREDICTOR] [-m] [-s] [-c CODING_TABLE] [-e EVALUE] [--dnaapler] [--hhsuite]
[--mmseqs2] [--pyhmmer] [--skip_extra_annotations] [-V] [--citation]
megapharokka: fast phage annotation program with enVhogs
options:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
Input genome file in fasta format.
-o OUTDIR, --outdir OUTDIR
Directory to write the output to.
-d DATABASE, --database DATABASE
Database directory. If the databases have been installed in the default directory, this is not required. Otherwise specify the path.
-t THREADS, --threads THREADS
Number of threads. Defaults to 1.
-f, --force Overwrites the output directory.
-p PREFIX, --prefix PREFIX
Prefix for output files. This is not required.
-l LOCUSTAG, --locustag LOCUSTAG
User specified locus tag for the gff/gbk files. This is not required. A random locus tag will be generated instead.
-g GENE_PREDICTOR, --gene_predictor GENE_PREDICTOR
User specified gene predictor. Use "-g phanotate" or "-g prodigal" or "-g prodigal-gv".
Defaults to phanotate (not required unless prodigal is desired).
-m, --meta meta mode for metavirome input samples
-s, --split split mode for metavirome samples. -m must also be specified.
Will output separate split FASTA, gff and genbank files for each input contig.
-c CODING_TABLE, --coding_table CODING_TABLE
translation table for prodigal. Defaults to 11. Experimental only.
-e EVALUE, --evalue EVALUE
E-value threshold for MMseqs2 database PHROGs, VFDB and CARD and PyHMMER PHROGs database search. Defaults to 1E-05.
--dnaapler Runs dnaapler to automatically re-orient all contigs to begin with terminase large subunit if found.
Recommended over using '--terminase'.
--hhsuite Runs hhsuite on EnVhogs.
--mmseqs2 Runs MMSeqs2 on EnVhogs.
--pyhmmer Runs pyhmmer on EnVhogs.
--skip_extra_annotations
Skips tRNAscan-se, MINced and Aragorn.
-V, --version Print pharokka Version
--citation Print pharokka Citation