eggNOG mapper v1
- Python 2.7+
- wget
- HMMER 3 and/or DIAMOND binaries available (otherwise using the ones packaged with eggNOG-mapper)
- BioPython (required only if using the --translate option)
- ~20GB for the eggNOG annotation database
- ~20GB for eggNOG fasta files
- ~130GB for the three optimized eggNOG databases (euk, bact, arch), and from 1GB to 35GB for each taxonomic-specific eggNOG HMM database. You don't have to download them all; just pick the ones you are interested in. (You can check the size of individual datasets at http://beta-eggnogdb.embl.de/download/eggnog_4.5/hmmdb_levels/)
eggnog-mapper can run very fast searches by loading the target HMM databases into memory (using the HMMER3 hmmpgmd program). This is enabled with the --usemem or --servermode flags, and it requires a large amount of RAM (depending on the size of the target database). As a reference:
- ~90GB to load the optimized eukaryotic database (euk)
- ~32GB to load the optimized bacterial database (bact)
- ~10GB to load the optimized archaeal database (arch)
Note:
- Searches are still possible on low-memory systems. However, disk I/O will be a bottleneck (especially when using multiple CPUs). Place databases on the fastest disk possible.
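The reference figures above can be turned into a quick pre-flight check. The following is a minimal sketch, not part of eggnog-mapper: the function names are hypothetical, the thresholds are simply the approximate values listed above, and total physical RAM (read through `os.sysconf`, which is assumed to be available as on Linux) is only a rough proxy for the memory that is actually free.

```python
import os

# Approximate RAM (in GB) needed to load each optimized database,
# copied from the reference figures listed above.
REQUIRED_GB = {"euk": 90, "bact": 32, "arch": 10}


def total_ram_gb():
    """Return total physical RAM in GB (assumes a Linux-like os.sysconf)."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    n_pages = os.sysconf("SC_PHYS_PAGES")
    return page_size * n_pages / float(1024 ** 3)


def can_use_memory_mode(db, ram_gb):
    """True if ram_gb meets the reference requirement for the given database."""
    if db not in REQUIRED_GB:
        raise ValueError("no reference figure for database: %s" % db)
    return ram_gb >= REQUIRED_GB[db]


if __name__ == "__main__":
    ram = total_ram_gb()
    for db in sorted(REQUIRED_GB):
        mode = "--usemem" if can_use_memory_mode(db, ram) else "disk-based"
        print("%s: %s (%.0f GB RAM detected)" % (db, mode, ram))
```

Taxonomically restricted databases are not covered by this table; their memory needs depend on the size of each database.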
- Download and decompress the latest version of eggnog-mapper from https://github.com/jhcepas/eggnog-mapper/releases. The program requires neither compilation nor installation.
- Or clone this git repository:
git clone https://github.com/jhcepas/eggnog-mapper.git
- eggNOG mapper provides 107 taxonomically restricted HMM databases (xxxNOG), three optimized databases [Eukaryota (euk), Bacteria (bact) and Archaea (arch)] and a virus-specific database (viruses).
- The three optimized databases include all HMM models from their corresponding taxonomic levels in eggNOG (euNOG, bactNOG, arNOG) plus additional models obtained by splitting large alignments into taxonomically restricted (smaller) HMM models. In particular, HMM models with more than 500 (euk) or 50 (bact) sequences are expanded. The arch database includes all models from all archaeal taxonomic levels in eggNOG.
- Taxonomically restricted databases are listed here. They can be referred to by their code (e.g. maNOG for Mammals).
To download a given database, execute the download script, providing the list of databases to fetch:
download_eggnog_data.py euk bact arch viruses
This will fetch and decompress all precomputed eggNOG data into the data/ directory.
- Disk based searches on the optimized bacterial database
python emapper.py -i test/polb.fa --output polb_bact -d bact
- Disk based searches on the optimized database of viral models
python emapper.py -i test/polb.fa --output polb_viruses -d viruses
- Disk based searches on the mammal specific database
python emapper.py -i test/p53.fa --output p53_maNOG -d maNOG
- Memory based searches on the mammal specific database
python emapper.py -i test/p53.fa --output p53_maNOG -d maNOG --usemem
- Using DIAMOND as search method.
Note that no target database is required when using DIAMOND.
python emapper.py -i test/p53.fa --output p53_maNOG -m diamond
For each query sequence, a list of significant hits to eggNOG Orthologous Groups (OGs) is reported. Each line in the file represents a hit, where the e-value, bit-score, query coverage and sequence coordinates of the match are reported. If multiple hits exist for a given query, results are sorted by e-value.
Each line in the file provides the best match of each query within the best Orthologous Group (OG) reported in the [project].hmm_hits file, obtained by running PHMMER against all sequences within the best OG. The seed ortholog is used to fetch fine-grained orthology relationships from eggNOG. If using the diamond search mode, seed orthologs are directly obtained from the best matching sequences by running DIAMOND against the whole eggNOG protein space.
This file provides final annotations of each query. Tab-delimited columns in the file are:
- query_name: query sequence name
- seed_eggNOG_ortholog: best protein match in eggNOG
- seed_ortholog_evalue: best protein match (e-value)
- seed_ortholog_score: best protein match (bit-score)
- predicted_gene_name: predicted gene name for query sequences
- GO_terms: comma-delimited list of predicted Gene Ontology terms
- KEGG_KO: comma-delimited list of predicted KEGG KOs
- BiGG_Reactions: comma-delimited list of predicted BiGG metabolic reactions
- Annotation_tax_scope: the taxonomic scope used to annotate this query sequence
- Matching_OGs: comma-delimited list of matching eggNOG Orthologous Groups
- best_OG|evalue|score: best matching Orthologous Groups (only in HMM mode)
- COG functional categories: COG functional category inferred from best matching OG
- eggNOG_HMM_model_annotation: eggNOG functional description inferred from best matching OG
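For downstream analysis, the column layout above can be read with a few lines of Python. This is a minimal sketch under stated assumptions: `parse_annotations` is a hypothetical helper (not part of eggnog-mapper), the columns are assumed to appear in the order listed above, and any comment lines are assumed to start with `#`.

```python
# Column names in the order listed above (assumed to match the file layout).
COLUMNS = [
    "query_name", "seed_eggNOG_ortholog", "seed_ortholog_evalue",
    "seed_ortholog_score", "predicted_gene_name", "GO_terms", "KEGG_KO",
    "BiGG_Reactions", "Annotation_tax_scope", "Matching_OGs",
    "best_OG|evalue|score", "COG functional categories",
    "eggNOG_HMM_model_annotation",
]

# Columns documented as comma-delimited lists.
LIST_COLUMNS = ("GO_terms", "KEGG_KO", "BiGG_Reactions", "Matching_OGs")


def parse_annotations(handle):
    """Yield one dict per annotated query from a tab-delimited file handle."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) '#' comment lines
        record = dict(zip(COLUMNS, line.split("\t")))
        # Split comma-delimited columns into Python lists; empty -> [].
        for name in LIST_COLUMNS:
            record[name] = [v for v in record.get(name, "").split(",") if v]
        yield record
```

Typical usage would be `for rec in parse_annotations(open(path)): ...`, where `path` points to the annotations file produced for your project. Numeric columns are left as strings, so convert them with `float()` where needed.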
If only one input file is going to be annotated, simply use the --usemem and --cpu XX options. For instance:
python emapper.py -i test/polb.fa --output polb_pfam -d pfam/pfam.hmm --usemem --cpu 10
If you are planning to use the same database for annotating multiple files, you can start eggnog-mapper in server mode (this will load the target database in memory and keep it there until stopped). Then you can use another eggnog-mapper instance to connect to the server. For instance,
In terminal 1, execute:
python emapper.py -d arch --cpu 10 --servermode
This will load the database into memory and give you the address to connect to it. Then, in a different terminal, execute:
python emapper.py -d arch:localhost:51600 -i test/polb.fa -o polb_arch
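When many files need to be annotated against the same running server, the client command above can be generated in a loop. The sketch below is illustrative only: `build_client_command` and `annotate_files` are hypothetical helpers, and the default host and port are taken from the example above, so replace them with the address printed by your own server.

```python
import subprocess


def build_client_command(fasta_path, output_prefix, db="arch",
                         host="localhost", port=51600):
    """Build the emapper.py client command for one input file.

    The db:host:port target mirrors the example above; the defaults are
    assumptions and must match the address reported by the server.
    """
    target = "%s:%s:%d" % (db, host, port)
    return ["python", "emapper.py", "-d", target,
            "-i", fasta_path, "-o", output_prefix]


def annotate_files(jobs, dry_run=True):
    """Run (or just print) one client command per (fasta, prefix) pair."""
    commands = [build_client_command(fasta, prefix) for fasta, prefix in jobs]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)  # raises if emapper.py fails
    return commands
```

With `dry_run=True` (the default) nothing is executed, which makes it easy to inspect the commands before sending them to the server.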
You can also provide a custom hmmpressed HMMER3 database. For this, just provide the path and base name of the database (removing the .h3f-like extension).
python emapper.py -i test/polb.fa --output polb_pfam -d pfam/pfam.hmm
The following recommendations are based on our experience annotating huge genomic and metagenomic datasets (>100M proteins).
eggNOG mapper works in two phases: 1) finding seed orthologous sequences, and 2) expanding annotations. Phase 1 is mainly CPU intensive, while phase 2 is dominated by disk operations. You can therefore optimize the annotation of huge files by running each phase on a different setup.
- Split your input FASTA file into chunks, each containing a moderate number of sequences (1M seqs per file worked well in our tests). We usually work with FASTA files where each sequence is on a single line, so splitting is very simple (1M sequences correspond to 2M lines).
split -l 2000000 -a 3 -d input_file.faa input_file.chunk_
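The `split` command above relies on every record occupying exactly two lines. If your FASTA file wraps sequences over several lines, a record-aware splitter is safer. The following is a minimal sketch, assuming plain-text FASTA input; `split_fasta` and `write_chunks` are hypothetical helpers, and the output names imitate the `input_file.chunk_000` pattern produced by `split -a 3 -d`.

```python
def split_fasta(lines, seqs_per_chunk):
    """Group FASTA lines into chunks of at most seqs_per_chunk records.

    A new record starts at every line beginning with '>'. Yields lists
    of lines, so multi-line sequences are never cut in the middle.
    """
    chunk, n_seqs = [], 0
    for line in lines:
        if line.startswith(">"):
            if n_seqs == seqs_per_chunk:
                yield chunk
                chunk, n_seqs = [], 0
            n_seqs += 1
        chunk.append(line)
    if chunk:
        yield chunk


def write_chunks(fasta_path, prefix, seqs_per_chunk=1000000):
    """Write chunks to prefix000, prefix001, ... and return the paths."""
    paths = []
    with open(fasta_path) as handle:
        for i, chunk in enumerate(split_fasta(handle, seqs_per_chunk)):
            path = "%s%03d" % (prefix, i)
            with open(path, "w") as out:
                out.writelines(chunk)
            paths.append(path)
    return paths
```

For example, `write_chunks("input_file.faa", "input_file.chunk_")` would produce files that match the `*.chunk_*` pattern used in the next step.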
- Use diamond mode. Each chunk can be processed independently on a cluster node, and you should tell emapper.py not to run the annotation phase yet. This way you can parallelize diamond searches as much as you want, even when running from a shared file system. Assuming an example with 100M proteins, the above command will generate 100 file chunks, and each should run diamond using 16 cores. The commands to submit to the cluster queue can be generated with something like this:
# generate all the commands that should be distributed in the cluster
for f in *.chunk_*; do
echo ./emapper.py -m diamond --no_annot --no_file_comments --cpu 16 -i $f -o $f;
done
The annotation phase needs to query data/eggnog.db intensively. This file is a sqlite3 database, so it is highly recommended that the file lives on the fastest local disk possible. For instance, we store eggnog.db on SSD disks or, if possible, under /dev/shm (memory-based filesystem).
- Concatenate all chunk_*.emapper.seed_orthologs files.
cat *.chunk_*.emapper.seed_orthologs > input_file.emapper.seed_orthologs
- Run the orthologs search and annotation phase in a single multi core machine (10 cores in our example), reading from a fast disk.
emapper.py --annotate_hits_table input_file.emapper.seed_orthologs --no_file_comments -o output_file --cpu 10
We usually annotate at a rate of 300-400 proteins per second using 10 CPU cores and having eggnog.db under /dev/shm, but you can of course run many of those instances in parallel. If you are running emapper.py from a conda environment, check these tips.
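To get a feel for what this rate means in practice, here is a back-of-the-envelope estimate for the 100M-protein example. It is a rough sketch only: `annotation_hours` is a hypothetical helper, and the `n_instances` parameter assumes throughput scales linearly with the number of parallel instances, which may not hold when they share the same disk.

```python
def annotation_hours(n_proteins, proteins_per_second, n_instances=1):
    """Estimated wall-clock hours for the annotation phase.

    Assumes a constant rate and (optimistically) linear scaling across
    parallel instances.
    """
    seconds = n_proteins / float(proteins_per_second * n_instances)
    return seconds / 3600.0


if __name__ == "__main__":
    n = 100 * 10 ** 6  # 100M proteins, as in the example above
    for rate in (300, 400):
        print("%d proteins/s: %.1f hours" % (rate, annotation_hours(n, rate)))
    # prints:
    # 300 proteins/s: 92.6 hours
    # 400 proteins/s: 69.4 hours
```

At the quoted rate, a single instance would therefore need roughly three to four days for 100M proteins, which is why running several annotation instances in parallel is worth considering.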
And voilà, you have your annotations.