
Build STAG database for genomes

Alessio Milanese edited this page Feb 9, 2021 · 6 revisions

To create a STAG database for genomes, you need several inputs. The analysis of a genome is based on a selection of marker genes. As a running example, assume we have two marker genes: COG12 and COG18.

Create STAG database for single genes

The first step is to create a STAG database for each of the marker genes, using stag train -o COG12.stagDB [...] and stag train -o COG18.stagDB [...]

Create STAG database for concatenated sequences

Concatenating the alignments gives higher resolution for the assignments. Hence, for the annotation of the entire genome, we build a new STAG database trained on the concatenation of all the marker genes. Note that the order of the genes is important.

To create this database, first create STAG alignments of the training sequences using stag align. Then manually concatenate the alignments, taking care of missing genes as well. The result is a single file; let's call it concatenated_alis.tsv.
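
The concatenation step could be scripted along the following lines. This is only a sketch under assumptions that are not stated on this page: each per-gene alignment is taken to be a tab-separated file with a row identifier in the first column (shared across genes, for example a genome ID) followed by a fixed number of value columns, and rows missing for a gene are padded with a fill value ("0" here). The function names read_alignment and concatenate_alignments are hypothetical; check the real output of stag align and the expected treatment of missing genes before relying on it.

```python
import csv
import io


def read_alignment(handle):
    """Parse one tab-separated alignment into ({row_id: [values]}, width).

    Assumed layout (not confirmed by the STAG docs): first column is a row
    identifier, the remaining columns are the encoded alignment.
    """
    rows = {}
    width = None
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip blank lines
        row_id, values = fields[0], fields[1:]
        if width is None:
            width = len(values)
        elif len(values) != width:
            raise ValueError(
                f"row {row_id!r} has {len(values)} columns, expected {width}"
            )
        rows[row_id] = values
    return rows, (width or 0)


def concatenate_alignments(handles, fill="0"):
    """Join per-gene alignments column-wise, in the order the handles are given.

    Rows absent from a gene are padded with `fill` (an assumption), so every
    output row has the same total width.
    """
    parsed = [read_alignment(handle) for handle in handles]

    # Collect row identifiers in first-seen order across all genes.
    all_ids = []
    seen = set()
    for rows, _ in parsed:
        for row_id in rows:
            if row_id not in seen:
                seen.add(row_id)
                all_ids.append(row_id)

    merged_rows = []
    for row_id in all_ids:
        merged = [row_id]
        for rows, width in parsed:
            merged.extend(rows.get(row_id, [fill] * width))
        merged_rows.append(merged)
    return merged_rows


# Toy input: genome g2 has no COG18 row, so that block is padded.
cog12 = io.StringIO("g1\t1\t0\ng2\t0\t1\n")
cog18 = io.StringIO("g1\t1\t1\t0\n")
concatenated = concatenate_alignments([cog12, cog18])
```

The handles are passed in gene order (COG12 first, then COG18), which is the order that later steps have to reproduce.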

Finally, you can use stag create_db to create a new database. Note: you need to provide an HMM file, but it will not be used here, so any dummy file will do.

A call like:

stag create_db -s concatenated_alis.tsv -x taxonomy_file -a dummy_file -o concatenated_ali.stagDB

will produce the needed database.

Define HMM thresholds

We need a file that contains the HMM thresholds for the genes. The thresholds refer to the full-sequence score returned by hmmsearch. This is a tab-separated file (let's call it thresholds.tsv) like:

COG12.stagDB   60
COG18.stagDB   120

where the first column should match the name of the corresponding single-gene STAG database, and the second column is the threshold score.

Note that the order of the genes MUST be the same as the order used in the concatenation of genes to create concatenated_ali.stagDB.
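
Because the order requirement is easy to violate by hand, the thresholds file can be generated from the same ordered list that was used for the concatenation. The sketch below writes the two example rows and checks the order; the helper names write_thresholds and read_threshold_order are hypothetical, and an in-memory buffer stands in for thresholds.tsv.

```python
import csv
import io


def write_thresholds(handle, gene_thresholds):
    """Write one 'database<TAB>score' row per gene, preserving the given order."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for db_name, score in gene_thresholds:
        writer.writerow([db_name, score])


def read_threshold_order(handle):
    """Return the database names in the order they appear in the file."""
    return [fields[0] for fields in csv.reader(handle, delimiter="\t") if fields]


# Order used when concatenating the alignments (from the running example).
concatenation_order = ["COG12.stagDB", "COG18.stagDB"]

buffer = io.StringIO()  # stands in for open("thresholds.tsv", "w")
write_thresholds(buffer, [("COG12.stagDB", 60), ("COG18.stagDB", 120)])

buffer.seek(0)
order_ok = read_threshold_order(buffer) == concatenation_order
```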

Create final STAG database

We can create the genome STAG database with:

stag train_genome -i COG12.stagDB,COG18.stagDB -T thresholds.tsv -C concatenated_ali.stagDB -o genomes.stagDB
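
One way to keep the -i list consistent with the thresholds file is to derive it from that file. The sketch below only assembles the argument list, using the flags shown in the command above and nothing else; build_train_genome_command is a hypothetical helper, and the command is not executed here.

```python
import shlex


def build_train_genome_command(threshold_lines, thresholds_path,
                               concatenated_db, output_db):
    """Assemble the train_genome call, taking gene order from the thresholds rows."""
    gene_dbs = [
        line.split("\t")[0]
        for line in threshold_lines
        if line.strip()  # ignore blank lines
    ]
    return [
        "stag", "train_genome",
        "-i", ",".join(gene_dbs),
        "-T", thresholds_path,
        "-C", concatenated_db,
        "-o", output_db,
    ]


threshold_lines = ["COG12.stagDB\t60", "COG18.stagDB\t120"]
command = build_train_genome_command(
    threshold_lines, "thresholds.tsv", "concatenated_ali.stagDB", "genomes.stagDB"
)
command_text = shlex.join(command)  # requires Python 3.8+
```

The resulting list can be passed to subprocess.run(command, check=True) once STAG is installed.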

Now, you can classify a genome with:

stag classify_genome -d genomes.stagDB [...]