Build STAG database for genomes
To create a STAG database for genomes you need several inputs. The analysis of a genome is based on a selection of marker genes. Let's say that we have two marker genes: COG12 and COG18.
The first step is to create a STAG database for each of the marker genes, using stag train:
stag train -o COG12.stagDB [...]
stag train -o COG18.stagDB [...]
Concatenating the alignments gives higher resolution for the assignments. Hence, for the annotation of the entire genome, we build a new STAG database trained on the concatenation of all the marker genes. Note that the order of the genes is important.
In order to create this database, we first need to create STAG alignments of the training sequences using stag align. Then you will need to manually concatenate the alignments (taking care of missing genes as well). The result is a file; let's call it concatenated_alis.tsv.
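The manual concatenation step can be sketched in Python. This is only an illustration under stated assumptions, not STAG's own procedure: it assumes each per-gene alignment has already been loaded as a mapping from a genome identifier to a list of alignment columns, and that a genome missing a gene should be padded with a filler value. The function name, the in-memory layout, and the filler value are all hypothetical; check the actual output of stag align before adapting it.

```python
def concatenate_alignments(per_gene, gene_order, filler="0"):
    """Concatenate per-gene alignment rows into one row per genome.

    per_gene:   dict gene name -> dict genome id -> list of column strings
                (an assumed in-memory layout, not STAG's file format)
    gene_order: genes in the order they must appear in the concatenation
    filler:     value used to pad genomes lacking a gene (assumption)
    """
    # Every row of one gene must have the same number of columns.
    widths = {}
    for gene in gene_order:
        lengths = {len(cols) for cols in per_gene[gene].values()}
        if len(lengths) != 1:
            raise ValueError(f"{gene}: inconsistent or empty alignment")
        widths[gene] = lengths.pop()

    # Collect every genome seen in at least one gene alignment.
    genomes = sorted({g for gene in gene_order for g in per_gene[gene]})

    rows = {}
    for genome in genomes:
        row = []
        for gene in gene_order:  # fixed gene order for every genome
            cols = per_gene[gene].get(genome)
            if cols is None:
                # Missing gene: pad so all rows keep the same total width.
                cols = [filler] * widths[gene]
            row.extend(cols)
        rows[genome] = row
    return rows


def write_tsv(rows, path):
    """Write one tab-separated line per genome: id, then all columns."""
    with open(path, "w") as out:
        for genome, cols in rows.items():
            out.write("\t".join([genome] + cols) + "\n")
```

The gene order passed as gene_order is the order that later steps must reuse.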
Finally, you can use stag create_db to create a new database. Note: you need to provide an HMM file, but it will not be used here, so any dummy file will do.
A call like:
stag create_db -s concatenated_alis.tsv -x taxonomy_file -a dummy_file -o concatenated_ali.stagDB
will produce the needed database.
We need a file that contains the HMM thresholds for the genes. The thresholds refer to the full sequence score reported by hmmsearch.
This is a tab-separated file (let's call it thresholds.tsv) like:
COG12.stagDB 60
COG18.stagDB 120
where the first column must match the name of the corresponding single-gene STAG database, and the second column is the threshold score.
Note that the order of the genes MUST be the same as the order used in the concatenation of genes to create concatenated_ali.stagDB.
We can create the genome STAG database with:
stag train_genome -i COG12.stagDB,COG18.stagDB -T thresholds.tsv -C concatenated_ali.stagDB -o genomes.stagDB
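Because the gene order has to agree between the thresholds file, the -i list, and the concatenation, one way to avoid mismatches is to derive all of them from a single ordered list. The following Python sketch does that; the helper names are hypothetical, and only the file layout and the flags shown in the command above are taken from this page.

```python
def thresholds_text(gene_dbs, scores):
    """Return the thresholds file content, one 'name<TAB>score' line per
    database, in exactly the order given by gene_dbs."""
    return "".join(f"{db}\t{scores[db]}\n" for db in gene_dbs)


def write_thresholds(gene_dbs, scores, path="thresholds.tsv"):
    """Write the thresholds file to disk."""
    with open(path, "w") as out:
        out.write(thresholds_text(gene_dbs, scores))


def train_genome_command(gene_dbs, thresholds, concat_db, output):
    """Build the stag train_genome call as an argument list, reusing the
    same ordered gene_dbs list for the -i option."""
    return [
        "stag", "train_genome",
        "-i", ",".join(gene_dbs),
        "-T", thresholds,
        "-C", concat_db,
        "-o", output,
    ]


# Single source of truth for the gene order (same order as the concatenation).
gene_dbs = ["COG12.stagDB", "COG18.stagDB"]
scores = {"COG12.stagDB": 60, "COG18.stagDB": 120}

command = train_genome_command(
    gene_dbs, "thresholds.tsv", "concatenated_ali.stagDB", "genomes.stagDB"
)
print(" ".join(command))
```

Calling write_thresholds(gene_dbs, scores) before running the printed command keeps the file and the -i list in the same order.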
Now, you can classify a genome with:
stag classify_genome -d genomes.stagDB [...]