-
Notifications
You must be signed in to change notification settings - Fork 23
Rfam based structural annotation of a viral genome sequence
When building a VADR model of a virus sequence, secondary structural elements can be included in the covariance model used for annotating new sequences. Including these structural elements can make alignment of new sequences more accurate, as well as allow annotation of those structural elements in the VADR output.
The secondary structural elements must be added to a single
sequence 'alignment' in Stockholm
format provided as
input to the v-build.pl
program using the --stk
option. This page
will walk you through how to use the Rfam database of RNA families and
Infernal to create a structure annotated Stockholm file to use with
v-build.pl
.
(The 'Preliminary steps' section below includes instructions for downloading and installing these required files/programs)
-
infernal: software for RNA sequence and secondary structure analysis using covariance models (CMs)
-
Rfam: database of RNA families with a library of CMs
-
jiffy-infernal-hmmer-scripts: collection of perl scripts for handling Infernal/HMMER/Easel output
$ wget http://eddylab.org/software/infernal/infernal.tar.gz
$ tar zxf infernal.tar.gz
$ cd infernal-1.1.5
$ ./configure --prefix /your/install/path
$ make
$ make check # optional: run automated tests
$ make install # optional: install Infernal programs, man pages
$ (cd hmmer; make install) # optional: install HMMER programs
$ (cd easel; make install) # optional: install Easel tools
$ wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
$ gunzip Rfam.cm.gz
$ cmpress Rfam.cm
$ git clone https://github.com/nawrockie/jiffy-infernal-hmmer-scripts.git
For this example we'll use the Dengue serotype 2 RefSeq sequence NC_001474. You can download this file from GenBank (Follow links 'Send to' -> 'Complete Record' + 'File' + 'Format:Fasta' -> 'Create File' ).
It is important that the sequence name be specifically in the format 'Accession.version' (e.g. NC_001474.2
).
Note, this procedure is not exactly how the secondary structure for
the NC_001474
model included in the VADR package was
determined. That structure was determined in collaboration with
Dengue experts and only includes trusted structural elements that
were decided should be annotated on all incoming Dengue type 2
GenBank sequences. Following this procedure will result in a
different secondary structure than the VADR model.
The cmscan
program will scan all Rfam models against the NC_001474
genome sequence and report any high-scoring hits.
$ cmscan Rfam.cm NC_001474.fa > NC_001474.cmscan
Step 3. Determine which set of non-overlapping hits you want to include in your global secondary structure prediction.
Look at the cmscan
output in NC_001474.cmscan
.
Query: gi|158976983|ref|NC_001474.2| [L=10723]
Description: Dengue virus 2, complete genome
Hit scores:
rank E-value score bias modelname start end mdl trunc gc description
---- --------- ------ ----- --------------- ------ ------ --- ----- ---- -----------
(1) ! 1.9e-22 109.9 0.0 Flavivirus-5UTR 1 148 + cm no 0.40 Flavivirus 5' UTR
(2) ! 2.9e-19 90.0 0.0 Flavivirus_DB 10541 10612 + cm no 0.60 Flavivirus DB element
(3) ! 4.4e-17 91.5 0.0 Flavi_CRE 10631 10723 + cm no 0.49 Flavivirus 3' UTR cis-acting replication element (CRE)
(4) ! 4.5e-15 74.4 0.0 Flavivirus_DB 10454 10525 + cm no 0.57 Flavivirus DB element
(5) ! 9.9e-15 80.2 0.0 DENV_SLA 1 70 + cm no 0.44 Dengue virus SLA
(6) ! 3.3e-11 62.6 0.0 cHP 97 141 + cm no 0.38 Flavivirus capsid hairpin cHP
------ inclusion threshold ------
(7) ? 0.087 30.3 0.0 Flavi_ISFV_CRE 10629 10721 + cm no 0.51 Insect-specific Flavivirus 3' UTR cis-acting replication eleme
(8) ? 2.8 13.3 0.1 TTC28-AS1_2 7835 7919 + hmm - 0.46 TTC28 antisense RNA 1 conserved region 2
(9) ? 4.5 17.5 0.0 EAV_LTH 1041 999 - cm no 0.47 Equine arteritis virus leader TRS hairpin (LTH)
(10) ? 7.6 19.9 0.0 mir-1305 10723 10645 - cm 5' 0.48 mir-1305 microRNA precursor family
You'll want to strongly consider all the hits above the inclusion threshold
, and in this case all of the E-values are less than
1e-10
so these are all compelling hits that should be
included. However, the set of hits you choose to use for structure
annotation must be non-overlapping.
Note that hits (1)
overlaps with hits (5)
and (6)
. You should choose to
include the structure from either hit (1)
or hits (5)
and (6)
. In this
case, I would lean towards using (5)
and (6)
for two reasons:
A. Both hit (1)
and the pair of hits (5)
and (6)
span about
the first 140 or 150 nts, but the summed score of hits (5)
and
(6)
is higher than the score of hit (1)
by about 30 bits.
B. The model for hit (5)
is specific to Dengue virus
(description
field), while the model for hit (1)
seems to be
more general to Flavivirus.
You can also look at the per-hit alignments in the
NC_001474.cmscan
output file to see which hit is better. For help
on interpreting these alignments, see the Infernal user's
guide.
You may also want to consider some hits below the inclusion threshold
especially those with E-values below 1 and/or with model
names and descriptions that suggest they are expected in viruses
like the one you've just scanned.
In this case hit (7)
is a Flavivirus model, but it overlaps nearly
completely with hit (3)
which has a much more significant E-value,
so we should keep hit (3)
instead.
So the set of hits we'll use to structurally annotate our viral genome is:
Query: NC_001474.2 [L=10723]
Description: Dengue virus 2, complete genome
Hit scores:
rank E-value score bias modelname start end mdl trunc gc description
---- --------- ------ ----- --------------- ------ ------ --- ----- ---- -----------
(2) ! 2.9e-19 90.0 0.0 Flavivirus_DB 10541 10612 + cm no 0.60 Flavivirus DB element
(3) ! 4.4e-17 91.5 0.0 Flavi_CRE 10631 10723 + cm no 0.49 Flavivirus 3' UTR cis-acting replication element (CRE)
(4) ! 4.5e-15 74.4 0.0 Flavivirus_DB 10454 10525 + cm no 0.57 Flavivirus DB element
(5) ! 9.9e-15 80.2 0.0 DENV_SLA 1 70 + cm no 0.44 Dengue virus SLA
(6) ! 3.3e-11 62.6 0.0 cHP 97 141 + cm no 0.38 Flavivirus capsid hairpin cHP
For each of the hits above we'll use Infernal's cmsearch
program
to create a structure annotated 'alignment' of the subsequence of our
viral genome that matches the Rfam model. In this case we have 4
models so we do 4 cmsearch
calls, saving the output alignments in
Stockholm format to
a file using the -A
option:
$ cmfetch Rfam.cm Flavivirus_DB | cmsearch -A db.stk - NC_001474.fa
$ cmfetch Rfam.cm Flavi_CRE | cmsearch -A cre.stk - NC_001474.fa
$ cmfetch Rfam.cm DENV_SLA | cmsearch -A sla.stk - NC_001474.fa
$ cmfetch Rfam.cm cHP | cmsearch -A chp.stk - NC_001474.fa
After these 4 commands, we will have 4 stockholm formatted
alignments. Each of these stockholm alignments includes lines that
begin with #=GC SS_cons
which indicate the predicted secondary
structure of the viral genome subsequence that matched well to the
corresponding Rfam model. For example take a look at the chp.stk
file:
# STOCKHOLM 1.0
#=GF AU Infernal 1.1.4
#=GS NC_001474.2/97-141 DE Dengue virus 2, complete genome
NC_001474.2/97-141 AUGAAUAACCAACGGAAAAAGGCGAAAAACACGCCUUUCAAUAUG
#=GR NC_001474.2/97-141 PP *********************************************
#=GC SS_cons ::::::::::::::::<<-<<<<<<_____>>>>>>->>::::::
#=GC RF AUGAaUAAcCAACGAAaAAaGgCgGGAAAACcGcCuuUcAAUAUG
//
For the subsequent steps, we want each of our alignment files to only have a
single sequence in it. Three of these alignments will have a single
sequence in them, but one of them has 2 sequences in it: db.stk
because two of the five hits that we saved (2)
and (4)
were to this
model, while the other 3 models only had 1 hit.
You may not have any models with more than 1 sequence in the output
cmsearch
alignment, but if you do, use the
ali-multiseq-to-singleseq.pl
perl script to split those alignments
into single sequence 'alignment' files, like this:
$ perl jiffy-infernal-hmmer-scripts/ali-multiseq-to-singleseq.pl db.stk
Saved single-sequence alignment with sequence NC_001474.2/10541-10612 to db.stk.1
Saved single-sequence alignment with sequence NC_001474.2/10454-10525 to db.stk.2
Step 6: Convert the stockholm alignments to one-line-per-sequence 'Pfam' stockholm format and remove any gap columns
The esl-alimask
program is installed with infernal.
$ esl-alimask --outformat pfam --keepins -g db.stk.1 > db1.pfam
$ esl-alimask --outformat pfam --keepins -g db.stk.2 > db2.pfam
$ esl-alimask --outformat pfam --keepins -g cre.stk > cre.pfam
$ esl-alimask --outformat pfam --keepins -g sla.stk > sla.pfam
$ esl-alimask --outformat pfam --keepins -g chp.stk > chp.pfam
Step 7: Merge the secondary structure annotations into a single structure annotation string the length of the full genome
The next step is to merge the secondary structure annotation from each of the alignment files into a single secondary structure annotation string. It is crucial that none of the Rfam hits overlap in sequence coordinates in the viral genome, or else this step will not work.
You'll use the ali-sscons-merge.pl
script for this and you'll have to figure out
how many positions are missing from the 5' and 3' ends of each of the hits. For example,
The first Flavivirus_DB
hit in db1.pfam
spans positions 10541
to 10612
(see cmscan
output above).
The NC_001474.2
sequence is 10723
nucleotides long (you can determine this with the command esl-seqstat NC_001474.fa
),
so it is missing 10540
positions on the 5' end and 10723-10612=111
positions on the 3' end.
The second Flavivirus_DB
hit in db2.pfam
spans positions 10631
to 10723
. You then
use the --one_m5
, --one_m3
, --two_m5
, and --two_m3
options to ali-sscons-merge.pl
to merge them:
perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --one_m5 10540 --one_m3 111 --two_m5 10453 --two_m3 198 db1.pfam db2.pfam > merge1.pfam
After this the merge1.pfam
'alignment' will include the sequence from the first alignment file used in the previous ali-sscons-merge.pl
call but will have gaps added to make it the full length (10723
) of the viral genome, and it will also include a SS_cons
annotation
line that includes the predicted secondary structure from both db1.pfam
and db2.pfam
. You then use this as the first alignment
in subsequent ali-sscons-merge.pl
calls to add in the additional
secondary structure from the additional hits with the below three
commands. Because the first alignment in these cases will be the full
length, the --one_m5
and --one_m3
options should not be used.
$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 10630 --two_m3 0 merge1.pfam cre.pfam > merge2.pfam
$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 0 --two_m3 10653 merge2.pfam sla.pfam > merge3.pfam
$ perl jiffy-infernal-hmmer-scripts/ali-sscons-merge.pl --two_m5 96 --two_m3 10582 merge3.pfam chp.pfam > merge4.pfam
There are several commands required for this step.
First, reformat the fasta file to a single sequence pfam format alignment:
$ esl-reformat pfam NC_001474.fa > NC_001474.pfam
And add RF annotation to it:
$ perl jiffy-infernal-hmmer-scripts/ali-pfam-add-x-rf.pl NC_001474.pfam > NC_001474.rf.pfam
Next, extract the full length SS_cons
annotation as a string and output it to a new text file:
$ cat merge4.pfam | grep SS_cons | awk '{ print $3 }' > merge.sscons
And finally, add that SS_cons
annotation to the NC_001474
alignment and convert it
to interleaved Stockholm format so it is easier to look at:
$ perl jiffy-infernal-hmmer-scripts/ali-pfam-add-rflen-sscons.pl NC_001474.rf.pfam merge.sscons > NC_001474.sscons.pfam
$ esl-reformat stockholm NC_001474.sscons.pfam > NC_001474.sscons.stk
There's many ways we could have screwed something up. One of the commands may not have not executed properly, or there could be a bug in one or more of the scripts, or alternatively these steps may just not work for every virus and set of Rfam hits, so we want to check that the structure is valid before continuing. One way to do this is to count how many of the basepairs are Watson-Crick or GU/UG versus non-canonical. If most of the basepairs are non-canonical then we know that something went wrong.
To check this:
$ esl-alistat --bpinfo NC_001474.bpinfo NC_001474.sscons.stk
$ perl jiffy-infernal-hmmer-scripts/esl-alistat-bpinfo-count-compensatory-and-consistent.pl NC_001474.bpinfo > NC_001474.bpinfo.ct
Take a look at the NC_001474.bpinfo.ct
file, extra column #8 (mcwcgu
) will be 'yes' if most common basepair is WC or GU/UG, else 'no'.
$ head -n 24 NC_001474.bpinfo.ct
# Per-column basepair counts:
# Alignment file: NC_001474.sscons.stk
# Alignment idx: 1
# Number of sequences: 1
# Only basepairs involving two canonical (non-degenerate) residues were counted.
# Sequence weights from alignment were ignored (if they existed).
#
# extra column 1: '#wc|gu': number of Watson-Crick+GU+UG basepairs
# extra column 2: '#wc': number of Watson-Crick basepairs
# extra column 3: '#gu': number of GU+UG basepairs
# extra column 4: 'mc': identity of most common basepair
# extra column 5: '#mc': number of most common basepair
# extra column 6: '#bp!mc': number of basepairs != most common basepair
# extra column 6: '#nt!mc': number of nucleotide changes in basepairs != most common basepair
# extra column 8: 'mcwcgu': 'yes' if most common basepair is WC or GU/UG, else 'no'
# extra column 9: '#comp': if 'mcwcgu' is 'yes', number of compensatory nucleotide changes to WC or GU/UG (2 per each mutated basepair), else '-'
# extra column 10: '#cons': if 'mcwcgu' is 'yes', number of consistent nucleotide changes to WC or GU/UG (1 per each mutated basepair), else '-'
# extra column 11: '#incon': if 'mcwcgu' is 'yes', number of inconsistent changes to non-WC/GU/UG (1 or 2 per each mutated basepair), else '-'
#
# lpos rpos AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT #wc|gu #wc #gu mc #mc #bp!mc #nt!mc mcwcgu #comp #cons #incon
# ------- ------- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
4 69 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 UA 1 0 0 yes 0 0 0
5 68 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 GC 1 0 0 yes 0 0 0
6 67 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 UA 1 0 0 yes 0 0 0
The mcwcgu
field will be the 26th space-delimited field for every non-blank line that does not begin with #
, so to count them:
$ grep -v ^\# NC_001474.bpinfo.ct | awk '{ print $26 }' | grep . | sort | uniq -c
5 no
84 yes
If there were more no
values than yes
values, it would be an indication that our secondary structure annotation was probably messed up somehow.
Another way we can catch problems isf if any of the steps above fail, hopefully with an informative error message.
If you haven't yet built your VADR model, you can use v-build.pl
to build it and specify that
the cmbuild
command used to build the CM use your new stockholm alignment NC_001474.sscons.stk
with the command:
$ v-build.pl -f --stk NC_001474.sscons.stk NC_001474 test
If you've already put some work into manually refining your model and do not want to start from scratch again
using v-build.pl
you can rebuild the model using cmbuild
. You'll want to use the same command that
v-build.pl
uses. You can find this in the vadr.cmd
output file in the directory that was created when you
originally built the model with v-build.pl
. That command should look something like:
$ cmbuild -n NC_001474 --verbose --occfile NC_001474/NC_001474.vadr.cmbuild.occ --cp9occfile NC_001474/NC_001474.vadr.cmbuild.cp9occ --fp7occfile NC_001474/NC_001474.vadr.cmbuild.fp7occ NC_001474/NC_001474.vadr.cm NC_001474/NC_001474.vadr.stk > NC_001474/NC_001474.vadr.cmbuild
The *--occfile
options are not essential, so to rerun this, you'd do:
$ cmbuild -n NC_001474 --verbose NC_001474.vadr.cm NC_001474.sscons.stk
And you'll want to copy that CM over the original one, probably after saving a copy of the original, if it is in the directory NC_001474
those commands would be:
$ cp NC_001474/NC_001474.vadr.cm NC_001474/orig.NC_001474.vadr.cm
$ cp NC_001474.vadr.cm NC_001474/NC_001474.vadr.cm
Finally, you need to run cmpress
to index the file, but first delete the old index files for the original model:
$ rm NC_001474/NC_001474.vadr.cm.*
$ cmpress NC_001474/NC_001474.vadr.cm
If you'd like to include the Rfam hits as features in your annotation output, you can manually add them
to the .minfo
file that v-build.pl
created. Take a look at that file:
MODEL NC_001474 blastdb:"test.vadr.protein.fa" length:"10723"
FEATURE NC_001474 type:"gene" coords:"97..10272:+" parent_idx_str:"GBNULL" gene:"POLY"
FEATURE NC_001474 type:"CDS" coords:"97..10272:+" parent_idx_str:"GBNULL" gene:"POLY" product:"polyprotein"
FEATURE NC_001474 type:"mat_peptide" coords:"97..438:+" parent_idx_str:"1" product:"anchored capsid protein ancC"
FEATURE NC_001474 type:"mat_peptide" coords:"97..396:+" parent_idx_str:"1" product:"capsid protein C"
FEATURE NC_001474 type:"mat_peptide" coords:"439..936:+" parent_idx_str:"1" product:"membrane glycoprotein precursor prM"
FEATURE NC_001474 type:"mat_peptide" coords:"439..711:+" parent_idx_str:"1" product:"protein pr"
FEATURE NC_001474 type:"mat_peptide" coords:"712..936:+" parent_idx_str:"1" product:"membrane glycoprotein M"
FEATURE NC_001474 type:"mat_peptide" coords:"937..2421:+" parent_idx_str:"1" product:"envelope protein E"
FEATURE NC_001474 type:"mat_peptide" coords:"2422..3477:+" parent_idx_str:"1" product:"nonstructural protein NS1"
FEATURE NC_001474 type:"mat_peptide" coords:"3478..4131:+" parent_idx_str:"1" product:"nonstructural protein NS2A"
FEATURE NC_001474 type:"mat_peptide" coords:"4132..4521:+" parent_idx_str:"1" product:"nonstructural protein NS2B"
FEATURE NC_001474 type:"mat_peptide" coords:"4522..6375:+" parent_idx_str:"1" product:"nonstructural protein NS3"
FEATURE NC_001474 type:"mat_peptide" coords:"6376..6756:+" parent_idx_str:"1" product:"nonstructural protein NS4A"
FEATURE NC_001474 type:"mat_peptide" coords:"6757..6825:+" parent_idx_str:"1" product:"protein 2K"
FEATURE NC_001474 type:"mat_peptide" coords:"6826..7569:+" parent_idx_str:"1" product:"nonstructural protein NS4B"
FEATURE NC_001474 type:"mat_peptide" coords:"7570..10269:+" parent_idx_str:"1" product:"RNA-dependent RNA polymerase NS5"
You'll want to add new FEATURE
lines for each of the 5 Rfam hits that look something like:
FEATURE NC_001474 type:"ncRNA" coords:"1..70:+" parent_idx_str:"GBNULL" product:"Dengue virus SLA (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"97..141:+" parent_idx_str:"GBNULL" product:"Flavivirus capsid hairpin cHP (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10454..10525:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10541..10612:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element 1 (Rfam 14.9)"
FEATURE NC_001474 type:"ncRNA" coords:"10631..10723:+" parent_idx_str:"GBNULL" product:"Flavivirus DB element 2 (Rfam 14.9)"
After doing this, when you run v-annotate.pl
the ncRNA
features will be annotated in the .tbl
and .ftr
output files.