VADR doesn't annotate second segment of segmented CoV genome #52

taltman · 2021-12-16T22:22:47Z

The Serratus Project expanded the set of known CoV/nidovirus genomes, including segmented ones. An example of a segmented nidovirus similar to the ones that we found is the Pacific salmon nidovirus (MK611985.1). Please see Figure 3 of our preprint for more context:
https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2

When I try to annotate the AmexNV genome, with two segments in the input FASTA file, VADR 1.3 annotates the first segment, and then reports the following for the second one:

>Feature NODE_11_length_12596_cov_95.354468

Additional note(s) to submitter:
ERROR: NO_ANNOTATION: (*sequence*) no significant similarity detected [-]; seq-coords:-; mdl-coords:-; mdl:-;

Yet when I concatenate the two contigs with a run of 16 Ns, I get additional annotations (see below). Is there a way for VADR to recognize the multiple segments and annotate them individually? (See below for the input files used.)

Additional annotations:

22167   27672   gene
                        gene    S
22167   27672   CDS
                        product spike glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_3
27717   28212   gene
                        gene    orf4
27717   28212   CDS
                        product non-structural protein
                        protein_id      NODE_3_length_19124_cov_65.568632_4
28193   28627   gene
                        gene    E
28193   28627   CDS
                        product small membrane protein
                        protein_id      NODE_3_length_19124_cov_65.568632_5
28639   29602   gene
                        gene    M
28639   29602   CDS
                        product membrane glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_6
29646   31439   gene
                        gene    N
29646   31439   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_7
29665   30389   gene
                        gene    N2
29665   30389   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_3_length_19124_cov_65.568632_8

Original FASTA file with two segments:
SRR6788790.epsy.fa.txt

Modified FASTA with the two segments concatenated:
AmexNV-one-contig-test.fa.txt

nawrockie · 2021-12-18T12:51:01Z

@taltman can you please provide the v-annotate.pl command you used?

taltman · 2021-12-19T07:06:34Z

@nawrockie, here is the darth.sh script that I use to call v-annotate.pl:

https://bitbucket.org/tomeraltman/darth/src/dev/src/darth.sh

The relevant section:

    v-annotate.pl \
	--mdir "$data_dir/vadr-models-corona-1.3-3" \
	--mkey corona \
	--mxsize 64000 \
	-f \
	--keep \
	--nomisc \
	"$output_parent_dir/transeq/canonical.fna" \
	"$output_dir"

Also, as a reminder, the Docker container for testing out the environment that I'm using is taltman/darth:tyranus. The Dockerfile in the above Bitbucket repo shows how I set things up.

Thanks for any insights you might be able to share!

taltman · 2021-12-20T18:00:37Z

I guess I could do the following:

1. Concatenate the two segments
2. Run VADR
3. Do some interval arithmetic to adjust the annotations on the second segment

It's not too hard, but I was hoping there might be a more elegant way to do it. Please let me know if you have any suggestions. Thank you!
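
The steps above can be sketched in Python roughly as follows. This is only an illustration, not part of VADR: the function names `concatenate` and `to_segment_coords` are hypothetical, the 16-N spacer is the one mentioned earlier in this thread, and coordinates are assumed to be 1-based and inclusive, as in the feature table output.

```python
SPACER = "N" * 16  # spacer length taken from this thread; any length works


def concatenate(segments, spacer=SPACER):
    """Join (name, sequence) pairs with a spacer between them.

    Returns the joined sequence and a layout list of
    (name, start, end) tuples giving each segment's 1-based,
    inclusive position on the joined sequence.
    """
    parts, layout, pos = [], [], 0
    for i, (name, seq) in enumerate(segments):
        if i > 0:
            parts.append(spacer)
            pos += len(spacer)
        parts.append(seq)
        layout.append((name, pos + 1, pos + len(seq)))
        pos += len(seq)
    return "".join(parts), layout


def to_segment_coords(start, end, layout):
    """Map an interval on the joined sequence back to one segment.

    Returns (name, start, end) in that segment's own coordinates, or
    None if the interval touches the spacer or spans two segments.
    Minus-strand features (start > end) are handled the same way.
    """
    for name, seg_start, seg_end in layout:
        if seg_start <= start <= seg_end and seg_start <= end <= seg_end:
            offset = seg_start - 1
            return name, start - offset, end - offset
    return None
```

For example, assuming the contig names reflect the true lengths (19124 and 12596) and that NODE_3 comes first, the second segment starts at position 19141 of the joined sequence, so a feature reported at 22167..27672 would map to 3027..8532 on NODE_11.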

nawrockie · 2021-12-20T21:17:53Z

Thanks for providing the command that you used.
I was able to reproduce your results and I took a closer
look. You get the NO_ANNOTATION error for the second
sequence in your SRR6788790.epsy.fa.txt file because that sequence
is not recognized as homologous (based on sequence similarity)
by vadr to any of the coronavirus RefSeq sequences. The first sequence
is recognized as homologous though, so when you concatenate them the
homology from the first sequence in the concatenated sequence makes the
concatenated sequence get recognized as homologous.

After recognizing that the concatenated sequence is homologous,
VADR aligns the entire sequence to the best matching model
and infers annotation based on that alignment. This is why you get
additional annotation that extends beyond the region of homology
(to the spike protein, etc.) in the region of the second sequence.

That region doesn't match well to the model, but the alignment has been
forced and the inferred positions of all the features from the RefSeq
are reported in the .ftr and .tbl files. Note that there are many
serious alerts/errors reported for these features (early stop codons,
mutated start codons, frameshifts, lack of protein homology from
blastx (indefant)) indicating that the annotations for these features
are not trustworthy.

Unfortunately, if you want to force vadr to annotate these seqs,
you will probably need to manually concatenate them.

I see that you are trying to do remote homology detection/annotation
but using vadr with the coronavirus RefSeq models is very limited in
this respect because it works primarily in nucleotide space. It was
really only designed for annotating sequences highly similar to the
model RefSeq. You're just coming up against the limit of the capability of
the tool.

That said, the best way to maximize the remote homology detection
ability of vadr would be to create a diverse multiple sequence alignment of
coronavirus genomes and make a single profile model from that
alignment. You'd also have to know the feature positions in the
reference coordinate space to put in the .minfo file.

However, the power of this nucleotide profile for remote homology
detection would still be far less than any protein-based homology
search method.

taltman · 2021-12-27T22:49:00Z

Thanks for your reply. Yes, we are using VADR to annotate CoV genomes that are very distant from the known CoV genomes in GenBank. I agree with the approach you describe (we in fact used protein-space searches to find these distant CoV genomes). But due to lack of time and funding, I was trying to leverage an existing pipeline, and VADR performed the best.

I'd be happy to "pay back" the VADR project by doing any sort of model building with the newer genomes, to help the 'corona' model include these distant members. Just let me know how I can help. Is this model building described in the Wiki?

taltman · 2021-12-27T22:49:46Z

I forgot to mention: the concatenation + genomic interval arithmetic method worked!
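
For anyone who lands here later, the interval arithmetic on the feature table can be sketched roughly as below. This is a minimal sketch rather than the exact code used: it assumes the table is the tab-delimited 5-column format (coordinate lines look like `start<TAB>stop<TAB>key`, qualifier lines begin with tabs), and the helper names `shift_coord` and `shift_feature_line` are made up for illustration.

```python
import re

# A coordinate is digits, optionally prefixed by "<" or ">" for partial features.
COORD = re.compile(r"^([<>]?)(\d+)$")


def shift_coord(token, offset):
    """Subtract offset from one coordinate token, keeping any < or > prefix."""
    match = COORD.match(token)
    if match is None:
        raise ValueError(f"not a coordinate: {token!r}")
    return f"{match.group(1)}{int(match.group(2)) - offset}"


def shift_feature_line(line, offset):
    """Shift the two leading coordinates of a feature-table line by offset.

    Lines that do not start with two coordinates (qualifier lines,
    ">Feature" headers, blank lines) are returned unchanged, minus
    the trailing newline.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and COORD.match(fields[0]) and COORD.match(fields[1]):
        fields[0] = shift_coord(fields[0], offset)
        fields[1] = shift_coord(fields[1], offset)
    return "\t".join(fields)
```

Here `offset` would be the length of the first segment plus the spacer length (19124 + 16 = 19140 if the contig names reflect the true lengths). The sketch does not rename the `>Feature` header or the `protein_id` values, and it does not check whether a feature actually lies on the second segment, so those would still need handling separately.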
