VADR doesn't annotate second segment of segmented CoV genome #52

Open
taltman opened this issue Dec 16, 2021 · 6 comments

Comments

@taltman

taltman commented Dec 16, 2021

The Serratus Project expanded the set of known CoV/nidovirus genomes, including segmented ones. An example of a segmented nidovirus similar to the ones that we found is the Pacific salmon nidovirus (MK611985.1). Please see Figure 3 of our preprint for more context:
https://www.biorxiv.org/content/10.1101/2020.08.07.241729v2

When I try to annotate the AmexNV genome, with two segments in the input FASTA file, VADR 1.3 annotates the first segment, and then reports the following for the second one:

>Feature NODE_11_length_12596_cov_95.354468

Additional note(s) to submitter:
ERROR: NO_ANNOTATION: (*sequence*) no significant similarity detected [-]; seq-coords:-; mdl-coords:-; mdl:-;

Yet when I concatenate the two contigs with a run of 16 Ns, I get additional annotations (see below). Is there a way for VADR to recognize the multiple segments and annotate them individually? The input files I used are attached at the end of this comment.

Additional annotations:

22167   27672   gene
                        gene    S
22167   27672   CDS
                        product spike glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_3
27717   28212   gene
                        gene    orf4
27717   28212   CDS
                        product non-structural protein
                        protein_id      NODE_3_length_19124_cov_65.568632_4
28193   28627   gene
                        gene    E
28193   28627   CDS
                        product small membrane protein
                        protein_id      NODE_3_length_19124_cov_65.568632_5
28639   29602   gene
                        gene    M
28639   29602   CDS
                        product membrane glycoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_6
29646   31439   gene
                        gene    N
29646   31439   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_3_length_19124_cov_65.568632_7
29665   30389   gene
                        gene    N2
29665   30389   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_3_length_19124_cov_65.568632_8

Original FASTA file with two segments:
SRR6788790.epsy.fa.txt

Modified FASTA with the two segments concatenated:
AmexNV-one-contig-test.fa.txt

@nawrockie
Member

@taltman can you please provide the v-annotate.pl command you used?

@taltman
Author

taltman commented Dec 19, 2021

@nawrockie, here is the darth.sh script that I use to call v-annotate.pl:

https://bitbucket.org/tomeraltman/darth/src/dev/src/darth.sh

The relevant section:

    v-annotate.pl \
	--mdir $data_dir/vadr-models-corona-1.3-3 \
	--mkey corona \
	--mxsize 64000 \
	-f \
	--keep \
	--nomisc \
	$output_parent_dir/transeq/canonical.fna \
	$output_dir \

Also, as a reminder, the Docker container for testing out the environment that I'm using is taltman/darth:tyranus. The Dockerfile in the above Bitbucket repo shows how I set things up.

Thanks for any insights you might be able to share!

@taltman
Author

taltman commented Dec 20, 2021

I guess I could do the following:

  1. Concatenate the two segments
  2. Run VADR
  3. Do some interval arithmetic to adjust the annotations on the second segment

It's not too hard, but I was hoping there might be a more elegant way to do it. Please let me know if you have any suggestions. Thank you!
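
For anyone trying the same workaround, the three steps above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not part of VADR or darth.sh: the function names `concatenate` and `to_segment_coords` are hypothetical, coordinates are assumed to be plain 1-based inclusive integers (partial-feature markers such as `<` or `>` in a .tbl file would need to be stripped first), and the spacer is assumed to be the run of 16 Ns mentioned in the first comment.

```python
from typing import List, Optional, Tuple

SPACER_LEN = 16  # assumption: the run of 16 Ns used in this thread

Span = Tuple[str, int, int]  # (segment name, start, end), 1-based inclusive


def concatenate(segments: List[Tuple[str, str]],
                spacer_len: int = SPACER_LEN) -> Tuple[str, List[Span]]:
    """Join (name, sequence) pairs with runs of N between them.

    Returns the joined sequence and the span each segment occupies in it.
    """
    parts: List[str] = []
    spans: List[Span] = []
    offset = 0  # number of bases already placed in the joined sequence
    for i, (name, seq) in enumerate(segments):
        if i > 0:
            parts.append("N" * spacer_len)
            offset += spacer_len
        spans.append((name, offset + 1, offset + len(seq)))
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), spans


def to_segment_coords(start: int, end: int,
                      spans: List[Span]) -> Optional[Span]:
    """Map an interval on the joined sequence back to its own segment.

    Returns None if the interval touches the spacer or crosses segments.
    Minus-strand intervals (start > end) keep their orientation.
    """
    lo, hi = min(start, end), max(start, end)
    for name, seg_start, seg_end in spans:
        if seg_start <= lo and hi <= seg_end:
            shift = seg_start - 1
            return name, start - shift, end - shift
    return None


if __name__ == "__main__":
    # Placeholder sequences with the lengths implied by the contig names.
    joined, spans = concatenate([("NODE_3", "A" * 19124),
                                 ("NODE_11", "C" * 12596)])
    print(to_segment_coords(22167, 27672, spans))
```

If the first contig really is 19,124 nt long, as its name suggests, and comes first in the concatenated file, the shift for the second contig is 19124 + 16 = 19140, so the S gene reported at 22167..27672 would correspond to 3027..8532 on the second contig. Features that come back as `None` overlap the spacer or straddle both contigs and need a manual look.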

@nawrockie
Member

Thanks for providing the command that you used.
I was able to reproduce your results and I took a closer
look. The reason you get the NO_ANNOTATION error for the second
sequence in your SRR6788790.epsy.fa.txt file is that VADR does not
recognize that sequence as homologous (based on sequence similarity)
to any of the coronavirus RefSeq sequences. The first sequence
is recognized as homologous, though, so when you concatenate them, the
homology from the first sequence causes the whole concatenated
sequence to be recognized as homologous.

After recognizing that the concatenated sequence is homologous,
VADR then aligns the entire sequence to the best-matching model
and infers annotation based on that alignment. This is why you get
additional annotation that extends beyond the region of homology
(to the spike protein, etc.) in the region of the second sequence.

That region doesn't match well to the model, but the alignment has been
forced and the inferred positions of all the features from the RefSeq
are reported in the .ftr and .tbl files. Note that there are many
serious alerts/errors reported for these features (early stop codons,
mutated start codons, frameshifts, lack of protein homology from
blastx (indefant)) indicating that the annotations for these features
are not trustworthy.

Unfortunately, if you want to force VADR to annotate these sequences,
you will probably need to manually concatenate them.

I see that you are trying to do remote homology detection/annotation,
but using VADR with the coronavirus RefSeq models is very limited in
this respect because it works primarily in nucleotide space. It was
really only designed for annotating sequences highly similar to the
model RefSeq. You're just coming up against the limits of what the
tool can do.

That said, the best way to maximize the remote homology detection
ability of vadr would be to create a diverse multiple sequence alignment of
coronavirus genomes and make a single profile model from that
alignment. You'd also have to know the feature positions in the
reference coordinate space to put in the .minfo file.

However, the power of this nucleotide profile for remote homology
detection would still be far less than any protein-based homology
search method.

@taltman
Author

taltman commented Dec 27, 2021

Thanks for your reply. Yes, we are using VADR to annotate CoV genomes that are very distant from the known CoV genomes in GenBank. I agree with the approach you described (we in fact used protein-space searches to find these distant CoV genomes). But due to lack of time and funding, I was trying to leverage an existing pipeline, and VADR performed the best.

I'd be happy to "pay back" the VADR project by doing any sort of model building with the newer genomes, to help the 'corona' model include these distant members. Just let me know how I can help. Is this model building described in the Wiki?

@taltman
Author

taltman commented Dec 27, 2021

I forgot to mention: the concatenation + genomic interval arithmetic method worked!
