Influenza annotation
- Most of the flu VADR models are based on FLAN reference sequences
- Changes made to covariance models (flu.cm)
The steps below explain how to use VADR for flu sequence validation and annotation. Testing of VADR for flu is still ongoing and these recommendations are subject to change.
GenBank is not currently using VADR to automatically process flu sequence submissions like it does for SARS-CoV-2. GenBank flu sequence submissions are currently validated and annotated using the FLAN annotation tool. Our testing indicates that VADR and FLAN give identical results on most flu sequences.
Steps for using VADR for flu annotation:
1. Download and install the latest version of VADR (v1.6.3 or later), following the instructions on this page. Alternatively, you can use the StaPH-B VADR 1.6.3 docker image created by Curtis Kapsak (docker image names: staphb/vadr:1.6.3 and staphb/vadr:latest), available on dockerhub and quay. A brief README for the docker image is here.
2. Download the latest flu VADR models (version 1.6.3-2, gzipped tarball) from here and unpack them (e.g. tar xfz <tarball.gz>). Note the path to the directory created (<flu-models-dir-path>) for step 4. (If you are using the docker image you can skip this step, as the flu VADR models are included.)
3. WARNING: the fasta-trim-terminal-ambigs.pl script will not exactly reproduce the trimming that the GenBank pipeline does in some rare cases, but should fix the large majority of the discrepancies you might see between local VADR results and GenBank results. To remove terminal ambiguous nucleotides from your sequence file <input-fasta-file> and to remove short sequences, creating a new trimmed file <trimmed-fasta-file>, execute:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 <input-fasta-file> > <trimmed-fasta-file>
4. Run the v-annotate.pl program on an input trimmed fasta file with flu sequences using the recommended command and options below (the command is long so you will likely have to scroll to the right to view the entire command).
v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir <flu-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create>
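Because the recommended command is long, it can help to assemble it in one place and review it before running. The following is a minimal sketch, not part of VADR: vadr_flu_cmd is a hypothetical helper name, and it only prints the command from step 4 with the model directory, input file, output directory and CPU count filled in.

```shell
# vadr_flu_cmd MODELDIR FASTA OUTDIR [NCPU]
# Hypothetical helper (not shipped with VADR): prints the recommended
# v-annotate.pl command for flu so it can be reviewed before it is run.
vadr_flu_cmd() {
  # --split/--cpu: split the input and process chunks in parallel
  # -r: replace stretches of Ns; --atgonly: ATG is the only valid start codon
  # --xnocomp: no blastx composition-based stats; --nomisc: never use misc_feature
  printf '%s ' v-annotate.pl --split --cpu "${4:-8}" -r --atgonly --xnocomp --nomisc
  # extrant5/extrant3 alerts cause failure; use the flu.* model files in MODELDIR
  printf '%s ' --alt_fail extrant5,extrant3 --mkey flu --mdir "$1"
  printf '%s %s\n' "$2" "$3"
}
```

For example, vadr_flu_cmd <flu-models-dir-path> <fasta-file-to-annotate> <output-directory-to-create> prints the command shown above, and a fourth argument such as 4 changes the --cpu value.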
This section shows output from an example v-annotate.pl annotation of three flu sequences from GenBank using the above command and options.
The fasta file of those three sequences can be downloaded here.
(A similar example for norovirus sequences, which may contain more details on certain aspects, is here.)
For this example, the flu model directory is in /usr/local/vadr-models-flu-1.6.3-2 and the pretrim.flu.3.fa sequence file is in the current directory. We will create a new file, flu.3.fa, with trimmed sequences in the next step.
As explained above, remove terminal ambiguous nucleotides and sequences that are shorter than 60nt with the command:
$VADRSCRIPTSDIR/miniscripts/fasta-trim-terminal-ambigs.pl --minlen 60 pretrim.flu.3.fa > flu.3.fa
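Before or after trimming, it can be useful to see which records fall under the length cutoff. The sketch below is an illustration only: fasta_short_seqs is a hypothetical helper name, and it reports raw sequence lengths from a FASTA file without reproducing any of the trimming that fasta-trim-terminal-ambigs.pl performs.

```shell
# fasta_short_seqs MINLEN FASTA
# Hypothetical helper: prints "<name><TAB><length>" for every FASTA record
# whose sequence is shorter than MINLEN nucleotides.
fasta_short_seqs() {
  awk -v min="$1" '
    # Print the record collected so far if it is shorter than the cutoff.
    function report() { if (name != "" && len < min) print name "\t" len }
    # Header line: finish the previous record and start a new one.
    /^>/ { report(); name = substr($1, 2); len = 0; next }
    # Sequence line: ignore whitespace and add to the running length.
    { gsub(/[ \t\r]/, ""); len += length($0) }
    END { report() }
  ' "$2"
}
```

Running fasta_short_seqs 60 flu.3.fa on the trimmed file should print nothing if every remaining sequence is at least 60 nt long.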
Next, to annotate the trimmed sequences using the above v-annotate.pl options for flu, run the following command (scroll to the right to see the full command), which will create a new directory called my3 into which VADR output files will be written.
v-annotate.pl --split --cpu 8 -r --atgonly --xnocomp --nomisc --alt_fail extrant5,extrant3 --mkey flu --mdir /usr/local/vadr-models-flu-1.6.3 flu.3.fa my3
The options used are explained further below.
When you execute the above command, you should see output similar to the following block that lists relevant environment variable values, and input arguments and options:
# v-annotate.pl :: classify and annotate sequences using a model library
# VADR 1.6.3 (Dec 2023)
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# date: Wed Dec 20 11:31:39 2023
# $VADRBIOEASELDIR: /home/nawrocki/vadr-install-1.6.3/Bio-Easel
# $VADRBLASTDIR: /home/nawrocki/vadr-install-1.6.3/ncbi-blast/bin
# $VADREASELDIR: /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRINFERNALDIR: /home/nawrocki/vadr-install-1.6.3/infernal/binaries
# $VADRMODELDIR: /home/nawrocki/vadr-install-1.6.3/vadr-models-calici
# $VADRSCRIPTSDIR: /home/nawrocki/vadr-install-1.6.3/vadr
#
# sequence file: flu.3.fa
# output directory: my3
# only consider ATG a valid start codon: yes [--atgonly]
# specify that alert codes in <s> cause FAILure: extrant5,extrant3 [--alt_fail]
# .cm, .minfo, blastn .fa files in $VADRMODELDIR start with key <s>, not 'vadr': flu [--mkey]
# model files are in directory <s>, not in $VADRMODELDIR: /usr/local/vadr-models-flu-1.6.3 [--mdir]
# in feature table for failed seqs, never change feature type to misc_feature: yes [--nomisc]
# turn off composition-based for blastx statistics with -comp_based_stats 0: yes [--xnocomp]
# replace stretches of Ns with expected nts, where possible: yes [-r]
# split input file into chunks, run each chunk separately: yes [--split]
# parallelize across <n> CPU workers (requires --split or --glsearch): 8 [--cpu]
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Next, v-annotate.pl will output information as it proceeds through different steps of the analysis:
# Validating input ... done. [ 0.4 seconds]
# Splitting sequence file into chunks to run independently in parallel on 8 processors ... done. [ 0.5 seconds]
# Executing 3 scripts in parallel on 8 processors to process 3 partition(s) of all 3 sequence(s) ...
# 3 of 3 jobs finished (0.1 minutes spent waiting)
# done. [ 10.1 seconds]
# Merging and finalizing output ... done. [ 0.9 seconds]
With the --split --cpu 8 options, the input fasta file is split into chunks and v-annotate.pl is run separately on those chunks, on up to 8 CPUs in parallel. When all sequences have finished processing, the main script merges the output.
The v-annotate.pl output then includes a summary of the classification of sequences and the alerts reported:
# Summary of classified sequences:
#
# num num num
#idx model group subgroup seqs pass fail
#--- --------- --------- -------- ---- ---- ----
1 AY191501 fluB-seg6 - 1 0 1
2 CY005979 fluA-seg4 H13 1 1 0
3 NC_036618 fluD-seg4 - 1 1 0
#--- --------- --------- -------- ---- ---- ----
- *all* - - 3 2 1
- *none* - - 0 0 0
#--- --------- --------- -------- ---- ---- ----
#
# Summary of reported alerts:
#
# alert causes short per num num long
#idx code failure description type cases seqs description
#--- -------- ------- ------------------ ------- ----- ---- -----------
1 cdsstopn yes* CDS_HAS_STOP_CODON feature 1 1 in-frame stop codon exists 5' of stop position predicted by homology to reference
2 cdsstopp yes* CDS_HAS_STOP_CODON feature 1 1 stop codon in protein-based alignment
#--- -------- ------- ------------------ ------- ----- ---- -----------
And finally a list of the output files created:
# Output printed to screen saved in: my3.vadr.log
# List of executed commands saved in: my3.vadr.cmd
# List and description of all output files saved in: my3.vadr.filelist
# esl-seqstat -a output for input fasta file saved in: my3.vadr.seqstat
# 5 column feature table output for passing sequences saved in: my3.vadr.pass.tbl
# 5 column feature table output for failing sequences saved in: my3.vadr.fail.tbl
# list of passing sequences saved in: my3.vadr.pass.list
# list of failing sequences saved in: my3.vadr.fail.list
# list of alerts in the feature tables saved in: my3.vadr.alt.list
# fasta file with passing sequences saved in: my3.vadr.pass.fa
# fasta file with failing sequences saved in: my3.vadr.fail.fa
# per-sequence tabular annotation summary file saved in: my3.vadr.sqa
# per-sequence tabular classification summary file saved in: my3.vadr.sqc
# per-feature tabular summary file saved in: my3.vadr.ftr
# per-model-segment tabular summary file saved in: my3.vadr.sgm
# per-alert tabular summary file saved in: my3.vadr.alt
# alert count tabular summary file saved in: my3.vadr.alc
# per-model tabular summary file saved in: my3.vadr.mdl
# alignment doctoring tabular summary file saved in: my3.vadr.dcr
# replaced stretches of Ns summary file (-r) saved in: my3.vadr.rpn
#
# All output files created in directory ./my3/
#
# Elapsed time: 00:00:12.12
# hh:mm:ss
#
[ok]
Note that all the output files will be in the newly created my3 directory.
The Summary of classified sequences listed that two sequences passed and one failed.
The file my3.vadr.pass.list lists the two sequences that passed:
MF147925.1
ON166841.1
and my3.vadr.fail.list lists the sequence that failed:
MH606682.1
Also, FASTA-formatted sequence files for the passing and failing sequences are my3.vadr.pass.fa and my3.vadr.fail.fa, respectively.
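In a scripted workflow, the pass and fail lists give a quick way to check a run. The following sketch assumes the <dir>/<dir>.vadr.* file naming shown in the output above and that both list files exist; vadr_pass_fail_summary is a hypothetical helper name, not a VADR command.

```shell
# vadr_pass_fail_summary OUTDIR
# Hypothetical helper: counts the sequences named in the pass and fail lists
# of a v-annotate.pl output directory and returns non-zero if any failed.
vadr_pass_fail_summary() {
  # Output files are named <dir>/<dir>.vadr.<suffix>, e.g. my3/my3.vadr.pass.list
  root="$1/$(basename "$1").vadr"
  # Count non-empty lines; "n + 0" prints 0 for an empty file.
  npass=$(awk 'NF { n++ } END { print n + 0 }' "$root.pass.list")
  nfail=$(awk 'NF { n++ } END { print n + 0 }' "$root.fail.list")
  printf 'pass=%d fail=%d\n' "$npass" "$nfail"
  [ "$nfail" -eq 0 ]
}
```

For the example run, vadr_pass_fail_summary my3 would print pass=2 fail=1 and return a non-zero status, so it is best called inside an if statement in scripts that use set -e.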
For the two sequences that passed, the annotation is available in the output my3.vadr.pass.tbl file, and for the one sequence that failed, the annotation is in the my3.vadr.fail.tbl file.
Here is the output for the two sequences in the my3.vadr.pass.tbl file:
>Feature MF147925.1
42 1739 gene
gene HA
42 1739 CDS
product hemagglutinin
function receptor binding and fusion protein
protein_id MF147925.1_1
42 95 sig_peptide
protein_id MF147925.1_1
96 1067 mat_peptide
product HA1
protein_id MF147925.1_1
1068 1736 mat_peptide
product HA2
protein_id MF147925.1_1
>Feature ON166841.1
<1 1936 gene
gene HEF
<1 1936 CDS
product hemagglutinin-esterase precursor
codon_start 2
protein_id ON166841.1_1
And for the sequence in the my3.vadr.fail.tbl file:
>Feature MH606682.1
35 337 gene
gene NB
35 337 CDS
product NB protein
protein_id MH606682.1_1
42 1442 gene
gene NA
42 1442 CDS
product neuraminidase
protein_id MH606682.1_2
Additional note(s) to submitter:
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) in-frame stop codon exists 5' of stop position predicted by homology to reference [TGA, shifted S:108,M:108]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;
ERROR: CDS_HAS_STOP_CODON: (CDS:NB protein) stop codon in protein-based alignment [-]; seq-coords:227..229:+; mdl-coords:239..241:+; mdl:AY191501;
Feature table format is described at https://www.ncbi.nlm.nih.gov/WebSub/html/help/feature-table.html.
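Feature tables like the ones above can be scanned with standard text tools. As a rough sketch, the function below prints one line per CDS with its sequence name, coordinates and product. tbl_list_cds is a hypothetical helper name, and the parsing assumes whitespace-separated coordinate lines followed by a product qualifier, as in the example output; it is not a full feature table parser.

```shell
# tbl_list_cds TBLFILE
# Hypothetical helper: prints "<seq><TAB><start><TAB><stop><TAB><product>"
# for each CDS feature in a 5 column feature table.
tbl_list_cds() {
  awk '
    # New sequence: remember its name.
    /^>Feature/ { seq = $2; in_cds = 0; next }
    # Coordinate lines start with a number, optionally prefixed by < or >.
    $1 ~ /^[<>]?[0-9]+$/ {
      if (NF >= 3) { in_cds = ($3 == "CDS"); start = $1; stop = $2 }
      else if (in_cds) { stop = $2 }   # extra interval of a multi-segment CDS
      next
    }
    # First product qualifier after a CDS line: report the CDS.
    in_cds && $1 == "product" {
      sub(/^[ \t]*product[ \t]+/, "")
      print seq "\t" start "\t" stop "\t" $0
      in_cds = 0
    }
  ' "$1"
}
```

Called as tbl_list_cds my3/my3.vadr.pass.tbl, this lists the hemagglutinin CDS of MF147925.1 and the hemagglutinin-esterase precursor CDS of ON166841.1 without the mat_peptide and sig_peptide features.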
Note that the end of the fail.tbl file lists ERRORs for MH606682.1. In this case, the NB protein coding region has a stop codon 108 nucleotides earlier than expected. To investigate issues like these further, it can be helpful to add the --keep option to v-annotate.pl, which results in additional output files being created, including alignment files.
ERROR lines such as these are meant to highlight potential problems for manual review or regions of interest to the user, but they do not necessarily mean that there is a problem with the sequence.
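When many sequences fail, a tally of the alert names on the ERROR lines shows which problems dominate. The sketch below assumes ERROR lines of the form "ERROR: <ALERT_NAME>: ..." as in the example above; vadr_tbl_alert_counts is a hypothetical helper name.

```shell
# vadr_tbl_alert_counts FAILTBL
# Hypothetical helper: prints "<count> <ALERT_NAME>" for each alert name
# found on ERROR lines of a .fail.tbl file.
vadr_tbl_alert_counts() {
  # Split on ": " so field 2 is the alert name, e.g. CDS_HAS_STOP_CODON.
  awk -F': ' '/^[ \t]*ERROR: / { n[$2]++ } END { for (a in n) print n[a], a }' "$1" |
    sort -k2
}
```

For the example my3/my3.vadr.fail.tbl file, both ERROR lines carry the same alert name, so the tally has a single entry for CDS_HAS_STOP_CODON with a count of 2.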
The annotation information is also available in other files with different formats, such as the my3/my3.vadr.ftr file, which may be easier to parse for some applications.
For examples of alerts/errors and more information on how to interpret the VADR output related to those alerts in the output feature tables, the .alt.list file and the GenBank submission portal detailed error report .tsv files, see this vadr documentation page.
See the vadr README documentation section for more information on how to interpret VADR results and output, including information on file formats.
References for papers on VADR are listed below.
The options used in the above command are the recommended set of options for flu annotation. Each option is briefly explained in the table below. More information can be found here.
option | explanation |
---|---|
--split | split input file into chunks of about 300Kb and run each chunk separately (300Kb can be changed to <n> by adding option --nkb <n>) |
--cpu 8 | for input sequence files > 300Kb, run multi-threaded by parallelizing across up to 8 CPU workers (8 can be changed to <n1> with --cpu <n1>), requires --split |
-r | turn on the replace-N strategy: replace stretches of Ns with expected nucleotides, where possible |
--atgonly | specify that 'ATG' is the only start codon that should be considered valid |
--xnocomp | turn off composition-based stats for blastx, which seems to increase the length of flu blastx alignments in some cases |
--nomisc | specify that features for failing sequences not be changed to misc_features in the output .tbl file |
--alt_fail extrant5,extrant3 | specify that extrant5 and extrant3 alerts, which are reported when one or more additional nucleotides compared with the reference are detected at the 5' or 3' end, cause a sequence to fail |
--mkey flu | use the model files with prefix flu in the directory from --mdir |
--mdir /usr/local/vadr-models-flu-1.6.3-2 | specify that the models to use are in directory /usr/local/vadr-models-flu-1.6.3-2 |
The -r option is also used for SARS-CoV-2 annotation with VADR. For more information on the option in that context, see this page.
As of December 20, 2023, the VADR model library for flu annotation (vadr-models-flu-1.6.3-2 model file) includes 70 flu models covering influenza A, B, C and D types.
For each type/segment/subtype, the table below lists the accessions the models are derived from (70 total accessions) and the gene and CDS product names for each.
type | seg | subtype | accession | gene | CDS product |
---|---|---|---|---|---|
A | 1 | - | CY002079.1, CY103881.1, CY125942.1 | PB2 | polymerase PB2 |
A | 2 | - | CY003646.1, CY103882.1, CY125943.1 | PB1, PB1-F2 | polymerase PB1, PB1-F2 protein |
A | 3 | - | CY003645.1, CY103883.1, CY125944.1 | PA, PA-X | polymerase PA, PA-X protein |
A | 4 | H1 | CY000449.2 | HA | hemagglutinin |
A | 4 | H2 | CY003907.1 | HA | hemagglutinin |
A | 4 | H3 | CY002000.1 | HA | hemagglutinin |
A | 4 | H4 | CY004847.1 | HA | hemagglutinin |
A | 4 | H5 | DQ864721.1 | HA | hemagglutinin |
A | 4 | H6 | DQ376635.1 | HA | hemagglutinin |
A | 4 | H7 | CY006037.1 | HA | hemagglutinin |
A | 4 | H8 | CY005970.1 | HA | hemagglutinin |
A | 4 | H9 | CY004642.1 | HA | hemagglutinin |
A | 4 | H10 | CY006001.1 | HA | hemagglutinin |
A | 4 | H11 | CY006005.1 | HA | hemagglutinin |
A | 4 | H12 | CY006008.1 | HA | hemagglutinin |
A | 4 | H13 | CY005979.1 | HA | hemagglutinin |
A | 4 | H14 | M35997.1 | HA | hemagglutinin |
A | 4 | H15 | CY006034.1 | HA | hemagglutinin |
A | 4 | H16 | AY684891.1 | HA | hemagglutinin |
A | 4 | H17 | CY103884.1 | HA | hemagglutinin |
A | 4 | H18 | CY125945.1 | HA | hemagglutinin |
A | 4 | H19 | ON637239.1 | HA | hemagglutinin |
A | 5 | - | CY006079.1, CY103885.1, CY125946.1 | NP | nucleocapsid protein |
A | 6 | N1 | CY002538.1 | NA | neuraminidase |
A | 6 | N2 | CY002010.1 | NA | neuraminidase |
A | 6 | N3 | CY005890.1 | NA | neuraminidase |
A | 6 | N4 | CY005359.1 | NA | neuraminidase |
A | 6 | N5 | CY004429.1 | NA | neuraminidase |
A | 6 | N6 | CY005641.1 | NA | neuraminidase |
A | 6 | N7 | CY004435.1 | NA | neuraminidase |
A | 6 | N8 | CY004056.1 | NA | neuraminidase |
A | 6 | N9 | CY004131.1 | NA | neuraminidase |
A | 6 | N10 | CY103886.1 | NA | neuraminidase |
A | 6 | N11 | CY125947.1 | NA | neuraminidase-like protein |
A | 7 | - | CY002009.1, CY103887.1, CY125948.1 | M2, M1 | matrix protein 2, matrix protein 1 |
A | 8 | - | CY002284.1, CY103888.1, CY125949.1 | NEP, NS1 | nuclear export protein, nonstructural protein 1 |
B | 1 | - | EF626642.1 | PB1 | polymerase PB1 |
B | 2 | - | AY504599.1 | PB2 | polymerase PB2 |
B | 3 | - | EF626633.1 | PA | polymerase PA |
B | 4 | - | AF387493.1 | HA | hemagglutinin |
B | 5 | - | EF626631.1 | NP | nucleoprotein |
B | 6 | - | AY191501.1 | NB, NA | NB protein, neuraminidase |
B | 7 | - | AY504605.1 | M1, BM2 | matrix protein 1, BM2 protein |
B | 8 | - | AY504614.1 | NEP, NS1 | nuclear export protein, nonstructural protein 1 |
C | 1 | - | NC_006307.2 | PB2 | polymerase PB2 |
C | 2 | - | NC_006308.2 | PB1 | polymerase PB1 |
C | 3 | - | NC_006309.2 | P3 | polymerase P3 |
C | 4 | - | NC_006310.2 | HE | hemagglutinin-esterase |
C | 5 | - | NC_006311.1 | NP | nucleoprotein |
C | 6 | - | NC_006312.2 | M1, CM2 | matrix protein 1, CM2 protein |
C | 7 | - | NC_006306.2 | NEP, NS1 | nonstructural protein 2, nonstructural protein 1 |
D | 1 | - | NC_036616.1 | PB2 | polymerase PB2 |
D | 2 | - | NC_036615.1 | PB1 | polymerase PB1 |
D | 3 | - | NC_036619.1 | P3 | polymerase 3 |
D | 4 | - | NC_036618.1 | HEF | hemagglutinin-esterase precursor |
D | 5 | - | NC_036617.1 | NP | nucleoprotein |
D | 6 | - | NC_036620.1 | P42 | P42 |
D | 7 | - | NC_036621.1 | NS2, NS1 | nonstructural protein 2, nonstructural protein 1 |
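The total of 70 models can be checked by adding up the accessions in the table by type. The short calculation below is only a worked tally of the rows above; the variable names are illustrative.

```shell
# Influenza A: three accessions each for segments 1, 2, 3, 5, 7 and 8,
# plus one per HA subtype (H1-H19) and one per NA subtype (N1-N11).
flu_a=$(( 6 * 3 + 19 + 11 ))
# Influenza B has 8 segments; influenza C and D have 7 segments each.
flu_b=8
flu_c=7
flu_d=7
echo "A=$flu_a B=$flu_b C=$flu_c D=$flu_d total=$(( flu_a + flu_b + flu_c + flu_d ))"
```

This gives 48 models for influenza A and 22 for types B, C and D combined.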
Currently, and for the past several years, influenza sequence submissions to GenBank have been automatically processed by FLAN (FLu ANnotation tool; see the reference below). The VADR flu models (v1.6.3-2) were derived from the FLAN reference sequences, with some changes and additions to improve VADR performance. There is more information on this in the 00NOTES.txt and mapping-flan-to-genbank/00NOTES.txt files included in the vadr-models-flu-1.6.3-2.tar.gz gzipped tarball. We are also preparing a manuscript describing the use of VADR for flu annotation that will include a discussion of how the VADR models were created based on FLAN, and a comparison of flu annotation results using the two tools.
For two models, additional steps (besides running VADR's v-build.pl program) were needed to create optimally performing covariance models in the flu.cm file. The model tarball includes the two .stk files used to build those models. These alignments were used as input to the cmbuild program of Infernal to create the models in flu.cm:
alignment file | explanation |
---|---|
CY006079.2.stk | training alignment used to build the CY006079 (fluA-seg5) model in flu.cm. Includes 2 sequences, CY006079.1 and CY006079.1-1542-ins-a, which is identical to CY006079.1 with an insertion of a single 'a' after position 1542. Building a model from this two-sequence alignment instead of only CY006079 results in a model less likely to get spurious alerts related to the stop codon of the NP CDS. |
CY005970.2.stk | training alignment used to build the CY005970 (fluA-seg4) model in flu.cm. Includes 2 sequences, CY005970.1 and CY005970.1-g1719a-1721del, which is identical to CY005970.1 with a substitution at position 1719 ('a' for 'g') and a deletion at position 1721. Building a model from this two-sequence alignment instead of only CY005970 results in a model less likely to give an alert for an invalid stop codon (e.g. LC339539.1). |
- VADR README
- VADR installation instructions
- v-build.pl example usage and command-line options
- v-annotate.pl example usage, command-line options and alert information
- Advanced tutorial: building an RSV model library
- Explanations and examples of v-annotate.pl detailed alert and error messages
  - Output fields with detailed alert and error messages
  - Explanation of sequence and model coordinate fields in .alt files
  - toy50 toy model used in examples of alert messages
  - Examples of different alert types and corresponding .alt output
  - Posterior probability annotation in VADR output Stockholm alignments
- VADR output file formats
- Available VADR model files (github wiki)
- SARS-CoV-2 annotation (github wiki)
- Rfam-based structural annotation of a viral genome sequence for use with VADR (github wiki)
- Development notes and instructions (github wiki)
- There is a preprint on using VADR for influenza annotation on bioRxiv: Vincent C Calhoun, Eneida L Hatcher, Linda Yankie, Eric P Nawrocki; Influenza sequence validation and annotation using VADR. https://www.biorxiv.org/content/10.1101/2024.03.21.585980v1
- The recommended citation for using VADR is: Alejandro A Schäffer, Eneida L Hatcher, Linda Yankie, Lara Shonkwiler, J Rodney Brister, Ilene Karsch-Mizrachi, Eric P Nawrocki; VADR: validation and annotation of virus sequence submissions to GenBank. BMC Bioinformatics 21, 211 (2020). https://doi.org/10.1186/s12859-020-3537-3
- The following article describes changes made to VADR for faster SARS-CoV-2 annotation, including the -r option: Eric P Nawrocki; Faster SARS-CoV-2 sequence validation and annotation for GenBank using VADR. NAR Genom Bioinform. Vol 5, Issue 1 (2023). https://doi.org/10.1093/nargab/lqad002
- FLAN reference: Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Tatusova T. FLAN: a web server for influenza virus genome annotation. Nucleic Acids Res. 2007 Jul;35(Web Server issue):W280-4. doi:10.1093/nar/gkm354. Epub 2007 Jun 1. PMID: 17545199; PMCID: PMC1933127.