VIRUSBreakend is a high-speed viral integration detection tool designed to be incorporated in the whole genome sequence piplines with minimal additional cost. VIRUSBreakend takes on average 1 hour to run on a 100x coverage human sample. Recommended job/VM size is 4 core / 64Gb memory.
This tool is part of GRIDSS - the Genomic Rearrangement IDentification Software Suite.
Please cite:
-
VIRUSBreakend: https://doi.org/10.1093/bioinformatics/btab343
-
GRIDSS2: harnessing the power of phasing and single breakends in somatic structural variant detection https://www.biorxiv.org/content/10.1101/2020.07.09.196527v1
VIRUSBreakend is part of the GRIDSS software suite. It can be downloaded from https://github.com/PapenfussLab/gridss/releases
All tools used by VIRUSBreakend must be on PATH
including:
- java
- GRIDSS
- Kraken2
- RepeatMasker
- samtools
- bcftools
- bwa
Set the GRIDSS_JAR
environment variable to the location of the GRIDSS jar file
The VIRUSBreakend reference data can be downloaded from AWS: https://virusbreakend.s3.us-east-2.amazonaws.com/virusbreakenddb_20210401.tar.gz
The VIRUSBreakend reference data can also be downloaded from https://melbourne.figshare.com/articles/dataset/virusbreakenddb_20210401_tar_gz/14782299
VIRUSBreakend uses a standard Kraken2 database augmented with NCBI viral neighbour genomes and associated metadata files.
The reference file database contains the following files:
file | description |
---|---|
library/viral/library.fna | NCBI RefSeq viral genomes |
library/added/*.fna | NCBI viral neighbour genomes |
*.fai | samtools index of associated .fna file |
*.fna.dict | Picard tools sequence dictionary of associated .fna file |
hash.k2d | Kraken2 database file |
taxo.k2d | Kraken2 database file |
opts.k2d | Kraken2 database file |
taxonomy/nodes.dmp | Snapshot of NCBI taxonomy nodes at time of database creation |
taxonomy/names.dmp | Snapshot of NCBI taxonomy names at time of database creation. Not technically required by VIRUSBreakend but useful for looking up the name of the assigned taxid |
seqid2taxid.map | Kraken2 mapping of sequences to NCBI taxonomy IDs |
taxid10239.nbr | Table of viral neighbour genomes and their associated RefSeq genome |
The exact sources for these files can be determined from the paths and queries in virusbreakend-build
.
Run virusbreakend-build --db virusbreakenddb
to download and generate the reference data.
This download the NCBI taxonomic information, sequences, virushostdb, the kraken2-build
build process, and generates indexes.
The index is around 54GB in size.
Be aware that the kraken2 build process requires around 150GB of intermediate disk space to download from NCBI and build the index.
WARNING: downloading from NCBI fails regularly. It is likely you will have to manually rerun each failing step in virusbreakend-build
to get a functioning database.
virusbreakend-build
requires:
- An internet connection
- GRIDSS
- samtools
- Kraken2
- dustmasker (part of the NCBI BLAST+ package)
- EDirect utilities
These can be installed manually, or the BioConda samtools
kraken2
and entrez-direct
packages can be used.
The the generated directory (virusbreakenddb
) contains the files used in the build process.
The file virusbreakend.db.virusbreakenddb.tar.gz
contains the minimal subset of the virusbreakenddb
data required by VIRUSBreakend.
To run VIRUSbreakend, ensure that the database has been build and run the following command:
virusbreakend \
--kraken2db virusbreakenddb \
--output sample.virusbreakend.vcf \
--reference host_reference.fa \
sample.bam
Some reference genomes include viral sequence. hs37d5, hs38DH, and GRCh38_full_analysis_set_plus_decoy_hla all include EBV. For VIRUSBreakend to report EBV infection, reads aligned to the EBV sequence ( or chrEBV)
By default, VIRUSBreakend considered any read mapped if it maps to chrEBV
or any viral reference sequence in the VIRUSBreakend database (library/viral/*.fna.fai
or library/added/*.fna.fai
).
When checking for matches, VIRUSBreakend uses the exact name, the kraken stripped name, and also the version stripped named.
For example, the kraken2 database contains kraken:taxid|10376|NC_007605.1
so VIRUSBreakend will exclude kraken:taxid|10376|NC_007605.1
, NC_007605.1
, and NC_007605
(thus picking up the NC_007605
EBV reference in hs37d5.fa
)
A custom list can be supplied with the --viralreferences
command line parameter.
VIRUSBreakend outputs:
- A VCF containing the integration breakpoints
- The kraken2 report of the virus(es) for which viral integration was run upon
- Coverage statistics of the virus(es) for which viral integration was run upon
field | meaning |
---|---|
taxid_genus | NCBI taxonomy ID of genus of viral reference |
name_genus | NCBI taxonomy genus name |
reads_genus_tree | Number of reads assigned to any virus in that genus by kraken2 |
taxid_species | NCBI taxonomy ID of species of viral reference |
reads_species_tree | Number of reads assigned to that species (and any sub-species) by kraken2 |
name_species | NCBI taxonomy species name |
taxid_assigned | NCBI taxonomy ID of viral reference |
name_assigned | NCBI taxonomy name of viral reference |
reads_assigned_tree | Number of reads assigned to the viral reference or taxonomic descendents by kraken2 |
reads_assigned_direct | Number of reads assigned to the viral reference taxid by kraken2 |
reference | name of viral reference used. |
reference_taxid | NCBI taxonomy ID of viral reference used. This can be a child taxid of taxid_assigned. |
reference_kmer_count | count of read kmers matching reference used |
alternate_kmer_count | count of read kmers matching next best reference candidate |
#rname | name of adjusted viral reference |
startpos | start position of adjusted viral reference. Always 1 |
endpos | end position of adjusted viral reference. Always equal to the viral contig length |
numreads | Number of reads mapped to adjusted viral reference |
covbases | Number of bases with at least 1 read mapped |
coverage | Percentage of viral sequence with at least one read mapped |
meandepth | Mean alignment depth |
meanbaseq | Mean base quality of bases mapped to adjusted viral reference |
meanmapq | Mean mapping quality of reads mapped to adjusted viral reference |
integrations | Number of integration breakpoints found. |
QCStatus | QC status of viral integration |
Each viral integration should have 2 integration breakpoints (one for the start, one for the end) although there may be more if the integration site is rearranged or less if one side of the integration site was missed.
QCStatus can be one of the following:
-
"" The empty string indicates no warning or errors associated with this virus
-
LOW_VIRAL_COVERAGE
Less than 10% of the virus has any coverage. This typically occurs when the viral sequence has a short sequence homology with non-viral sequence. This record is likely a false positive and should be ignored.
- EXCESSIVE_VIRAL_COVERAGE
Regions of the virus have depth of coverage so high, that GRIDSS was unable to call integrations in these regions.
The exact bounds of excluded regions can be found in the *.coverage.blacklist.bed
files in the virusbreakend.working
directory
- ASSEMBLY_DOWNSAMPLED
Viral coverage was sufficiently high that some regions were downsampled in the GRIDSS assembly process.
With the assembly downsampling changes in GRIDSS 2.11.1, viral integration have in these region have probably been correctly called but the QUAL score is lower than it should be.
The exact bounds of excluded regions can be found in the *.bed.excluded_*.bed
files in the virusbreakend.working
directory
- CHILD_TAXID_REFERENCE
The viral genome chosen by VIRUSBreakend does not match the NCBI taxonomic classification assigned by Kraken2.
This typically occurs in scenarios such as HPV-45 where Kraken2 assigns reads to the parent taxon due to sequence commonality between HPV-45 and HPV-18.
This is expected behavour for these viral taxa. The taxon assigned reference_taxid
- UNCLEAR_TAXID_ASSIGNMENT
Less than 60% of reads assigned to nominal kraken2 assigned taxon were directly assigned (the rest were assigned to child taxa). The taxonomic assignment may be unclear - possibly due to co-infection by multiple viruses within the genus.
The key differentiator of VIRUSBreakend is the ability to detect and classify integration sites in repetative sequences such as centromeres.
Due to the repetative nature of these region, such integration sites cannot be unambigously placed in the host genome.
In such cases, the mapq encoded in the BEALN
field will be 0 and the field may contain multiple candidicate integration sites.
Integration sites in which the reported position is ambiguous have a LOW_MAPQ
FILTER applied.
The INSRM
field contains the repeat sequences identifed in the integration site host sequences.
These annotations can be used to classify ambigous integration sites.
By default, VIRUSBreakend filters out potential integrations with signficant overlap with simple or low complexity repeats since such integrations are typically false positives.
If a more sensitive call set is required, this filter can be removed by rerunning the final VirusBreakendFilter
steps with the MINIMUM_REPEAT_OVERLAP
parameter set to any value greater than one (ie require more than 100% repeat overlap).