why
One of the first steps after variant calling in many pipelines is filtering on allele frequency. This requires annotating with large datasets (for example, gnomAD genomes is over 1TB of data). echtvar uses integer compression, variant encoding and genomic chunking to make this stupid fast.
To make this simpler, smaller, and faster, echtvar encodes and compresses the variant, allele-frequency and other (user-specified) columns from a population VCF/BCF into an efficient format. This enables rapid annotation. In our tests, echtvar can annotate at ~1 million variants / second, but this is highly dependent on disk speed.
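As a rough illustration of how genomic chunking and integer compression can work together, the sketch below groups sorted positions into fixed-width chunks and stores small deltas instead of absolute positions. The chunk width (`CHUNK_SHIFT`) and the function names are assumptions made for this example; it is not echtvar's actual on-disk format.

```rust
use std::collections::BTreeMap;

/// Hypothetical chunk width: 2^20 bases per chunk (an assumption for this sketch).
const CHUNK_SHIFT: u32 = 20;

/// Group sorted positions by chunk (`pos >> CHUNK_SHIFT`) and store each
/// within-chunk offset as a delta from the previous offset in that chunk.
/// Small deltas are what make a later integer-compression step effective.
/// Positions must be sorted in ascending order.
fn chunk_and_delta(positions: &[u32]) -> BTreeMap<u32, Vec<u32>> {
    let mut chunks: BTreeMap<u32, Vec<u32>> = BTreeMap::new();
    let mut last: BTreeMap<u32, u32> = BTreeMap::new();
    for &pos in positions {
        let chunk = pos >> CHUNK_SHIFT;
        let offset = pos & ((1 << CHUNK_SHIFT) - 1);
        // Previous offset seen in this chunk, or 0 for the first entry.
        let prev = last.insert(chunk, offset).unwrap_or(0);
        chunks.entry(chunk).or_default().push(offset - prev);
    }
    chunks
}

/// Recover the absolute positions stored in a single chunk.
fn decode_chunk(chunk: u32, deltas: &[u32]) -> Vec<u32> {
    let mut offset = 0u32;
    deltas
        .iter()
        .map(|d| {
            offset += d;
            (chunk << CHUNK_SHIFT) | offset
        })
        .collect()
}
```

With a layout like this, annotating a query variant only needs the one chunk that covers its position rather than a whole chromosome.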
slivar has a feature similar to echtvar. It has the following limitations that echtvar overcomes.
- slivar reads each chromosome into memory. This can make memory use quite high when there are many attributes and many variants (for example with CADD, which has 3 variants per genomic location).
- slivar only uses general-purpose gzip (zlib) compression.
- slivar uses 64 bits for small variants with overflow to a text table. echtvar uses 32 bits for small variants with overflow to an efficient binary format.
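To make the idea of a 32-bit encoding with overflow concrete, here is a minimal sketch. The bit layout (20 bits of within-chunk offset, 2 bits each for REF and ALT length, 8 bits for up to four 2-bit bases) and the names `Encoded`, `encode` and `decode` are assumptions chosen for illustration, not echtvar's documented format.

```rust
/// Hypothetical layout (an assumption for this sketch):
/// bits 31..12 offset within chunk, 11..10 REF length,
/// 9..8 ALT length, 7..0 up to four 2-bit bases (REF then ALT).
#[derive(Debug, PartialEq)]
enum Encoded {
    /// Variant fits into a single u32.
    Small(u32),
    /// Variant is too long or has non-ACGT bases; kept in full.
    Overflow { offset: u32, reference: String, alternate: String },
}

fn base_code(b: u8) -> Option<u32> {
    match b {
        b'A' => Some(0),
        b'C' => Some(1),
        b'G' => Some(2),
        b'T' => Some(3),
        _ => None,
    }
}

fn encode(offset: u32, reference: &str, alternate: &str) -> Encoded {
    let overflow = || Encoded::Overflow {
        offset,
        reference: reference.to_string(),
        alternate: alternate.to_string(),
    };
    let (rl, al) = (reference.len(), alternate.len());
    if offset >= (1 << 20) || rl > 3 || al > 3 || rl + al > 4 {
        return overflow();
    }
    // Pack bases two bits each, first base in the most significant position.
    let mut bases = 0u32;
    for b in reference.bytes().chain(alternate.bytes()) {
        match base_code(b) {
            Some(code) => bases = (bases << 2) | code,
            None => return overflow(),
        }
    }
    Encoded::Small((offset << 12) | ((rl as u32) << 10) | ((al as u32) << 8) | bases)
}

/// Inverse of `encode` for the `Small` case: (offset, REF, ALT).
fn decode(v: u32) -> (u32, String, String) {
    let offset = v >> 12;
    let rl = ((v >> 10) & 3) as usize;
    let al = ((v >> 8) & 3) as usize;
    let n = rl + al;
    let seq: String = (0..n)
        .map(|i| b"ACGT"[((v >> (2 * (n - 1 - i))) & 3) as usize] as char)
        .collect();
    (offset, seq[..rl].to_string(), seq[rl..].to_string())
}
```

Most population variants are SNVs or short indels, so under a scheme like this nearly everything lands in the `Small` case and only the rare long allele pays for the overflow path.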
In our experience, an echtvar file will be about 60-70% of the size of the corresponding slivar-encoded file. And echtvar is substantially (often 5X) faster than slivar.
Other tools like vcfanno, bcftools annotate and SnpSift can annotate a query VCF with one or more VCFs. Each of these must parse much of the original (often huge) annotation files, so speed is limited by parsing of the annotation files.
Other tools like bcftools, SnpSift, and slivar support filtering expressions. The expressions in echtvar are stupid fast. In fact, it is often faster to apply an expression because writing to disk is the bottleneck and an expression will filter such that fewer variants are written to disk. Other tools, especially slivar, provide more flexible and complete filtering. The intent with echtvar is to cover most common use-cases with extreme speed. This is done with the fasteval Rust library.