DeepVariant is a set of programs used to transform aligned sequencing reads into variant calls. At the highest level, a user needs to provide three inputs:
-
A reference genome in FASTA format and its corresponding .fai index file generated using the
samtools faidx
command. -
An aligned reads file in BAM format and its corresponding index file (.bai). The reads must be aligned to the reference genome described above.
-
A model checkpoint for DeepVariant.
The output of DeepVariant is a list of all variant calls in VCF format.
DeepVariant is composed of three programs: make_examples
, call_variants
, and
postprocess_variants
. Each program's function and example usage is described
in detail in the quick start.
- Sharded files are a single logical collection of files with a common naming
convention. For example, we talk about
filename@10
as a single 10-way sharded file namedfilename
. On most filesystems this actually looks like 10 distinct filesfilename-00000-of-00010
, ...,filename-00009-of-00010
. DeepVariant can write sharded files using theirfilename@10
-style name and can read sharded files using both that style as well as the glob form, such asfilename-*
orfilename-*-of-00010
. - Files with the
.gz
suffix are interpreted as being compressed with gzip and are read/written accordingly. - All of the DeepVariant tools can read/write their inputs/outputs from local disk as well as directly from a cloud storage bucket (through gcsfuse). In the whole genome case study and exome case study, we didn't use gcsfuse because there is not much benefit on a single machine compared to just copying the files first.
make_examples
consumes reads and the reference genome to create TensorFlow
examples for evaluation with our deep learning models. The tf.Example protos are
written out in TFRecord format.
make_examples
is a single-threaded program using 1-2 GB of RAM. Since the
process of generating examples is embarrassingly parallel across the genome,
make_examples
supports sharding of its input and output via the --task
argument with a sharded output specification. For example, if the output is
specified as --examples examples.tfrecord@10.gz
and --task 0
, the input to
the program will be 10% of the regions and the output will be written to
examples.tfrecord-00000-of-00010.gz
. A concrete example of using multiple
processes and sharded data is given in the quick start.
make_examples
requires its input files to satisfy a few basic requirements to
be processed correctly.
First, the reference genome FASTA, passed in using the --ref
flag, must be
indexed and can either be uncompressed or compressed with bgzip.
Second, the BAM file provided to --reads
should be aligned to a "compatible"
version of the genome reference provided as the --ref
. By compatible here we
mean the BAM and FASTA share at least a reasonable set of common contigs, as
DeepVariant will only process contigs shared by both the BAM and reference. As
an example, suppose you have a BAM file mapped to b37 + decoy FASTA and you
provide just the vanilla b37 fasta to make_examples
. DeepVariant will only
process variants on the shared contigs, effectively excluding the hs37d5 contig
present in the BAM but not in the reference.
The BAM file must be also sorted and indexed. It must exist on disk, so you cannot pipe it into DeepVariant. We currently recommend that the BAM be duplicate marked, but it's unclear if this is even necessary. Finally, it's not necessary to recalibrate the base qualities or do any form of indel realignment.
Third, if you are providing --regions
or other similar arguments these should
refer to contigs present in the reference genome. These arguments accept
space-separated lists, so all of the follow examples are valid arguments for
--regions
or similar arguments:
--regions chr20
=> only process all of chromosome 20--regions chr20:10,000,000-11,000,000
=> only process 10-11mb of chr20--regions "chr20 chr21"
=> only process chromosomes 20 and 21
Fourth and finally, if running in training mode the truth_vcf
and
confident_regions
arguments should point to VCF and BED files containing the
true variants and regions where we are confident in our calls (i.e., calls
within these regions and not in the truth_vcf are considered false positives).
These should be bgzipped and tabix indexed and be on a reference consistent with
the one provided with the --ref
argument.
call_variants
consumes TFRecord file(s) of tf.Examples protos created by
make_examples
and a deep learning model checkpoint and evaluates the model on
each example in the input TFRecord. The output here is a TFRecord of
CallVariantsOutput protos. call_variants
doesn't directly support sharding its
outputs, but accepts a glob or shard-pattern for its inputs.
call_variants
uses around 4 GB per process and uses TensorFlow for evaluation.
When evaluating a model in CPU mode, TensorFlow can make use of multiple cores,
but scaling is sub-linear. In other words, call_variants
on a 64 core machine
is less than 8x faster than running on a 8 core machine.
When using a GPU, call_variants
is both faster, more efficient, and needs
fewer CPUs. Based on a small number of experiments, currently the most efficient
configuration for call_variants
on a GPU instance is 4-8 CPUs and 1 GPU.
Compared to our setting in the whole genome case study, we noticed a 2.5x
speedup on the call_variants step using a single K80 GPU and 8 CPUs. Note that
currently call_variants
can only use one GPU at most. So it doesn't improve
the speed if you get a multiple-GPU machine.
More speed and cost details when running DeepVariant with GPU can be found on Running DeepVariant on Google Cloud Platform. The following steps show what to do if you want to manually create an instance with GPU. In order to run with GPU, here are a few changes you want to make in addition to the instructions in the whole genome case study or the exome case study.
-
To use GPUs, please read the GPUs on Compute Engine page first. For example, you'll need to request GPU quota first.
-
Request a machine with GPU.
If you want to use command line to start a machine, here is an example we used:
gcloud beta compute instances create "deepvariant-gpu-run" \ --scopes "compute-rw,storage-full,cloud-platform" \ --image-family "ubuntu-1604-lts" --image-project "ubuntu-os-cloud" \ --machine-type "n1-standard-8" --boot-disk-size "300" \ --boot-disk-type "pd-ssd" \ --boot-disk-device-name "deepvariant-gpu-run" \ --zone "us-west1-b" \ --accelerator type=nvidia-tesla-k80,count=1 \ --maintenance-policy "TERMINATE"
-
When setting up DeepVariant, make sure you run
run-prereq.sh
withDV_GPU_BUILD=1
andDV_INSTALL_GPU_DRIVERS=1
.DV_GPU_BUILD
gets you a GPU-enabled TensorFlow build, andDV_INSTALL_GPU_DRIVERS
installs the GPU drivers.DV_GPU_BUILD=1 DV_INSTALL_GPU_DRIVERS=1 bash run-prereq.sh
This helps you set up an environment with GPU-enabled TensorFlow on Ubuntu 16. You won't need
DV_INSTALL_GPU_DRIVERS
if you already installed and configured your machine with GPU capabilities. You can read NVIDIA requirements to run TensorFlow with GPU support for more details. -
(Optional) You can install and run
nvidia-smi
to confirm that your GPU drivers are correctly installed.sudo apt-get -y install nvidia-smi
After you install, run
nvidia-smi
to make sure that your GPU drivers are set up properly before you proceed.
postprocess_variants
reads all of the output TFRecord files from
call_variants
, sorts them, combines multi-allelic records, and writes out a
VCF file. When gVCF output is requested, it also
outputs a gVCF file which merges the VCF with the non-variant sites.
Because postprocess_variants
combines and sorts the output of
call_variants
, it needs to see all of the outputs from call_variants
for a
single sample to merge into a final VCF. postprocess_variants
is
single-threaded and needs a non-trivial amount of memory to run (20-30 GB), so
it is best run on a single/dual core machine with sufficient memory.
The DeepVariant team has been hard at work since we first presented the method. Key changes and improvements include:
- Rearchitected with open source release in mind
- Built on TensorFlow
- Increased variant calling accuracy, especially for indels
- Vastly faster with reduced memory usage
We have made a number of improvements to the methodology as well. The biggest change was to move away from RGB-encoded (3-channel) pileup images and instead represent the aligned read data using a multi-channel tensor data layout. We currently represent the data as a 6-channel raw tensor in which we encode:
- The read base (A, C, G, T)
- The base's quality score
- The read's mapping quality score
- The read's strand (positive or negative)
- Does the read support the allele being evaluated?
- Does the base match the reference genome at this position?
These are all readily derived from the information found in the BAM file encoding of each read.
Additional modeling changes were to move to the inception-v3 architecture and to train on many more independent sequencing replicates of the ground truth training samples, including 50% downsampled versions of each of those read sets. In our testing this allowed the model to better generalize to other data types.
In the end these changes reduced our error rate by more than 50% on the held out evaluation sample (NA24385 / HG002) as compared to our results in the PrecisionFDA Truth Challenge:
DeepVariant April 2016 (HG002, GIAB v3.2.2, b37):
Type | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|
INDEL | 4175 | 2839 | 0.987882 | 0.991728 | 0.989802 |
SNP | 1689 | 832 | 0.999447 | 0.999728 | 0.999587 |
DeepVariant December 2017 (HG002, GIAB v3.2.2, b37):
Type | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|
INDEL | 2384 | 1811 | 0.993081 | 0.994954 | 0.994017 |
SNP | 735 | 363 | 0.999759 | 0.999881 | 0.999820 |
See the whole genome case study, which we update with each release of DeepVariant, for the latest results.
You can also see the Datalab example to see how you can visualize the pileup images.
For the models we've released over time, here are more details of the training data we used.
For even more details, see DeepVariant training data.
version | Replicates | #examples |
---|---|---|
v0.4 | 9 HG001 | 85,323,867 |
v0.5 | 9 HG001 2 HG005 78 HG001 WES 1 HG005 WES(1) |
115,975,740 |
v0.6 | 10 HG001 PCR-free 2 HG005 PCR-free 4 HG001 PCR+ |
156,571,227 |
v0.7 | 10 HG001 PCR-free 2 HG005 PCR-free 4 HG001 PCR+ |
158,571,078 |
version | Replicates | #examples |
---|---|---|
v0.5 | 78 HG001 WES 1 HG005 WES |
15,714,062 |
v0.6 | 78 HG001 WES 1 HG005 WES(2) |
15,705,449 |
v0.7 | 78 HG001 WES 1 HG005 WES |
15,704,197 |
(1): In v0.5, we experimented with adding whole exome sequencing data into training data. In v0.6, we took it out because it didn't improve the WGS accuracy.
(2): The training data are from the same replicates as v0.5. The number of examples changed because of the update in haplotype_labeler.
For educational purposes, the DeepVariant Case Study uses the simplest, but least efficient, computing configuration for DeepVariant. At scale, DeepVariant can be run far more efficiently by following a few simple guidelines:
-
The work performed by
make_examples
can be split across on any number of machines, using all NCPU processes on each machine. For example, to run efficiently onM
machines, each withC
cores, the number of shards in the output file should beN = M*C
. Each process should pass as its output file flag--examples output@N
, with the--task
flag set uniquely for each process (e.g., for corec
on machinem
, where0 <= c < C
and0 <= m < M
), the task should bem*C + c
. The number of jobs / shards should be balanced by the time needed to start up a running instance of DeepVariant (especially when localizing the inputs to the instance). -
make_examples
with--task i
processes every ith interval on the genome. This means that each process will access the entire genome (in pieces). This provides excellent load-balancing across tasks, but it means that each process needs access to the entire reference and BAM file. Localization costs can be overcome by directly accessing files on cloud storage, or splitting up the inputs by chromosome and then processing those with--task
and--regions
set as appropriate. Note that the DeepVariant team doesn't recommend going down the per-chromosome BAM route, as direct cloud storage access is a more attractive long-term approach. -
call_variants
can read any number of TFRecord files frommake_examples
. As in the whole genome case study, we runmake_examples
64 ways parallel, writing anexamples@64
file. We then runcall_variants --examples examples@64
to evaluate all of the examples across all 64 example files. Given the sub-linear scaling with CPUs, the most cost-efficient way to runcall_variants
is either on a single GPU instance or one with 4 CPUs.
- Run
make_examples
on a large number of machines, reading the BAM and reference genome from cloud storage directly, to a sharded output fileexamples@N
. - Run multiple
call_variants
jobs on 4-8 CPU / 1 GPU machines, providing each separatecall_variants
job a subset of the example outputs. For example, ifN
separatecall_variants
jobs are run, each jobJ
readsexamples-0000J-of-0000N
, and writes output tocalls-0000J-of-0000N
). - Run
postprocess_variants
when allcall_variants
jobs have finished, merging thecalls@N
into a single VCF on a 30 GB (or so) machine with 1-2 cores.
- Use a job manager that supports preemptable VMs (i.e., can detect a machine
going down and restart the dead job on a new machine) for
make_examples
. - Explore the optimal cost trade-off between preemptable CPU/GPU instances of
call_variants
. - Run
postprocess_variants
on a preemptable VM as noted above.
To be totally concrete, suppose we wanted to rerun the Case Study but distributed across many machines.
- Run
make_examples
on 8 64 core machines, each runningmake_examples
withparallel
or equivalent with 64 processes on each machine. This would complete in roughly 45 minutes each. The output TFRecords should be written directly to cloud storage. - Run
call_variants
on 8 8-CPU/1-GPU instances, each processing 64 shards of the 8x64 sharded examples. This completes in roughly 75 minutes. The inputs should be read directly from cloud storage and the outputs should be written directly to cloud storage. An alternative CPU-only workflow would be to run on 64 4-CPU machines. - Run a single
postprocess_variants
on a 2 core 30 GB machine, completing in 30-45 minutes.
It's possible to run on more machines for make_examples
and call_variants
if
a shorter turn-around-time is needed. At some point scaling limits will be hit,
either due to localizing BAMs, startup costs on the machines, reading too many
sharded inputs, etc. The optimal configuration for minimizing cost or turnaround
time is currently an open question and likely depends on input data (e.g. whole
genome vs exome, sequencing coverage, etc.).
Note: All of the specific machine type recommendations here are based on empirical results on Google Cloud Platform and are likely to be different on different machine types. Moreover, the optimal configuration may change over time as DeepVariant, TensorFlow, and the CPU, GPU, and TPU hardware evolves.
As of v0.7, DeepVariant accepts CRAM files as input in addition to BAM files. CRAM files encoded without reference compression or those with embedded references just work with DeepVariant. However, CRAM files using an external reference version may not work out-of-the-box. If the path to the original reference (encoded in the file's "UR" tag) is accessible on the machine directly or is already in the local genome cache, then DeepVariant should just work. Otherwise those files cannot be used with DeepVariant at this time. Consequently, we don't recommend using CRAM files with external reference files, but instead suggest using read sequence compression with embedded reference data. (This has a minor impact of file size, but significantly improves file access simplicity and safety.)
For more information about CRAM, see the
samtools
documentation in general
but particularly the sections on
Global Options and
reference sequences in CRAM.
htslib
also hosts a nice page
benchmarking CRAM with information
on the effect of different CRAM options on file size and encoding/decoding
performance.
Here are some basic file size and runtime numbers for running a single
make_examples
job on the case-study data in BAM and CRAM format (with embedded
references) on a BAM/CRAM file for all of chromosome 20 and running on
20:10mb-11mb:
Filetype | Size (Gb) | Runtime (min) |
---|---|---|
BAM | 2.1G | 7m53.589s |
CRAM | 1.2G | 10m37.904s |
-------- | --------- | ------------- |
Ratio | 57% | 135% |
gcsfuse allows access to files in the GCS bucket much like local files.
Install gcsfuse package as explained here (recommended).
First ensure your "application default" credential for GCP is available. The easiest way to set this up is to run gcloud tool.
gcloud auth application-default login
Then mount the bucket to a dir on your local machine.
mkdir -p $local_dir # E.g. local_dir=$HOME/input
gcsfuse $gcs_bucket_name $local_dir # E.g. gcs_bucket_name=deepvariant
Confirm it is mounted successfully by doing ls
on $local_dir
. If there is a
problem, you may want to unmount and mount again.
fusermount -u $local_dir # Linux
umount $local_dir # OS X
Now you can run DeepVariant as if the input files are stored locally under
$local_dir
.
Currently gcsfuse does not work well with DeepVariant in multi-tasking scenarios , i.e. one gcsfuse daemon and multiple DeepVariant tasks (see the issue). There are two workarounds for this:
-
Run multiple gcsfuse; one gcsfuse daemon for every DeepVariant task.
for i in `seq 1 $num_task`;do mkdir -p $local_dir_task_$i gcsfuse $gcs_bucket_name $local_dir_task_$i done
-
Reduce read sizes in DeepVariant tasks. For example, we observed that the issue disappears for
make_examples
if hts block size is reduced to128KB
(from default128MB
). Use--hts_block_size
flag for this.
- gcsfuse installing doc
- mounting doc (if you exprienced credentional issue)