Skip to content

Basic Operations

Adam Novak edited this page Oct 26, 2018 · 20 revisions

Basic operations

It is useful to think of vg as a toolbox for manipulating the internal variation graph representation. Almost any goal can be expressed in terms of operations on the graph; while developing this way of thinking initially requires some mental gymnastics, in the end it makes the user's life simpler by abstracting their problem out of the biological context and into a more general one. For example, variant calling can be expressed as the series of construct, index, map, and call operations, with inputs and outputs along the way.

        vg construct -r reference.fa -v variants.vcf.gz > graph.vg # Construct
        vg index -x graph.xg -g graph.gcsa -k 11 graph.vg          # Index
        vg map -f reads.fq -g graph.gcsa -x reads.xg > mapped.gam  # Map
        vg augment -a pileup -Z samp.trans -S samp.support graph.vg mapped.gam > samp.aug.vg # Pileup/Augment
        vg call -z samp.trans -s samp.support -b graph.vg samp.aug.vg > calls.vcf # Call

Let's walk through a basic vg pipeline like the one above step-by-step to help illustrate what all it can do.

First things first: build your graph

There are two main ways to build a graph in vg. If you have a reference genome, you should generally start with vg construct. If you don't have a reference genome but do have some polished long reads, you could try a progressive assembly using vg msga. Two short command line examples follow.

        vg construct -r reference.fa -v variants.vcf.gz > graph.vg
        vg msga -f polished_contigs.fa > graph.vg

At its most basic, construct takes only a reference genome with the -r flag. This generates a linear graph, completely analogous to the linear reference used for input.

        vg construct -r reference.fa > ref.vg # a 'flat' graph, in vg parlance. One that contains no bubbles.

While all of the operations vg will operate on this graph, it wouldn't be particularly interesting (bwa could do the same things). What we would like to do is use the prior information from a population (e.g. as variant calls in a VCF) as a prior for mapping our reads. We hope that by doing so we'll get better mappings and better rediscovery of known variants. We can create such a graph like so if we have a VCF containing calls to the same reference we use as input:

      # We need to gzip compress and tabix-index our VCF, like so:
      # bgzip variants.vcf && tabix variants.vcf.gz
      vg construct -r reference.fa -v variants.vcf.gz > graph.vg

We'll save the details of msga for another wiki page, but in short we can imagine it doing the following steps:

  1. Make a 'flat' graph of the first read.
  2. Align the second read to this graph.
  3. Incorporate new paths introduced by the second read into the graph.
  4. Reindex the graph and continue the process with reads 3, 4, 5...N.

We've now seen how to create graphs, both flat ones and those containing reference variation. Now we'll talk about indexing, an essential step if we want to map reads.

Index your graph

Indexing a graph is directly analogous to indexing a linear reference with bwa index; we need to create ways to efficiently look up entities in the graph if we want to map reads.

There are two main kinds of index used by vg map, one used by vg genotype, and an old (semi-deprecated) type that we'll discuss.

The xg index

xg is a library that can represent the same information in a variation graph - nodes, edges, and paths - but in much less space. This allows us to fit most or even all of the graph structure in memory, speeding up the process of mapping reads. It also allows us to efficiently query paths in the graph, for example in the gPBWT where we may want to retrieve haplotypes (which are stored as paths). We need the xg index for mapping, and we construct it with the following command:

            vg index -x <name>.xg graph.vg

The GCSA2 index

GCSA2 stands for "Generalized Compressed Suffix Array", but you can think of it as a graph BWT index like that used by bwa. It allows fast lookup of the locations in the graph where specific subsequences are present - for example, if I wanted to know where the sequence "GATTACA" was in my graph, I could query the GCSA2 index, and it would provide me the location of the "GATTACA" sequence. The GCSA2 index requires a kmer size parameter as it's based on a pruned de Bruijn graph. We can construct a GCSA2 index like so:

            vg index -g index.gcsa -k 11 graph.vg

Mapping reads

Okay, we've now created a graph and generated our indices for mapping reads, so let's map some. There are a few ways to map reads, so we'll briefly review those before jumping in to recommended usage.

The old way of mapping used a kmer-based mapping scheme. This is pretty old-school and isn't as accurate as modern approaches based on maximal exact matches (a.k.a. MEMs). As such, this mapper has been more or less deprecated in favor of using MEMs, and we'll give examples using only MEMs.

To map reads with MEMs, we call vg map without passing a kmer size. Single end reads are passed with a single -f; paired ends require a -f for each file containing reads, like so:

          ## Single end reads
          vg map -x index.xg -g index.gcsa -f singles.fq > reads.gam
          ## Paired end reads
          vg map -x index.xg -g index.gcsa -f reads1.fq -f reads2.fq > reads.gam

The mappings are output in GAM format, an analogue of BAM for graphs.

Calling variants with call

There are two variant callers in vg. The vg call variant caller relies on a samtools pileup style structure, but is more well-tuned and highly tested, and is currently the recommended variant caller. There is also vg genotype, which uses a more FreeBayes-like approach to variant calling, but it is harder to get good results with.

vg call deserves its own wiki page, so we'll just show some quick commands here. Briefly the graph is first augmented with variants suggested by the read pileup using vg augment. vg call then uses this new graph and the statistics about the read pileup to generate variant calls.

         vg augment -a pileup -Z samp.trans -S samp.support graph.vg mapped.gam > samp.aug.vg
         vg call -z samp.trans -s samp.support -b graph.vg samp.aug.vg > calls.vcf

Calling variants with genotype instead

vg genotype relies on the graph itself and an index of the GAM that allows for efficient access to reads by graph position. It works by inserting all paths implied by reads into the graph and then looking for bubbles, then doing some logic to determine which side(s) of the bubble the reads support.

The GAM index

The GAM index allows us to find which alignments in our GAM map to specific nodes/edges and vice versa. We need this index for genotyping, as we perform this query in that command's source code. Here's how we make one:

           vg index -d mapped.gam.index -N reads.gam

This will produce an index call reads.gam.index, which shows up on your file system as a directory.

Actually genotyping

You can run vg genotype like this:

         vg genotype -v graph.vg mapped.gam.index/ > calls.vcf

This will barf a bunch of info to the terminal that relates to support for individual calls. It will also produce a VCF, but the coordinates won't be sorted, so you might want to sort them with vcflib.

At this point, you've gone from a reference genome, some population variation and a set of reads to genotyped calls.

Clone this wiki locally