This project introduces vcfpp (vcf plus plus), a single C++ file as interface to
the basic htslib
, which can be easily included in a C++ program for scripting
high-performance genomic analyses. Check out vcfppR as an example of how vcfpp.h
can facilitate rapidly writing high-performance R package.
Features:
- single file to be easily included and compiled
- easy and safe API to use.
- objects are RAII. no worry about allocate and free memory.
- the full functionalities of
htslib
, e.g. supports of compressed VCF/BCF and URL link as filename. - compatible with C++11 and later
- install htslib on your system
- download the released vcfpp.h
- put
vcfpp.h
in the same folder as your cpp source file or a folder say/my/writable/path/
or the system path
The documentation of API is here.
In this example, we count the number of heterozygous genotypes for each
sample in all records. You can paste the example code into a
example.cpp
file and compile it by g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts
.
You can replace -I.
with -I/my/writable/path/
if you put vcfpp.h
there.
#include "vcfpp.h"
using namespace std;
using namespace vcfpp;
int main(int argc, char* argv[])
{
// read data from 1000 genomes server
BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
BcfRecord var(vcf.header); // construct a variant record
vector<char> gt; // genotype can be bool, char or int type
vector<int> hetsum(vcf.nsamples, 0);
while (vcf.getNextVariant(var)) {
var.getGenotypes(gt);
if (!var.isSNP() || !var.isNoneMissing()) continue;
assert(var.ploidy()==2); // make sure it is diploidy
for(int i = 0; i < gt.size() / 2; i++)
hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
}
for (auto i : hetsum) { cout << i << endl; }
return 0;
}
There are 3 types used for genotypes, ie. vector<bool>
, vector<char>
and vector<int>
. One can use vector<bool>
and vector<char>
for
memory-effcient goal. The downside is that it only stores 0 and 1. And
vector<int>
can store missing values and multialleles.
If you use vector<bool>
and vector<char>
to store the genotypes, then
there is no way to represent missing values. Hence the returned
genotypes always have 0s and 1s. And genotypes with missing allele
(eg. 0/.
, ./0
, 1/.
, ./1
, ./.
) are codes as 1/0
. It’s recommended to
use var.isNoneMissing()
to check if there is missing value.
If this default behavior for vector<bool>
and vector<char>
is not what
you want, you should use vector<int>
to store the genotypes, then any
missing allele will be coded as -9
. Note you should take the missing
value -9
into account for downstream analysis.
There are many ways for writing the VCF/BCF file.
Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.
BcfWriter bw("out.bcf.gz", "VCF4.3");
bw.header.addFORMAT("GT", "1", "String", "Genotype");
bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)");
bw.header.addContig("chr20"); // add chromosome
for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples
bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");
In this example, we first read VCF file test/test-vcf-read.vcf.gz
. Second,
we construct a variant record and update the record with the input
VCF. Third, we construct a BcfWriter object using the meta data in the
header of the input VCF, writing out the header and the modified variant
record.
BcfReader br("test/test.vcf.gz");
BcfWriter bw("out.vcf.gz", br.header);
BcfRecord var(br.header);
br.getNextVariant(var);
bw.writeHeader();
var.setPOS(100001); // update the POS of the variant
bw.writeRecord(var);
In some cases where we want to modify the samples of a template VCF, we
can associate the BcfWriter
and BcfRecord
with another modifiable header
instead of the non-modifiable header as the above.
BcfReader br("test/test.vcf.gz");
BcfWriter bw("out.vcf.gz");
// copy header of template vcf and restrict to target sample
bw.copyHeader("test/test.vcf.gz", "HG00097");
BcfRecord var(bw.header);
br.getNextVariant(var);
var.setPOS(100001);
bw.writeRecord(var); // output only sample HG00097
All variants related API can be found BcfRecord. The commonly used are listed below.
BcfReader vcf("bcf.gz"); // construct a vcf reader
BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
vcf.getNextVariant(var) // get next variant
vector<char> gt; // genotype can be bool, char or int type
var.getGenotypes(gt), var.setGenotypes(gt); // get or set genotypes for current variant
var.isNoneMissing(); // check if there is missing value after getting genotypes
vector<int> gq; // genotype quality usually is of int type
var.getFORMAT("GQ",gq), var.setFORMAT("GQ",gq); // get or set a vector of genotypes quality
vector<int> pl; // Phred-scaled genotype likelihoods usually is of int type
var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods
float af;
var.getINFO("AF", af), var.setINFO("AF", af); // get or set AF (allele frequency) value in INFO
int mq;
var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO
vector<int> dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases
var.getINFO("DP4", dp4), var.setINFO("DP4", dp4); // get or set a vector of dp4 value from INFO
var.isSNP(); // check if variant is SNP
var.isSV(); // check if variant is SV
var.isIndel(); // check if variant is indel
var.isMultiAllelic(); // check if variant is MultiAllelic
var.POS(), var.setPOS(); // get POS or modify POS
All variants related API can be found in BcfHeader.
Examples of vcfpp working with R are in folder Rcpp and https://github.com/Zilong-Li/vcfppR.
Examples of vcfpp working with Python are in folder Pybind11.
Find more useful command line tools in folder tools.