Skip to content

Latest commit

 

History

History
192 lines (152 loc) · 7.81 KB

readme.org

File metadata and controls

192 lines (152 loc) · 7.81 KB

vcfpp: a single C++ file for manipulating VCF/BCF

https://github.com/Zilong-Li/vcfpp/actions/workflows/linux.yml/badge.svg https://github.com/Zilong-Li/vcfpp/actions/workflows/mac.yml/badge.svg https://img.shields.io/badge/Documentation-latest-blue.svg https://img.shields.io/github/v/release/Zilong-Li/vcfpp.svg https://img.shields.io/github/license/Zilong-Li/vcfpp?style=plastic.svg

This project introduces vcfpp (vcf plus plus), a single C++ file as interface to the basic htslib, which can be easily included in a C++ program for scripting high-performance genomic analyses. Check out vcfppR as an example of how vcfpp.h can facilitate rapidly writing high-performance R package.

Features:

  • single file to be easily included and compiled
  • easy and safe API to use.
  • objects are RAII. no worry about allocate and free memory.
  • the full functionalities of htslib, e.g. supports of compressed VCF/BCF and URL link as filename.
  • compatible with C++11 and later

Table of Contents

Installation

  1. install htslib on your system
  2. download the released vcfpp.h
  3. put vcfpp.h in the same folder as your cpp source file or a folder say /my/writable/path/ or the system path

Usage

The documentation of API is here.

Reading VCF

In this example, we count the number of heterozygous genotypes for each sample in all records. You can paste the example code into a example.cpp file and compile it by g++ example.cpp -std=c++11 -O3 -Wall -I. -lhts. You can replace -I. with -I/my/writable/path/ if you put vcfpp.h there.

#include "vcfpp.h"
using namespace std;
using namespace vcfpp;
int main(int argc, char* argv[])
{
    // read data from 1000 genomes server
    BcfReader vcf("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20220422_3202_phased_SNV_INDEL_SV/1kGP_high_coverage_Illumina.chr21.filtered.SNV_INDEL_SV_phased_panel.vcf.gz");
    BcfRecord var(vcf.header); // construct a variant record
    vector<char> gt; // genotype can be bool, char or int type
    vector<int> hetsum(vcf.nsamples, 0);
    while (vcf.getNextVariant(var)) {
        var.getGenotypes(gt);
        if (!var.isSNP() || !var.isNoneMissing()) continue; 
        assert(var.ploidy()==2); // make sure it is diploidy
        for(int i = 0; i < gt.size() / 2; i++) 
            hetsum[i] += abs(gt[2 * i + 0] - gt[2 * i + 1]);
    }
    for (auto i : hetsum) { cout << i << endl; }
    return 0;
}

Genotypes Coding

There are 3 types used for genotypes, ie. vector<bool>, vector<char> and vector<int>. One can use vector<bool> and vector<char> for memory-effcient goal. The downside is that it only stores 0 and 1. And vector<int> can store missing values and multialleles.

Code genotypes with missing allele as heterozygous.

If you use vector<bool> and vector<char> to store the genotypes, then there is no way to represent missing values. Hence the returned genotypes always have 0s and 1s. And genotypes with missing allele (eg. 0/., ./0, 1/., ./1, ./.) are codes as 1/0. It’s recommended to use var.isNoneMissing() to check if there is missing value.

Code missing allele as -9

If this default behavior for vector<bool> and vector<char> is not what you want, you should use vector<int> to store the genotypes, then any missing allele will be coded as -9. Note you should take the missing value -9 into account for downstream analysis.

Writing VCF

There are many ways for writing the VCF/BCF file.

Use an empty template

Here we construct an initial BCF with header using VCF4.3 specification. Next we add meta data in the header and write out variant record given a string.

BcfWriter bw("out.bcf.gz", "VCF4.3");
bw.header.addFORMAT("GT", "1", "String", "Genotype");
bw.header.addINFO("AF", "A", "Float", "Estimated allele frequency in the range (0,1)");
bw.header.addContig("chr20"); // add chromosome
for (auto& s : {"id01", "id02", "id03"}) bw.header.addSample(s); // add 3 samples
bw.writeLine("chr20\t2006060\trs146931526\tG\tC\t100\tPASS\tAF=0.000998403\tGT\t1|0\t1|1\t0|0");

Use an existing VCF as template

In this example, we first read VCF file test/test-vcf-read.vcf.gz. Second, we construct a variant record and update the record with the input VCF. Third, we construct a BcfWriter object using the meta data in the header of the input VCF, writing out the header and the modified variant record.

BcfReader br("test/test.vcf.gz");
BcfWriter bw("out.vcf.gz", br.header);
BcfRecord var(br.header);
br.getNextVariant(var);
bw.writeHeader();
var.setPOS(100001); // update the POS of the variant
bw.writeRecord(var);

In some cases where we want to modify the samples of a template VCF, we can associate the BcfWriter and BcfRecord with another modifiable header instead of the non-modifiable header as the above.

BcfReader br("test/test.vcf.gz");
BcfWriter bw("out.vcf.gz");
// copy header of template vcf and restrict to target sample
bw.copyHeader("test/test.vcf.gz", "HG00097");
BcfRecord var(bw.header);
br.getNextVariant(var);
var.setPOS(100001);
bw.writeRecord(var); // output only sample HG00097

Variants Operation

All variants related API can be found BcfRecord. The commonly used are listed below.

BcfReader vcf("bcf.gz"); // construct a vcf reader
BcfRecord var(vcf.header); // construct an empty variant record associated with vcf header
vcf.getNextVariant(var) // get next variant
vector<char> gt; // genotype can be bool, char or int type
var.getGenotypes(gt), var.setGenotypes(gt); // get or set genotypes for current variant
var.isNoneMissing(); // check if there is missing value after getting genotypes
vector<int> gq; // genotype quality usually is of int type
var.getFORMAT("GQ",gq), var.setFORMAT("GQ",gq); // get or set a vector of genotypes quality 
vector<int> pl; // Phred-scaled genotype likelihoods usually is of int type
var.getFORMAT("PL",pl); // get a vector of Phred-scaled genotype likelihoods
float af;
var.getINFO("AF", af), var.setINFO("AF", af); // get or set AF (allele frequency) value in INFO
int mq;
var.getINFO("MQ",mq) // get MQ (Average mapping quality) value from INFO
vector<int> dp4; // Number of high-quality ref-forward , ref-reverse, alt-forward and alt-reverse bases
var.getINFO("DP4", dp4), var.setINFO("DP4", dp4); // get or set a vector of dp4 value from INFO
var.isSNP(); // check if variant is SNP
var.isSV(); // check if variant is SV
var.isIndel(); // check if variant is indel
var.isMultiAllelic(); // check if variant is MultiAllelic
var.POS(), var.setPOS(); // get POS or modify POS

Header Operation

All variants related API can be found in BcfHeader.

Working with R

Examples of vcfpp working with R are in folder Rcpp and https://github.com/Zilong-Li/vcfppR.

Working with Python

Examples of vcfpp working with Python are in folder Pybind11.

Command Line Tools

Find more useful command line tools in folder tools.