-
Notifications
You must be signed in to change notification settings - Fork 39
Preparing VCF Files
The requirements for the VCF input file for PharmCAT are described in the VCF requirement doc. To summarize, for a VCF file to work with PharmCAT the following issues must be addressed:
- Must be aligned to grc38
- 'chr' must be used as chromosome identifiers
- All positions are needed, even if 0/0 or ./.
- The filter "PASS" tag is not considered so all variants are used
- Variant representation format must match the CPIC definitions
The following is our current best practices for solving each of these issues.
PharmCAT requires vcf files aligned to grc38. Liftover can be problematic, due to ambiguity in positions, and also the fact that many tools won't update the genotype. For instance, if you use CrossMap it will update the position and reference nucleotides, however if the reference in now the same as alternate it will remove the variant from the file. Likewise, NCBI Remap will not update genotypes, so even if the updated position means the variant should be called as 0/0 if will not be updated from 1/1. Presently there are only a few positions where this matters, but this will need to be very carefully checked. See the bottom of this page for information on how to solve this using remap and the experimental vcf checker. Currently we recommend aligning to grc38 from the very beginning - for example see our pharmacogenomics NGS pipeline.
PharmCAT expects vcf files to begin with "chr". If for some reason your reference does not have these they can be updated using the following unix command:
perl -pe '/^((?!^chr).)*$/ && s/^([^#])/chr$1/gsi' merged_output.vcf > merged_output.chrfixed.vcf
All positions are expected to be present in the file, even if they are 0/0 (reference) or ./. (missing). This is different from a typical vcf file which only usually contains variant sites.
Missing positions can be added by the following way using the GATK 'EMIT_ALL_SITES', either during the pipeline, or after using:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
-R grc38.reference.fasta -I input.bam -o output.vcf \
-L pgx.interval.file.list -glm BOTH --output_mode EMIT_ALL_SITES
with the following pgx.interval.file.list.
PharmCAT considers all variants in your file, even if they fail filters. If you use a tool like VSQR you can use the following Perl one-liner to change the genotype to 0/0 if necessary:
perl -pe '/^((?!PASS).)*$/ && /^((?!0\/0).)*$/ && /^((?!0\|0).)*$/ && s/[0-9][\/|][0-9]/0|0/' merged_output.vcf > merged_output_pass.vcf
Variant representation is an on-going problem in NGS (for instance see: https://macarthurlab.org/2014/04/28/converting-genetic-variants-to-their-minimal-representation/). The following vcf lines all contain the same variant:
chr7 117548628 . GTTTTTTTA GTTTTTTTA,GTTTTTA . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . GTT G . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G . . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G(T)7A G(T)5A . PASS CFTR:5T GT 0/0
Different NGS pipelines, the ways files are created (for instance if a multisample file is split), and post-processing software tools all lead to these differences. As PharmCAT is directly matching these strings to what is in the definition files this can cause problems. PharmCAT expects to find deletions as the ref="ATCT" alt=".", rather than ref="A" alt=".". Therefore you will need to replace all the deletions within in the file. If the ref is a single letter it means no variant was found, so it's safe to replace it with the appropriate nucleotide string. The following update.sh uses VIM on the command line to make the required changes. You will need to visually check each line. If the genotype is not 0/0, or there is another variant present, you will need to edit MANUALLY the file to match what PharmCAT expects. Currently we are still exploring the diversity of possible values in the indels and repeat regions. If you see something unusual please message us.
There is currently a highly experimental application within PharmCAT that will check and prepare your grc38 vcf file, including checking the genotyping of liftovers (via NCBI Remap only), chromosome naming, and indel representation. It is an interactive command line tool, as we are still unsure of what to expect at the indel sites. If you see an unusual deletion or repeat you may still need to edit the vcf by hand using a text editor. The tool can be run the following way after preparing the shadowJar:
java -cp build/libs/pharmcat-*-all.jar org.pharmgkb.pharmcat.definition.VcfChecker -d <definition_dir> -i <input vcf> -o <output.vcf> -g hg38 -p
It can also be run using the following bash script. Note it adds missing positions as star one grc38 reference calls; we also heavily recommend you don't do this, but add the missing positions from the real data. However this script in conjunction with the vcf checker tool can be used as a starting point for automating analysis of a 'typical' VCF file.
#!/usr/bin/env bash
# Use: ./prepare.sh input.vcf.gz
# Convert to grc38 using ncbi remap - http://www.ncbi.nlm.nih.gov/genome/tools/remap/docs/api
# This will add UPDATE_REF to the info field updated positions, that can be used by VcfChecker to flip the genotypes (as this is not done automatically).
# perl remap_api.pl --mode asm-asm --from GCF_000001405.25 --dest GCF_000001405.26 --annotation input.vcf --annot_out extract-grc38.vcf
file=$1
# use bcftools to split a multipart vcf if present - https://samtools.github.io/bcftools/bcftools.html
for sample in `bcftools view -h $file | grep "^#CHROM" | cut -f10-`; do
echo $sample
bcftools view -I -s $sample $file > ${file/.vcf*/.$sample.vcf}
# using a perl one liner to fix chr file starting
echo "fixing chr"
perl -pe '/^((?!^chr).)*$/ && s/^([^#])/chr$1/gsi' ${file/.vcf*/.$sample.vcf} > ${file/.vcf*/.$sample.grc38.chr.vcf}
# PharmCAT Does not look at the filter field. If you want to exclude non-PASS variants use below:
# # perl -pe '/^((?!PASS).)*$/ && /^((?!0\/0).)*$/ && /^((?!0\|0).)*$/ && s/[0-9][\/|][0-9]/0|0/' ${file/.vcf*/.$sample.grc38.chr.vcf} > ${file/.vcf*/.$sample.grc38.chr.vcf}
# VcfChecker can then be used to check for missing positions, match up the indels, and check for liftover flips
echo "Run PharmCAT vcf check. Missing positions added as reference calls"
java -cp pharmcat-**-all.jar org.pharmgkb.pharmcat.definition.VcfChecker -d generatedDefinitions/ -i ${file/.vcf*/.$sample.grc38.chr.vcf} -o ${file/.vcf*/.$sample.grc38.clean.vcf} -g hg38 -p -m
done