-
Notifications
You must be signed in to change notification settings - Fork 39
Preparing VCF Files
The requirements for the VCF input file for PharmCAT are described in the VCF requirement doc. To summarize, for a VCF file to work with PharmCAT the following issues must be addressed:
- Must be aligned to grc38
- 'chr' must be used in chromosome identifiers (e.g.
chr17
instead of17
) - All positions are needed, even if 0/0 or ./.
- The filter "PASS" tag is not considered so all variants are used
- Variant representation format must match the CPIC definitions
The following is our current best practices for solving each of these issues.
PharmCAT requires vcf files aligned to grc38. Liftover can be problematic, due to ambiguity in positions, and also the fact that many tools won't update the genotype. For instance, if you use CrossMap it will update the position and reference nucleotides, however if the reference in now the same as alternate it will remove the variant from the file. Likewise, NCBI Remap will not update genotypes, so even if the updated position means the variant should be called as 0/0 if will not be updated from 1/1. Presently there are only a few positions where this matters, but this will need to be very carefully checked. See the bottom of this page for information on how to solve this using remap and the experimental vcf checker. Currently we recommend aligning to grc38 from the very beginning - for example see our pharmacogenomics NGS pipeline. NB - we have discovered another issue with NCBI remap, which is that some of the alternates can be lost following remap. For instance:
Pre Liftover: 2 234668879 rs57191451 CAT C,CATAT,CATATAT 0/3
Post Liftover:2 233760233 rs57191451 CAT CATAT,CATATAT 0/3
We currently do not have a fix for this (NCBI have been contacted), so again you will have to check these sites carefully by hand.
PharmCAT expects vcf files to begin with "chr". If for some reason your reference does not have these they can be updated using the following unix command:
perl -pe '/^((?!^chr).)*$/ && s/^([^#])/chr$1/gsi' merged_output.vcf > merged_output.chrfixed.vcf
All positions are expected to be present in the file, even if they are 0/0 (reference) or ./. (missing). This is different from a typical vcf file which only usually contains variant sites.
Missing positions can be added by the following way using the GATK 'EMIT_ALL_SITES', either during the pipeline, or after using:
java -jar GenomeAnalysisTK.jar -T HaplotypeCaller \
-R grc38.reference.fasta -I input.bam -o output.vcf \
-L pgx.interval.file.list -glm BOTH --output_mode EMIT_ALL_SITES
with the following pgx.interval.file.list.
PharmCAT considers all variants in your file, even if they fail filters. If you use a tool like VSQR you can use the following Perl one-liner to change the genotype to 0/0 if necessary:
perl -pe '/^((?!PASS).)*$/ && /^((?!0\/0).)*$/ && /^((?!0\|0).)*$/ && s/[0-9][\/|][0-9]/0|0/' merged_output.vcf > merged_output_pass.vcf
Variant representation is an on-going problem in NGS (for instance see: https://macarthurlab.org/2014/04/28/converting-genetic-variants-to-their-minimal-representation/). The following vcf lines all contain the same variant:
chr7 117548628 . GTTTTTTTA GTTTTTTTA,GTTTTTA . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . GTT G . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G . . PASS CFTR:Reference,CFTR:5T GT 0/0
chr7 117548628 . G(T)7A G(T)5A . PASS CFTR:5T GT 0/0
Different NGS pipelines, the ways files are created (for instance if a multisample file is split), and post-processing software tools all lead to these differences. As PharmCAT is directly matching these strings to what is in the definition files this can cause problems. PharmCAT expects to find deletions as the ref="ATCT" alt=".", rather than ref="A" alt=".". Therefore you will need to replace all the deletions within in the file. If the ref is a single letter it means no variant was found, so it's safe to replace it with the appropriate nucleotide string. We are currently experimenting with using vcf-merge to do this by matching the PharmCAT positions (in the form of a VCF file) with the input VCF:
vcf-merge sample.vcf pgx.vcf.gz > output.vcf
pgx.vcf.gz can be obtained from PharmCAT using the following command:
java -cp pharmcat-*-all.jar org.pharmgkb.pharmcat.definition.ExtractPositions -o pgx.vcf
bgzip pgx.vcf && tabix -p vcf pgx.vcf.gz
bcftools sort -o pgx_sorted.vcf.gz -O z pgx.vcf.gz && tabix -p vcf pgx_sorted.vcf.gz
You will need to check the output file manually to make sure the correct substitutions have been made. Currently we are still exploring the diversity of possible values in the indels and repeat regions. If you see something unusual please message us.
There is currently an experimental docker script within the PharmCAT preprocessor that will check and prepare your grc38 vcf file by carrying out chromosome renaming, adding missing calls as reference (which we heavily advise you don't do in production), and indel representation. See the ReadMe.md in preprocessor folder for details. The prepare.sh would also be a good starting point for developing your own PharmCAT data preparation scripts.