Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error running Duphold, "expected AD field in snps VCF" #30

Open
johnemajor opened this issue Jan 11, 2020 · 12 comments
Open

Error running Duphold, "expected AD field in snps VCF" #30

johnemajor opened this issue Jan 11, 2020 · 12 comments

Comments

@johnemajor
Copy link

I'm running duphold v 0.2.1

The command I am running is:
~/install_crap/duphold/duphold -v ./results.vcf.gz -b /efs/WGS/data/WGS/ILMN_exptA/b37/KC_downsampledBAM_and_VCF/NA24385/40x/S7508/NA24385.40x.S7508.aligned.deduped.sort.bam -f /efs/WGS/data/reference/human/human_g1k_v37_modified.fasta/human_g1k_v37_modified.fasta -s ./x.vcf.gz -t 96 -o ./duphold.vcf

It spins for a while, then returns the error:
"expected AD field in snps VCF"

the thing is, my VCF files have the AD field present..... I created a small vcf to test this more directly, and here are the full contents of the x.vcf.gz

`##fileformat=VCFv4.3
##reference=human_g1k_v37_modified
##octopus=<version=0.6.3-beta_HEAD_5961a546,command="octopus --reference /home/kcibul/wgs_resources/data/reference/human/human_g1k_v37_modified.fasta
/human_g1k_v37_modified.fasta --reads NA24385.40x.S7508.aligned.deduped.sort.bam -t regions.bed --forest-file /home/kcibul/wgs_resources/data/referen
ce/forests/DC/germline.v0.7.0.forest -o NA24385.40x.S7508.octopus.tmp.vcf.gz --threads 192 -X 10000MB -B 38400 MB --sequence-error-model /home/kcibul
/wgs_resources/data/reference/octopus_err_models/novaseq.4a38e55.model --annotations AD ADP AF ARF BQ CRF DP FRF GC GQ MC MF MQ MQ0 QD QUAL SB STR_LE
NGTH STR_PERIOD --max-indel-errors 32 --duplicate-read-detection-policy AGGRESSIVE --max-haplotypes=400 --min-forest-quality=8",options="--allow-mark
ed-duplicates no --allow-octopus-duplicates no --allow-pileup-candidates-from-likely-misaligned-reads no --allow-qc-fails no --allow-reads-with-good-
decoy-supplementary-alignments no --allow-reads-with-good-unplaced-or-unlocalized-supplementary-alignments no --allow-secondary-alignments no --allow
-supplementary-alignments no --annotations[0] AD --annotations[1] ADP --annotations[2] AF --annotations[3] ARF --annotations[4] BQ --annotations[5] C
RF --annotations[6] DP --annotations[7] FRF --annotations[8] GC --annotations[9] GQ --annotations[10] MC --annotations[11] MF --annotations[12] MQ --
annotations[13] MQ0 --annotations[14] QD --annotations[15] QUAL --annotations[16] SB --annotations[17] STR_LENGTH --annotations[18] STR_PERIOD --asse
mble-all no --assembler-mask-base-quality 10 --backtrack-level NONE --bad-region-tolerance NORMAL --bamout-type MINI --caller population --consider-u
nmapped-reads no --contig-output-order REFERENCE_INDEX --contig-ploidies[0] Y=1 --contig-ploidies[1] chrY=1 --contig-ploidies[2] MT=1 --contig-ploidi
es[3] chrM=1 --denovo-filter-expression QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF \

0.5 | MP < 30 | MQ0 > 2 --denovo-indel-mutation-rate 1e-09 --denovo-snv-mutation-rate 1.3e-08 --denovos-only no --disable-adapter-masking no --disa
ble-assembly-candidate-generator no --disable-call-filtering no --disable-denovo-variant-discovery no --disable-downsampling no --disable-inactive-fl
ank-scoring no --disable-overlap-masking no --disable-pileup-candidate-generator no --disable-read-preprocessing no --disable-repeat-candidate-genera
tor no --dont-model-mapping-quality no --dont-protect-reference-haplotype no --downsample-above 1000 --downsample-target 500 --dropout-concentration
2 --duplicate-read-detection-policy AGGRESSIVE --extension-level NORMAL --fallback-kmer-gap 10 --fast no --filter-expression QUAL < 10 | MQ < 10 | MP
< 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1 --forest-file germline.v0.7.0.forest --good-base-quality 20 --haplotype-holdout-threshold 2500 --hap
lotype-overflow 200000 --ignore-unmapped-contigs no --indel-heterozygosity 0.0001 --keep-unfiltered-calls no --kmer-sizes[0] 10 --kmer-sizes[1] 15 --
kmer-sizes[2] 20 --lagging-level NORMAL --mask-3prime-shifted-soft-clipped-heads no --mask-inverted-soft-clipping no --mask-soft-clipped-bases no --m
ask-soft-clipped-boundary-bases 2 --max-assemble-region-overlap 200 --max-bubbles 30 --max-clones 3 --max-genotypes 5000 --max-haplotypes 400 --max-h
oldout-depth 20 --max-indel-errors 32 --max-joint-genotypes 1000000 --max-open-read-files 250 --max-phylogeny-size 3 --max-read-length 10000 --max-re
ference-cache-footprint 10GB --max-region-to-assemble 600 --max-somatic-haplotypes 2 --max-variant-size 2000 --max-vb-seeds 12 --min-bubble-score 2 -
-min-candidate-credible-vaf-probability 0.75 --min-clone-frequency 0.01 --min-credible-somatic-frequency 0.01 --min-denovo-posterior 3 --min-expected
-somatic-frequency 0.03 --min-forest-quality 8 --min-good-bases 20 --min-kmer-prune 2 --min-mapping-quality 20 --min-phase-score 10 --min-pileup-base
-quality 20 --min-protected-haplotype-posterior 100 --min-refcall-posterior 2 --min-somatic-posterior 0.5 --min-variant-posterior 1 --model-posterior
UnknownType(N7octopus7options20ModelPosteriorPolicyE) --no-adapter-contaminated-reads no --no-reads-with-decoy-supplementary-alignments no --no-read
s-with-distant-segments no --no-reads-with-unmapped-segments no --no-reads-with-unplaced-or-unlocalized-supplementary-alignments no --normal-contamin
ation-risk LOW --num-fallback-kmers 10 --one-based-indexing no --organism-ploidy 2 --output NA24385.40x.S7508.octopus.tmp.vcf.gz --read-linkage PAIRE
D --reads[0] NA24385.40x.S7508.aligned.deduped.sort.bam --refcall-block-merge-quality 10 --refcall-filter-expression QUAL < 2 | GQ < 20 | MQ < 10 | D
P < 10 | MF > 0.2 --reference human_g1k_v37_modified.fasta --regions-file regions.bed --resolve-symlinks no --sequence-error-model /home/kcibul/wgs_r
esources/data/reference/octopus_err_models/novaseq.4a38e55.model --sites-only no --snp-heterozygosity 0.001 --snp-heterozygosity-stdev 0.01 --somatic
-credible-mass 0.9 --somatic-filter-expression QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 |
FRF > 0.5 --somatic-indel-mutation-rate 1e-06 --somatic-snv-mutation-rate 0.0001 --somatics-only no --split-long-reads no --target-read-buffer-footp
rint 38.4kB --temp-directory-prefix octopus-temp --threads 192 --tumour-germline-concentration 1.5 --use-filtered-source-candidates no --use-independ
ent-genotype-priors no --use-preprocessed-reads-for-filtering no --use-same-read-profile-for-all-samples no --use-uniform-genotype-priors no --varian
t-discovery-mode ILLUMINA --very-fast no">
##INFO=<ID=RFQUAL_ALL,Number=1,Type=Float,Description="Combined quality score for call using product of sample RFQUAL">
##INFO=<ID=STR_PERIOD,Number=1,Type=String,Description="Period of overlapping STR">
##INFO=<ID=STR_LENGTH,Number=1,Type=String,Description="Length of overlapping STR">
##INFO=<ID=QD,Number=1,Type=String,Description="QUAL divided by DP">
##INFO=<ID=MQ0,Number=1,Type=String,Description="Number of reads overlapping the call with mapping quality zero">
##INFO=<ID=MQ,Number=1,Type=String,Description="Mean mapping quality of reads overlapping the call">
##INFO=<ID=GC,Number=1,Type=String,Description="GC bias of the reference in a window centred on the call">
##INFO=<ID=CRF,Number=1,Type=String,Description="Fraction of clipped reads covering the call">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Combined depth across samples">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position on CHROM">
##INFO=<ID=MP,Number=1,Type=Float,Description="Model posterior">
##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality (phred-scaled)">
##FORMAT=<ID=DP,Number=1,Type=String,Description="Number of read overlapping the call">
##FORMAT=<ID=MQ,Number=1,Type=Integer,Description="RMS mapping quality">
##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set">
##FORMAT=<ID=PQ,Number=1,Type=Integer,Description="Phasing quality">
##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">
##FORMAT=<ID=ADP,Number=1,Type=String,Description="Number of reads overlapping the position that could be assigned to an allele">
##FORMAT=<ID=AF,Number=1,Type=String,Description="Empirical minor allele frequency of ALT alleles (AD / ADP)">
##FORMAT=<ID=ARF,Number=1,Type=String,Description="Fraction of reads overlapping the call that cannot be assigned to a unique haplotype">
##FORMAT=<ID=BQ,Number=1,Type=String,Description="Median base quality of reads supporting each allele">
##FORMAT=<ID=FRF,Number=1,Type=String,Description="Fraction of reads filtered for calling">
##FORMAT=<ID=MC,Number=1,Type=String,Description="Number of mismatches at variant position in reads supporting variant haplotype">
##FORMAT=<ID=MF,Number=1,Type=String,Description="Fraction of reads with mismatches at variant position">
##FORMAT=<ID=SB,Number=1,Type=String,Description="Strand bias of reads based on haplotype support">
##FORMAT=<ID=RFQUAL,Number=1,Type=Float,Description="Empirical quality score from random forest classifier">
##FORMAT=<ID=FT,Number=1,Type=String,Description="Sample genotype filter indicating if this genotype was 'called'">
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RF8.6,Description="All filters passed">
##FILTER=<ID=RF,Description="Random Forest filtered">
##contig=<ID=1,length=249250621>
##contig=<ID=2,length=243199373>
##contig=<ID=3,length=198022430>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=4,length=191154276>
##contig=<ID=5,length=180915260>
##contig=<ID=6,length=171115067>
##contig=<ID=7,length=159138663>
##contig=<ID=8,length=146364022>
##contig=<ID=9,length=141213431>
##contig=<ID=10,length=135534747>
##contig=<ID=11,length=135006516>
##contig=<ID=12,length=133851895>
##contig=<ID=13,length=115169878>
##contig=<ID=14,length=107349540>
##contig=<ID=15,length=102531392>
##contig=<ID=16,length=90354753>
##contig=<ID=17,length=81195210>
##contig=<ID=18,length=78077248>
##contig=<ID=19,length=59128983>
##contig=<ID=20,length=63025520>
##contig=<ID=21,length=48129895>
##contig=<ID=22,length=51304566>
##contig=<ID=X,length=155270560>
##contig=<ID=Y,length=59373566>
##contig=<ID=MT,length=16569>
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA24385.40x.S7508
1 10583 . G A 201.31 PASS RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=1;A
N=2;DP=65;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583:100:16:76:0.211:0.074:37:0.198:0:0.000
:0.244:2.76:RF
1 10616 . CCGCCGTTGCAAAGGCGCGCCG C 356.94 RF;RF8.6 RFQUAL_ALL=4.51;STR_PERIOD=5;STR_LENGTH=12;QD=15.519;MQ0=25;MQ=30.110
;GC=0.710;CRF=0.222;AC=2;AN=2;DP=23;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:23:42:10583:100:81:81:1.00
0:0.580:.:0.716:47:0.580:.:4.51:RF
1 12783 . G A 1271 RF;RF8.6 RFQUAL_ALL=7.23;STR_PERIOD=7;STR_LENGTH=15;QD=18.420;MQ0=353;MQ=15.787;GC=0.588;CRF=0
.000;AC=2;AN=2;DP=69;NS=1;AD=150 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|1:110:69:24:12783:100:150:150:1.000:0.000:37:0.6
08:0:0.000:.:7.23:RF
`

I ran pyvcf on this VCF file, and the FORMAT AD field appears in it.
I ran duphold on VCF files generated by Dragen and Sentieon, they both return the same error.

Any advice for how to fix this? I am eager to apply duphold to a clinical product I am developing, but this roadblock has me blocked.

Thanks!
John Major

@brentp
Copy link
Owner

brentp commented Jan 11, 2020

hi, you can run duphold without a snp vcf and just find changes in depth in your SVs. your SNP vcf has:

##INFO=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

which does not follow spec. It should be:

##INFO=<ID=AD,Number=R,Type=Integer,Description="allele depths">

@johnemajor
Copy link
Author

My SNP VCFs actually have the AD field in the FORMAT section, not the INFO section. AD should be in the FORMAT section as it could be used for a multi-sample VCF.....

But, I tried hacking my snp VCF to have the INFO header you name above and moving the AD field to INFO, I get the same error.

@brentp
Copy link
Owner

brentp commented Jan 22, 2020

shoot. I mean FORMAT, not INFO. what does your VCF have for AD in FORMAT?

@brentp
Copy link
Owner

brentp commented Jan 22, 2020

oh, I see your AD is:

##FORMAT=<ID=AD,Number=1,Type=String,Description="Minor empirical alt allele depth">

should be Number=R, type=Integer and describe the depths for each allele.

@johnemajor
Copy link
Author

Actually, AD is specified as both an INFO field AND a FORMAT field. The INFO version is presumably the sum of all the FORMAT ones, although I don’t see why that’s particularly useful.

@johnemajor
Copy link
Author

But, yes, mine is in FORMAT. And it should appear as :
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="allele depths"> . ?

@brentp
Copy link
Owner

brentp commented Jan 22, 2020

the FORMAT is the one that duphold uses and yes, that is correct. However, you can't simply change the header, you'll have to adjust the values for every record as well (or use a VCF from another caller)

@johnemajor
Copy link
Author

johnemajor commented Jan 23, 2020

The variant FORMAT record AD fields contain integers.... how would they need to be adjusted?

1 10583 . G A 201.31 RF;RF8.6 RFQUAL_ALL=2.76;STR_PERIOD=0;STR_LENGTH=0;QD=3.097;MQ0=25;MQ=46.104;GC=0.710;CRF=0.111;AC=
1;AN=2;DP=65;NS=1 GT:GQ:DP:MQ:PS:PQ:AD:ADP:AF:ARF:BQ:FRF:MC:MF:SB:RFQUAL:FT 1|0:201:65:51:10583: 100:16:76:0.211:0.074:37:0.198:0:0.000:0.244:2.7
6:RF

The AD entry is a valid integer.....

@brentp
Copy link
Owner

brentp commented Jan 23, 2020

AD should have multiple values for each samples. for a bi-allelic variant, it should have 2 values, the first indicates the number of reads supporting the reference allele and the 2nd indicates the number of reads supporting the alternate.

I still suggest to run duphold without the snp vcf as it's not needed and doesn't add much for most cases.

@johnemajor
Copy link
Author

Well, it appears that just changing the header did the trick. Duphold is running now that I modified the snp.vcf to have
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="allele depths">

@brentp
Copy link
Owner

brentp commented Jan 23, 2020

it won't give you correct results.

@johnemajor
Copy link
Author

ok- will give it a shot
Thank You

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants