Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Minimal/Inconsistent checking of variant order in vcf writing tools, particularly affecting SelectVariants #6443

Open
1 of 2 tasks
kachulis opened this issue Feb 6, 2020 · 2 comments

Comments

@kachulis
Copy link
Contributor

kachulis commented Feb 6, 2020

Bug Report

Affected tool(s) or class(es)

particularly SelectVariants, but really anything that writes out a vcf.

Affected version(s)

  • Latest public release version [version?]
  • Latest master branch as of [2/6/2020]

Description

There are minimal/inconsistent checks that variants added to vcfWriters are correctly ordered (an issue with htsjdk I think). This leads to an insidious bug, where a sorted vcf can be fed through SelectVariants, and depending on the flavor of output vcf, either crash, or succeed but output an incorrectly sorted vcf. The issue is that in some circumstances SelectVariants will trim alleles to their minimal representation, which can change the location of a variant record, and thus reorder them. However, SelectVariants does nothing to account for the potential order change. Since vcfWriter implementations in htsjdk seem to do minimal/inconsistent checks on the order of variants being added to them, this may write out an incorrectly sorted vcf, or throw an exception, depending on the flavor of vcfWriter.

Steps to reproduce

With attached (zipped because github) vcf, run
gatk SelectVariants -V test.input.vcf -sn SAMPLE_01 -O test.output.vcf
Tool will succeed, but output vcf will be incorrectly sorted. Somehow, this incorrectly sorted vcf will also be accompanied by an index! Though if you try to run IndexFeatureFile on the output vcf separately, it will fail.

run
gatk SelectVariants -V test.input.vcf -sn SAMPLE_01 -O test.output.vcf.gz
tool will throw exception w/ stack trace:

java.lang.IllegalArgumentException: Features added out of order: previous (TabixFeature{referenceIndex=0, start=17148456, end=17148456, featureStartFilePosition=2460, featureEndFilePosition=-1}) > next (TabixFeature{referenceIndex=0, start=17148447, end=17148457, featureStartFilePosition=2509, featureEndFilePosition=-1})
	at htsjdk.tribble.index.tabix.TabixIndexCreator.addFeature(TabixIndexCreator.java:89)
	at htsjdk.variant.variantcontext.writer.IndexingVariantContextWriter.add(IndexingVariantContextWriter.java:203)
	at htsjdk.variant.variantcontext.writer.VCFWriter.add(VCFWriter.java:242)
	at org.broadinstitute.hellbender.tools.walkers.variantutils.SelectVariants.apply(SelectVariants.java:620)
	at org.broadinstitute.hellbender.engine.VariantWalker.lambda$traverse$0(VariantWalker.java:104)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
	at java.util.Iterator.forEachRemaining(Iterator.java:116)
	at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
	at org.broadinstitute.hellbender.engine.VariantWalker.traverse(VariantWalker.java:102)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
	at org.broadinstitute.hellbender.Main.main(Main.java:292)

test.input.vcf.zip

Also, there don't seem to be any engine checks that the input vcf used to drive a variant walker is correctly sorted. That is, SelectVariants (for example) can be run on an input vcf which is not correctly sorted, without raising any sort of complaint. I suppose this is slightly less troubling than the output ordering issue, as it falls under "garbage in, garbage out".

@ldgauthier
Copy link
Contributor

@kachulis can you give an example where trimming changes the position? This seems to violate some of the assumptions we've been making for the BQ joint calling work.

@kachulis
Copy link
Contributor Author

kachulis commented Feb 7, 2020

@ldgauthier this happens when the alleles have shared bases on the left. For example
chr1 17148447 . GAAACAAAACA GAAACAAAATA
trims to
chr1 17148456 . C T
I think this most commonly arises when splitting alleles at multiallelic sites. For example the site above came from splitting the multiallelic site
chr1 17148447 . GAAACAAAACA GAAACAAAACAAAACAAAACA,GAAACAAAATA,G

@droazen droazen added this to the GATK-Priority-Backlog milestone Feb 24, 2020
@droazen droazen removed this from the GATK-Priority-Backlog milestone Jun 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants