Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in HaplotypeCaller 'evidence provided is not in sample' #6586

Closed
gbggrant opened this issue May 6, 2020 · 7 comments · Fixed by #6593
Closed

Bug in HaplotypeCaller 'evidence provided is not in sample' #6586

gbggrant opened this issue May 6, 2020 · 7 comments · Fixed by #6593

Comments

@gbggrant
Copy link
Collaborator

gbggrant commented May 6, 2020

Instructions

In one of the six samples that the DSP pipelines team ('lantern') uses for scientific testing, found bug in GATK 4.1.7.0's HaplotypeCaller. 'java.lang.IllegalArgumentException: evidence provided is not in sample'. Full stack trace below.

This is found for Sample NA17-308, Shard 49.
https://cromwell.gotc-dev.broadinstitute.org/api/workflows/1/83938362-b9b5-49f3-a65d-715065d6eabd/metadata
Execution bucket is:
broad-gotc-dev-cromwell-execution (results will stay there for 30 days before being automatically cleaned up).


Bug Report

Affected tool(s) or class(es)

HaplotypeCaller

Affected version(s)

  • 4.1.7.0

Description

Stack trace:
java.lang.IllegalArgumentException: evidence provided is not in sample
at org.broadinstitute.hellbender.utils.genotyper.AlleleLikelihoods.lambda$removeEvidence$9(AlleleLikelihoods.java:1124)
at java.util.stream.ReferencePipeline$4$1.accept(ReferencePipeline.java:210)
at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:546)
at java.util.stream.AbstractPipeline.evaluateToArrayNode(AbstractPipeline.java:260)
at java.util.stream.IntPipeline.toArray(IntPipeline.java:504)
at org.broadinstitute.hellbender.utils.genotyper.AlleleLikelihoods.removeEvidence(AlleleLikelihoods.java:1128)
at org.broadinstitute.hellbender.utils.genotyper.AlleleLikelihoods.contaminationDownsampling(AlleleLikelihoods.java:315)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerGenotypingEngine.assignGenotypeLikelihoods(HaplotypeCallerGenotypingEngine.java:173)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCallerEngine.callRegion(HaplotypeCallerEngine.java:608)
at org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller.apply(HaplotypeCaller.java:210)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:200)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1048)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:163)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:206)
at org.broadinstitute.hellbender.Main.main(Main.java:292)

Steps to reproduce

Tell us how to reproduce this issue. If possible, include command lines that reproduce the problem. (The support team may follow up to ask you to upload data to reproduce the issue.)

Expected behavior

Tell us what should happen

Actual behavior

Tell us what happens instead


Feature request

Tool(s) or class(es) involved

Tool/class name(s), special parameters?

Description

Specify whether you want a modification of an existing behavior or addition of a new capability.
Provide examples, screenshots, where appropriate.


Documentation request

Tool(s) or class(es) involved

Tool/class name(s), parameters?

Description

Describe what needs to be added or modified.


@droazen droazen added this to the GATK-Triaged-Issues milestone May 6, 2020
@droazen
Copy link
Contributor

droazen commented May 6, 2020

@vruano and @davidbenjamin , it looks like we have another HaplotypeCaller bug in the latest release (4.1.7.0). It seems that this wasn't caught prior to release due to the wrong version of the GATK docker image being run during full-scale testing. Would you be able to look into this when you have a chance? Thanks!

@vruano , it appears that you may have last touched this code in commit d9fd22f, so possibly you have some insight into what might be going on here?

@ldgauthier
Copy link
Contributor

Just as a note, we included that sample because it reportedly has ~25% contamination. Not sure if the contamination downsampling comes into play at, but it is something that's distinctive about this bam.

@davidbenjamin
Copy link
Contributor

I may have a lead.

The error occurs here (no insight so far, just looking up the line from the stack trace):

final Object2IntMap<EVIDENCE> evidenceIndexes = evidenceIndexBySampleIndex(sampleIndex);
        final int[] indexesToRemove = evidences.stream().mapToInt(e -> {
            final int index = evidenceIndexes.getInt(e);
            if (index == MISSING_INDEX) {
                throw new IllegalArgumentException("evidence provided is not in sample");
            }

We get an error when evidenceIndexBySampleIndex(sampleIndex) yields a Map that for some reason doesn't contain a read that it should. So let's investigate evidenceIndexBySampleIndex().

This method returns the evidenceIndexBySampleIndex.get(sampleIndex) field if it is not null (ie uninitialized); otherwise it fills it and then returns it. The code for filling it seems fine, and it explicitly loops over every sample read, so it's hard to see that the error could come from there. It seems rather that the problem is in returning the cached value whenever it is not null.

The cached value of evidenceIndexBySampleIndex.get(sampleIndex) becomes invalid whenever reads are added or removed. However, you can check all the accesses of evidenceIndexBySampleIndex (there are only six) and verify that the class never accounts for this.

So, suppose that an AlleleLikelihoods object invokes evidenceIndexBySampleIndex(sampleIndex) more than once and adds or removes reads between these. The second call returns the cached map from the first call, which is bogus. Even if it doesn't explain this issue, it is a bug.

Now let's think about which public methods evidenceIndexBySampleIndex(sampleIndex) is called in and where this occurs in HaplotypeCaller:

  • addEvidence (in HC this happens only in the likelihoods for annotations, downstream of our issue, so this is not the culprit).
  • filterPoorlyModeledEvidence (this happens after Pair-HMM to the haplotype likelihoods, so not the culprit either)
  • contaminationDownsampling
  • retainEvidence (hmmm in HC readAlleleLikelihoods.retainEvidence(variantCallingRelevantOverlap::overlaps); occurs immediately before contaminationDownsampling)
  • indexOfEvidence (nothing stands out)

@davidbenjamin
Copy link
Contributor

davidbenjamin commented May 7, 2020

Now, it seems like calling contaminationDownsampling right after retainEvidence could cause problems if both methods remove reads.

However, one might correctly point out that although the cache invalidation I mentioned is not handled systematically, the method removeEvidenceByIndex does have some code to update the evidence by sample and the evidence index map. It's possible that this code is totally fine and that this lead is a dead end. However, the code looks like it could be simpler and it's tough to parse. For example, try to track the to variable, which determines the determination of the outer for loop:

                for (int etrIndex = 1, to = nextIndexToRemove, from = to + 1; to < newEvidenceCount; etrIndex++, from++) {
                    if (etrIndex < evidencesToRemove.length) {
                        nextIndexToRemove = evidencesToRemove[etrIndex];
                        evidenceIndex.remove(evidences.get(nextIndexToRemove));
                    } else {
                        nextIndexToRemove = oldEvidenceCount;
                    }
                    for (; from < nextIndexToRemove; from++) {
                        final EVIDENCE evidence = evidences.get(from);
                        evidences.set(to, evidence);
                        evidenceIndex.put(evidence, to++);
                    }
                }

@davidbenjamin
Copy link
Contributor

@gbggrant Could I have a path for the bam and a minimal command line to reproduce the error?

@gbggrant
Copy link
Collaborator Author

gbggrant commented May 8, 2020

@davidbenjamin
The bam is at:
gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/83938362-b9b5-49f3-a65d-715065d6eabd/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/caa978f3-a54e-4051-b673-e10610b734db/call-GatherBamFiles/NA17-308.bam

Here's the command line:

gatk --java-options "-Xms6000m -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" \
  HaplotypeCaller \
  -R /cromwell_root/gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta \
  -I gs://broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/83938362-b9b5-49f3-a65d-715065d6eabd/call-UnmappedBamToAlignedBam/UnmappedBamToAlignedBam/caa978f3-a54e-4051-b673-e10610b734db/call-GatherBamFiles/NA17-308.bam \
  -L /cromwell_root/broad-gotc-dev-cromwell-execution/ExomeGermlineSingleSample/83938362-b9b5-49f3-a65d-715065d6eabd/call-BamToGvcf/VariantCalling/0425b6b4-e7d1-4178-ac52-cca1170687b9/call-ScatterIntervalList/glob-cb4648beeaff920acb03de7603c06f98/9scattered.interval_list \
  -O NA17-308.g.vcf.gz \
  -contamination 0.3472946666666667 \
  -G StandardAnnotation -G StandardHCAnnotation -G AS_StandardAnnotation \
  -GQB 10 -GQB 20 -GQB 30 -GQB 40 -GQB 50 -GQB 60 -GQB 70 -GQB 80 -GQB 90 \
  -ERC GVCF

The metadata for the workflow is at:
https://cromwell.gotc-dev.broadinstitute.org/api/workflows/1/83938362-b9b5-49f3-a65d-715065d6eabd/metadata

Let me know if you need more information.

@davidbenjamin
Copy link
Contributor

Thanks @gbggrant! @droazen @ldgauthier I just pushed a branch that resolves the error by simplifying the evidence-to-index cache.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants